Cell Count Threshold Selection

Before variants of interest can be selected, a threshold must be chosen for the number of sample cells that each potential variant of interest needs. To get a rough idea of the cell count vs. variant count distribution (i.e. how many variants have a given number of cells), I first plotted the distribution of the number of sample cells for each variant. For this and for future analysis, I generated a "variant count file" from the data in DINO_ViT_genotypes_PCA12_chsep_ImageNet3channel_scaled_bothreps_053124.csv using fisseqtools.feature_selection dump_barcode_count.
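As a rough illustration of what a "variant count file" holds, here is a minimal sketch that counts cells per variant label and then tallies how many variants share each cell count. It is not the actual dump_barcode_count implementation. The column name `barcode`, both helper functions, and the tiny inline CSV are assumptions made for the example.

```python
import csv
import io
from collections import Counter


def count_cells_per_variant(csv_file, variant_column="barcode"):
    """Count how many rows (cells) carry each variant label.

    `variant_column` is a hypothetical column name, not taken from the real CSV.
    """
    reader = csv.DictReader(csv_file)
    return Counter(row[variant_column] for row in reader)


def variants_per_cell_count(cell_counts):
    """Map each cell count to the number of variants with exactly that count."""
    return Counter(cell_counts.values())


# Tiny made-up example standing in for the real embeddings CSV.
example_csv = io.StringIO(
    "barcode,feature_0\n"
    "AAA,0.1\n"
    "AAA,0.2\n"
    "CCC,0.3\n"
    "GGG,0.4\n"
    "GGG,0.5\n"
)
cell_counts = count_cells_per_variant(example_csv)
histogram = variants_per_cell_count(cell_counts)
```

Here `cell_counts` plays the role of the variant count file, and `histogram` is the distribution that gets plotted.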

[Figure: Variant Count Graph]

As expected, the cell counts are roughly exponentially distributed. Next, to select the actual variant cell count threshold, I plotted the total variant count and the total cell count over potential cell count thresholds. These graphs were generated using fisseqtools.feature_selection graph_cum_cell_variant_count.

[Figure: Variant Count Graph (two plots)]

A table of selected potential thresholds is also available below. This table was also generated using fisseqtools.feature_selection graph_cum_cell_variant_count.

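| Cell Count Threshold | Number of Variants | Number of Cells |
|---|---|---|
| 2361 | 1 | 2361 |
| 1440 | 21 | 35990 |
| 1283 | 42 | 64520 |
| 1159 | 64 | 90931 |
| 1058 | 87 | 116416 |
| 1004 | 108 | 138030 |
| 952 | 130 | 159396 |
| 911 | 155 | 182730 |
| 866 | 184 | 208434 |
| 829 | 216 | 235426 |
| 799 | 242 | 256534 |
| 768 | 272 | 279941 |
| 741 | 305 | 304846 |
| 712 | 340 | 330274 |
| 689 | 380 | 358324 |
| 666 | 420 | 385449 |
| 644 | 460 | 411601 |
| 621 | 513 | 445045 |
| 598 | 560 | 473760 |
| 576 | 626 | 512412 |
| 556 | 695 | 551384 |
| 535 | 765 | 589488 |
| 515 | 849 | 633556 |
| 494 | 950 | 684432 |
| 474 | 1054 | 734804 |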
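The table reads as cumulative counts: for each threshold, the number of variants with at least that many cells and the total number of cells those variants contribute. A minimal sketch of that computation follows, assuming this reading is right. It is not the graph_cum_cell_variant_count code itself, and `cumulative_counts` and the example counts are made up for illustration.

```python
def cumulative_counts(cell_counts, thresholds):
    """For each threshold, count the variants with at least that many cells
    and sum the cells those variants contribute."""
    rows = []
    for threshold in thresholds:
        kept = [n for n in cell_counts if n >= threshold]
        rows.append((threshold, len(kept), sum(kept)))
    return rows


# Made-up per-variant cell counts, not the real data.
example_counts = [2361, 1500, 1440, 900, 450, 30]
table = cumulative_counts(example_counts, thresholds=[2361, 1440, 500])
for threshold, n_variants, n_cells in table:
    print(threshold, n_variants, n_cells)
```

Lowering the threshold can only add variants and cells, which matches the way both columns grow down the table.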

From the table above, a cell count threshold of 500 seems like a reasonable choice; it keeps roughly 900 variants (between the 849 variants at a threshold of 515 and the 950 at 494). Unfortunately, the cell embeddings contain 1536 total features (4 channels x 384 features), about three times the minimum number of cells per variant at this threshold, so it is not possible to select a threshold that allows training a great classifier for each potential variant of interest. However, an algorithm like random forest or logistic regression should be able to select out the important features.
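To make the feature count concern concrete, the arithmetic can be written out as a short sketch. The numbers come from the paragraph above; the variable names are only for illustration.

```python
n_channels = 4
features_per_channel = 384
n_features = n_channels * features_per_channel  # 1536 features per cell

threshold = 500  # minimum cells per variant under the proposed threshold

# Cells available per feature for a variant sitting right at the threshold.
samples_per_feature = threshold / n_features
print(n_features, round(samples_per_feature, 3))
```

A variant right at the threshold has well under one cell per feature, which is why feature selection matters here.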

Repeating the Analysis, but with Genotype this Time (Oops)

It turns out the barcodes don't correspond one-to-one to genotypes as I initially thought, whoops. I therefore made some slight modifications to fisseqtools.feature_selection and repeated the previous analysis, this time counting cells per genotype. I also filtered out any wildtype and synonymous genotypes from the analysis this time around.
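The change amounts to grouping cells by genotype instead of by barcode, and skipping wildtype and synonymous genotypes. A minimal sketch follows, not the modified fisseqtools code. The tuple layout, the class labels in `EXCLUDED_CLASSES`, and the example genotypes are all hypothetical.

```python
from collections import Counter

# Hypothetical labels; the real dataset may encode these differently.
EXCLUDED_CLASSES = {"wildtype", "synonymous"}


def count_cells_per_genotype(cells):
    """Count cells per genotype, skipping wildtype and synonymous genotypes.

    `cells` is an iterable of (barcode, genotype, variant_class) tuples,
    one tuple per cell.
    """
    counts = Counter()
    for _barcode, genotype, variant_class in cells:
        if variant_class in EXCLUDED_CLASSES:
            continue
        counts[genotype] += 1
    return counts


# Made-up cells: two different barcodes share the genotype "A10G".
example_cells = [
    ("BC1", "A10G", "missense"),
    ("BC2", "A10G", "missense"),
    ("BC3", "WT", "wildtype"),
    ("BC4", "L5L", "synonymous"),
    ("BC5", "R7*", "nonsense"),
]
genotype_counts = count_cells_per_genotype(example_cells)
```

Because several barcodes can map to the same genotype, per-genotype counts are at least as large as per-barcode counts, which fits the higher thresholds in the corrected table.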

[Figure: Corrected Variant Count Graph (three plots)]

| Cell Count Threshold | Number of Variants | Number of Cells |
|---|---|---|
| 6207 | 1 | 6207 |
| 3316 | 26 | 98747 |
| 2847 | 52 | 177338 |
| 2615 | 83 | 262752 |
| 2467 | 110 | 331035 |
| 2328 | 137 | 395527 |
| 2267 | 165 | 459930 |
| 2182 | 196 | 528687 |
| 2107 | 226 | 593029 |
| 2014 | 256 | 655130 |
| 1936 | 285 | 712106 |
| 1868 | 319 | 776747 |
| 1801 | 352 | 836951 |
| 1729 | 382 | 889836 |
| 1670 | 418 | 950931 |
| 1614 | 446 | 996831 |
| 1562 | 489 | 1065073 |
| 1514 | 519 | 1111266 |
| 1463 | 558 | 1169295 |
| 1422 | 593 | 1219704 |
| 1368 | 627 | 1266912 |
| 1320 | 660 | 1311228 |
| 1276 | 704 | 1368266 |
| 1233 | 742 | 1415759 |
| 1196 | 779 | 1460715 |

In addition to the counts above, there are also 390283 wildtype samples in the dataset, far more than any single non-synonymous variant has. From the table above, a cell count threshold of 2000 seems like a reasonable choice; it keeps roughly 256 variants (the count at a threshold of 2014).
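The size of the wildtype imbalance can be put in numbers with a short sketch. It assumes the first table row (6207 cells, 1 variant) is the largest non-synonymous variant; the variable names are only for illustration.

```python
n_wildtype = 390283     # wildtype samples reported above
largest_variant = 6207  # first table row, assumed to be the largest variant
threshold = 2000        # proposed minimum cells per variant

ratio_to_largest = n_wildtype / largest_variant
ratio_to_threshold = n_wildtype / threshold
print(f"{ratio_to_largest:.1f}x the largest variant")
print(f"{ratio_to_threshold:.1f}x a variant at the threshold")
```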