The Living We Form
Each of us contributes to society in ways that only we can see at the highest resolution, while only the Universe sees the rippling impacts of our actions. Ants, much like the individual cells of an organism, are born into this world with infinite potential. It is only through unified growth in their colony that they come to define it, through experiences which subtly and precisely shape their epigenome. Between a seedling and a tree that reaches the sky, it is not only the stochasticity of inheritance that drives fate; it can simply come down to who received more light. Do you know the unfathomable order and chance that gave rise to your body, its breath, or even its capacity to absorb water and make use of it? Evolutionary timescales do not make sense to us, and that is normal.
Still, we are here, our sudden appearance bringing self-awareness to a world that is only now beginning to write its history, reflecting on what to do about itself. Perhaps it is rarity that makes it a privilege, the infinitesimally small probability that we breathe the same air, our hearts ceaselessly beating from an intergenerational chain of selfless nourishment. We may connect, retrospectively, to the single organism from which we descended; the first molecule whose energy, in indescribable ancestral conformation, started the cycle of replication. If you ask yourself why we seek to understand how each cell of our body contributes to life or disease, ask yourself why governments seek to understand each and every one of their citizens: to have control over something we may never totally understand. Even if we remain forever wondering in awe, the preciousness and fragility of life requires us to lead it and examine it as one, with compassionate curiosity.
Each cell in our bodies contributes to the homeostasis of our inner ecosystem. Single-cell RNA sequencing provides one of the clearest views into cellular individuality by capturing each cell’s gene expression profile, or transcriptome. This method reveals which genes are active in any given cell; sample many cells from an organism at different stages of its phenotypic evolution and, thanks to differential expression analysis, you can begin to understand the orchestration of upregulation and downregulation so essential to producing the right proteins at the right time. To observe this phenomenon at even greater resolution, one may also catch heterochromatin opening into euchromatin by running an ATAC-seq assay at the cellular level; such methods have, for instance, allowed the discovery of pioneer transcription factors, which have the capacity to bind regulatory regions even when the DNA is tightly wound around histones. Since their expression activity is influenced by various enzymatic protein complexes, cells may also have their transcriptome analyzed jointly with their surface proteome using CITE-seq; I suppose such technologies may have been essential in uncovering the mechanism of infection of pathogens like the coronavirus, which exploits cell-surface proteins to enter cells and deliver its viral RNA. But of course, most researchers also want to know where all of this takes place within a given tissue, organ, and living being, in order to perform personalized and precise therapeutic interventions. Technologies like seqFISH+ allow spatially resolved quantification of a given cell’s transcriptomic activity alongside that of all its neighbours, which in unison give rise to the macroscopic phenotype that prompted microscopic investigation in the first place.
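As a concrete illustration, here is a minimal sketch of how such a differential expression analysis might look with the scanpy toolkit; the file path and the "stage" grouping column are hypothetical placeholders, and the preprocessing choices are only one reasonable default.

```python
import scanpy as sc

# Load a hypothetical single-cell count matrix (cells x genes) in AnnData format.
adata = sc.read_h5ad("cells.h5ad")  # placeholder path

# Standard preprocessing: normalize library sizes, then log-transform.
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# Rank genes differentially expressed between annotated groups,
# e.g. developmental stages stored in adata.obs["stage"].
sc.tl.rank_genes_groups(adata, groupby="stage", method="wilcoxon")

# Inspect the top up- and down-regulated genes per stage.
print(sc.get.rank_genes_groups_df(adata, group=None).head())
```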
To understand the behavior of billions of individuals is presently infeasible. Even at the cellular level, our current sequencers hardly scale to the entirety of the cells that compose our body; some researchers estimate that there are nearly as many neurons in your head as there are stars in the Milky Way. The resulting data is colossal, and even as we incrementally generate it, its analysis is riddled with technical biases, batch effects chief among them, which may divert our algorithms from capturing biologically relevant signal on sequencing platforms that are ever-evolving and by no means perfect. Imagine creating a machine that characterizes each grain of sand in a dune, attempting to understand how each pushes the others, reflects light, absorbs scorching heat and releases it in the cold night. You are bound to miss something, stuck with a resulting level of sparsity in your data; dropout events are exactly this. Our current technology can only sample a finite amount of RNA from a given cell, and many of the genes we wish to sequence may be lowly expressed; they are unlikely to be sampled at all, leading us to believe a gene is silent when in fact it is not. Nanopore and Illumina sequencers carry inherent error rates of their own, and that is before even considering sample quality, limited by the naturally rapid degradation of ribonucleic acids. Adding spatial and proteomic modalities makes the platform even more prone to capturing noise.
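To make the dropout idea concrete, here is a small simulation, a sketch only, with made-up parameters: true expression is drawn from an over-dispersed count model, then thinned by a low capture rate, and many truly expressed genes come out as zeros that merely look like silence.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes = 1000, 200

# Hypothetical "true" mean expression per gene; most genes are lowly expressed.
true_means = rng.gamma(shape=0.5, scale=2.0, size=n_genes)

# True counts follow a negative binomial (over-dispersed) model per cell.
dispersion = 0.1
p = dispersion / (dispersion + true_means)
true_counts = rng.negative_binomial(dispersion, p, size=(n_cells, n_genes))

# Technical dropout: each molecule is captured with low probability,
# so many truly expressed genes are observed as zero.
capture_rate = 0.1
observed = rng.binomial(true_counts, capture_rate)

false_silence = ((true_counts > 0) & (observed == 0)).mean()
print(f"Fraction of measurements that look silent but are not: {false_silence:.2%}")
```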
But let’s say your data perfectly captures the activity of neuronal cells in a brain biopsy from an individual afflicted with Alzheimer’s; which machine learning model do you use to make sense of it? I do not see a near future, even at the current growth rate of compendia, where foundation models of transcriptomic activity are created and sufficiently generalizable to provide biologically relevant conclusions without excessive fine-tuning and domain-specific input. I say this because our primary goal is to understand the individual cellular interactions and expression programs that lead to phenotype; such an ambitious project seems to fit intuitively within the perspective of a model trained to holistically capture the characteristic behavior of each of our cell types. Unfortunately, state-of-the-art computational modelling has often relied on hardly interpretable architectures and hardly explainable predictions; the black box upon which we will now shed light. Let us start with the bigger picture.
An autoencoder is a two-phase neural architecture which first encodes high-dimensional data by compressing it into a lower-dimensional representation called the latent space. This representation contains the distilled, information-rich patterns necessary to reconstruct the original data using a second neural network called the decoder. Much like generative models which can increase the sharpness or inherent resolution of an otherwise noisy picture using residual networks or even diffusion models, autoencoders can be trained to impute missing values in sparse single-cell sequencing data. Deep count autoencoders and variational inference autoencoders can both help researchers make use of noisy and sparse platform readouts by first learning to distill patterns of expression and then being trained to accurately reconstruct them, using zero-inflated negative binomial regression or an ELBO-based probabilistic framework, respectively. The SAUCIE architecture, based on a deep autoencoder, performs denoising and imputation of dropout events with the added benefit of correcting batch effects in large-scale datasets likely containing technical replicates; with clustering visualizations generated from the intrinsic dimensionality reduction occurring in the information bottleneck of the architecture, SAUCIE provides a single framework capable of multiple tasks essential to understanding single-cell sequencing data.
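As a rough sketch of the idea, not any of the published architectures verbatim, the following PyTorch snippet wires an encoder, a bottleneck, and a decoder, trained to reconstruct expression profiles from a corrupted input; the layer sizes, corruption rate, and simple mean-squared-error loss (where DCA or scVI would use a ZINB or ELBO objective) are assumptions chosen for brevity.

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """Minimal autoencoder: genes -> latent bottleneck -> genes."""

    def __init__(self, n_genes: int, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_genes, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),          # information bottleneck
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, n_genes),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def train_step(model, optimizer, x, dropout_rate=0.2):
    """Corrupt the input (mimicking dropout), reconstruct the clean profile."""
    mask = (torch.rand_like(x) > dropout_rate).float()
    reconstruction = model(x * mask)
    loss = nn.functional.mse_loss(reconstruction, x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage sketch with random data standing in for log-normalized expression.
n_cells, n_genes = 512, 2000
x = torch.rand(n_cells, n_genes)
model = DenoisingAutoencoder(n_genes)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(5):
    print(train_step(model, optimizer, x))
```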
Batch effects are often mitigated by dedicated methods such as MNN, which finds pairs of mutual nearest neighbours across different batches; those batches may be so dominated by technical differences that a traditional algorithm like PCA fails to regroup data points meaningfully by cell type. Thanks to this algorithm, data points which share biological similarity are brought closer together into a cluster, which is simultaneously pushed away from biologically dissimilar data points of the same batch. Canonical Correlation Analysis attempts to mitigate this same batch effect by performing dimensionality reduction across samples. With millions of samples to process, the deep learning architectures we have discussed so far benefit immensely from the subsampling (i.e. mini-batch) optimization that stochastic gradient descent provides; the scalability of their training relies on it, as each epoch trained on a full cell population sampling would otherwise be so computationally expensive that reaching an optimum would take absurdly long.
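A bare-bones sketch of the mutual nearest neighbours idea, without the correction vectors and smoothing of the full MNN method, might look like the following; the neighbourhood size and the use of random embeddings in place of PCA-reduced expression are assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mutual_nearest_neighbours(batch_a, batch_b, k=20):
    """Return index pairs (i, j) where cell i in batch_a and cell j in batch_b
    are each among the other's k nearest neighbours."""
    nn_in_b = NearestNeighbors(n_neighbors=k).fit(batch_b)
    nn_in_a = NearestNeighbors(n_neighbors=k).fit(batch_a)
    neighbours_in_b = nn_in_b.kneighbors(batch_a, return_distance=False)
    neighbours_in_a = nn_in_a.kneighbors(batch_b, return_distance=False)

    pairs = []
    for i, js in enumerate(neighbours_in_b):
        for j in js:
            if i in neighbours_in_a[j]:  # mutuality check
                pairs.append((i, j))
    return pairs

# Usage sketch: two batches of embeddings separated by a systematic shift.
rng = np.random.default_rng(0)
batch_a = rng.normal(size=(300, 50))
batch_b = rng.normal(size=(400, 50)) + 0.5
print(len(mutual_nearest_neighbours(batch_a, batch_b)))
```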
In order to interpret the results of clustering, to understand what functional characteristic binds cellular data points together, methods such as non-negative matrix factorization (NMF) and topic modelling have been borrowed to uncover the functional and cellular identity of biological patterns. By constraining learned transcriptomic patterns to be purely non-negative, much as NMF decomposes facial images into additive parts (in contrast with the signed components of Eigenfaces), the latent factors or cellular groupings generated by NMF can be directly interpreted as sets of genetically distinctive co-expressions, which may help researchers precisely identify what cellular function a grouping represents. The same principle is also leveraged by Embedded Topic Models, whereby cells with similar expression profiles are clustered together around a latent topic characteristic of a functional role, as described by the co-expressed genes that define it (e.g. a topic for Alzheimer’s may be characterized by an expression profile in which APP and APOE are prevalent).
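A minimal sketch of this interpretability with scikit-learn’s NMF follows; the number of factors and the random count matrix standing in for real expression data are assumptions.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# Non-negative count-like matrix: rows are cells, columns are genes.
counts = rng.poisson(lam=1.0, size=(500, 1000)).astype(float)

# Factorize into cell loadings (W) and gene programs (H), both non-negative.
model = NMF(n_components=10, init="nndsvda", max_iter=500)
W = model.fit_transform(counts)   # cells x factors
H = model.components_             # factors x genes

# Each factor's top-weighted genes form a directly readable co-expression program.
top_genes_per_factor = np.argsort(H, axis=1)[:, ::-1][:, :20]
print(top_genes_per_factor[0])    # indices of the 20 genes defining factor 0
```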
A goal of clinical interest would be to develop multimodal models capable of integrating ATAC-seq, ChIP-seq, RNA-seq, and any other informative omic modality for billions of sampled cells. Such foundation models could operate under a transfer learning paradigm, progressively integrating knowledge through fine-tuning. Promising candidates may use contrastive learning, whereby transformer-generated embeddings of gene expression and, say, chromatin accessibility are aligned at the cellular level, producing clusters that are grouped by biological function rather than by modality, thereby avoiding batch effects.
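For illustration, one contrastive alignment step between two per-cell embeddings might look like the following CLIP-style sketch; the projection sizes, temperature, and the choice of treating a cell’s two modalities as the positive pair are assumptions on my part, and random features stand in for transformer embeddings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityProjector(nn.Module):
    """Projects a modality-specific embedding into a shared space."""
    def __init__(self, in_dim: int, shared_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, shared_dim))
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def contrastive_loss(rna_emb, atac_emb, temperature=0.07):
    """InfoNCE-style loss: each cell's RNA embedding should match its own
    ATAC embedding more closely than any other cell's."""
    logits = rna_emb @ atac_emb.T / temperature     # cells x cells similarity
    targets = torch.arange(rna_emb.size(0))         # positives on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

# Usage sketch: 64 cells, each with an RNA and a chromatin-accessibility vector.
rna = ModalityProjector(in_dim=2000)(torch.rand(64, 2000))
atac = ModalityProjector(in_dim=5000)(torch.rand(64, 5000))
print(contrastive_loss(rna, atac).item())
```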