Testament of a Bio-Informatician

Yanis Bencheikh
6 min read · Oct 28, 2024


A Bright Green Plant Growing From a Tree Trunk

The hope carried by genome language models (GLMs) was that of a day when we could understand the vocabulary of nature, the functional and regulatory grammar of creation whose carriers, genes, we have been studying for the past century. Such foundational computational models, we believed, could provide enhanced predictive power thanks to their extensive pre-training. However, even with vast state-of-the-art infrastructures theoretically allowing the capture of genome-wide patterns, specialized expert models that lack pre-training still outperform GLMs on various tasks. For decades, expert models have accompanied healthcare professionals and biomedical researchers in anamnesis and in the statistical analysis of patient data. Personalized medicine has been drastically facilitated by machine learning, and it has reciprocally motivated the development of its facilitator, notably by inspiring the birth of seminal tools like the Hyena mechanism, whose impact reaches far beyond genomics.

Servers

With millions of years of evolution behind it, an unfathomable story of 3.2 billion characters written by innumerable epigenetic factors and environmental forces we barely understand, our history, I fear, can presently only be grasped by those who possess the necessary capital; not people whom we have elected, nor people we can trust for certain will forever read benevolently. Where we learn to do so, where we protect the heritage of our ecology, our university barely has the power to equal the most financially powerful of our economy. Would it direct its attention to the right words if it had the money? I believe so, because I believe in our student body, and I believe that in the difficulty of survival we are, humanly, more likely to sacrifice what we shouldn’t than upon arrival at what is greatest: the serenity of knowing we aspire to the same good, the good of all. I am made by my university only insofar as I am made by the good it teaches me. I have hope that the immense beauty of our universe, and the ideas of proximity we share with the same air, will make us learn, care and heal together in an indissociable spirit of compassion.

Sorry for digressing.

A Woman Looking at a Sculpture

What I was getting at is motivation and challenge. We want to offer the highest quality of personalized, data-informed and trauma-informed care. We know that machine learning promises a useful gateway to offering it, but it requires tremendous computational power applied to tremendous amounts of data, most likely in a federated learning effort, to ensure maximal generalizability to our diverse communities. Our research community has been actively developing models capable of handling ever longer genetic and natural-language corpora. Amongst them are those relying mainly on convolution, and transformers which may or may not replace their attention mechanism with it. From the classic Enformer to the promising Hyena architecture, which elegantly reduces cost using fast Fourier transforms, each approach has been a stepping stone towards understanding ourselves, quite literally.

Much like in a novel, a single word, or k-mer of nucleotides, may be powerfully linked in context or regulation to another passage written far before or after itself. In the multidimensional space of meaning our brain generates, and in the depths of a convolutional neural network’s layers, we can learn to understand the intricate torsion of chromatin which brings enhancer to enhanced gene. To keep track of genomic context thousands of nucleotides long, hybrid models were built that fuse the increasingly large receptive field of deep CNN units to long short-term memory architectures. Bidirectional in the case of DanQ, the LSTM, with its characteristic modulatory gates and recurrent processing of information, set the stage for innovative capture of textual context, which quickly took the form of the transformer.
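
To make the CNN side of these hybrids concrete, here is a minimal NumPy sketch of how genomic models of the DanQ family receive their input: each nucleotide becomes a one-hot row, and a first convolutional filter acts as a learned motif scanner over the sequence. The function names and the toy motif are illustrative, not taken from any model’s actual code.

```python
import numpy as np

# One-hot encode a DNA sequence into a (length, 4) matrix -- the standard
# input format for convolutional genomic models.
BASES = "ACGT"

def one_hot(seq: str) -> np.ndarray:
    """Map each nucleotide to a one-hot row; unknown bases (e.g. N) stay all-zero."""
    idx = {b: i for i, b in enumerate(BASES)}
    out = np.zeros((len(seq), 4), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        if base in idx:
            out[pos, idx[base]] = 1.0
    return out

def conv1d_scores(x: np.ndarray, filt: np.ndarray) -> np.ndarray:
    """Valid-mode 1-D convolution of an (L, 4) track with a (w, 4) filter:
    each output is the score of one length-w window of nucleotides."""
    w = filt.shape[0]
    return np.array([(x[i : i + w] * filt).sum() for i in range(len(x) - w + 1)])

x = one_hot("ACGTN")
print(x.shape)  # (5, 4); the final row (N) is all zeros
# A filter that exactly matches the motif "ACG" scores highest at position 0.
print(conv1d_scores(x, one_hot("ACG")))
```

Stacking such convolutions widens the receptive field layer by layer; DanQ then feeds the resulting feature track to a bidirectional LSTM to relate distant windows.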

Much more scalable than their CNN-based predecessors, transformer architectures such as BERT and GPT benefit from extensive pre-training (masked language modelling and next-token prediction, respectively), a capacity for transfer learning and, most notably, the self-attention mechanism. From there, cumulative innovations such as DNABERT were born from pre-training on the entirety of the human genome, leading the way for fine-tuning on tasks of promoter, transcription-factor-binding and even splice-site prediction. The quadratic cost of the revolutionary attention layers, however, limited the number of already small k-mer tokens they could process in sequence. Then came the nucleotide transformers (the NT series), which, with nucleotide-level resolution, took many folds more input tokens to process through millions of parameters learned from now multi-species genome pre-training. In multiple splice-site and chromatin-accessibility prediction tasks, NT-Multispecies outperforms its competitors with unparalleled modeling of regulatory networks, but it remains bound to the lingering computational cost of scaling.
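
The k-mer tokenization DNABERT relies on, and the quadratic pressure it puts on attention, can be sketched in a few lines. This is a simplified illustration (the real model’s vocabulary and special tokens are omitted): a sequence of length L yields L − k + 1 overlapping tokens, and full self-attention must then compute a score for every ordered pair of them.

```python
def kmer_tokenize(seq: str, k: int = 6) -> list[str]:
    """Slide a window of size k one base at a time, yielding overlapping k-mers,
    in the style of DNABERT's tokenizer (simplified: no special tokens)."""
    return [seq[i : i + k] for i in range(len(seq) - k + 1)]

tokens = kmer_tokenize("ACGTACGT", k=3)
print(tokens)  # ['ACG', 'CGT', 'GTA', 'TAC', 'ACG', 'CGT']

# Full self-attention scores every token against every other: n tokens
# require n * n scores, which is what makes long genomes so costly.
n = len(tokens)
print(n * n)  # 36 pairwise attention scores for this tiny example
```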

Using clever factorization to linearize the computations enabling attention, models such as the Linformer have successfully reduced time complexity to O(2nd²), whereby the classic quadratic cost in sequence length n becomes dominated by the often much smaller embedding size d; a more than welcome change that promises the processing of whole genomes to one day be tractable. Attention turned out not to be all we needed; perhaps its quadratic cost was only foreshadowing for the creators of the Hyena mechanism, a sub-quadratic, O(L log₂ L), paradigm-breaking alternative for capturing sequence context and dependencies spanning record lengths (it started with CNNs for a reason, I suppose). Leveraging chains of implicit long convolutions, with costs reduced by mathematical manipulations allowing the use of FFTs, the HyenaDNA foundation model can process input sequences of up to a million nucleotides. Demonstrating state-of-the-art benchmarking performances, it reached even greater heights, while preserving scalability and computational accessibility, through few-shot learning.
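
Both cost reductions rest on small algebraic identities, which a minimal NumPy sketch can verify. First, the factorization behind linearized attention (softmax omitted, as kernelized variants do): by associativity, (QKᵀ)V equals Q(KᵀV), but the left grouping costs O(n²d) while the right costs O(nd²). Second, the FFT trick behind Hyena’s long convolutions: zero-padding makes the pointwise product of spectra equal to a direct O(L²) convolution, computed in O(L log L).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 4  # tiny sequence length and embedding size, for illustration only
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

# Factorized ("linear") attention: regrouping the same product changes the
# cost from O(n^2 d) to O(n d^2) -- a win whenever d is much smaller than n.
left = (Q @ K.T) @ V     # materializes the n x n score matrix
right = Q @ (K.T @ V)    # never builds anything larger than d x d
assert np.allclose(left, right)

# Hyena-style long convolution via FFT: with enough zero-padding, the
# inverse transform of the spectra's product equals the direct convolution.
x = rng.standard_normal(n)  # input track
h = rng.standard_normal(n)  # (implicitly parameterized) long filter
L = 2 * n
fft_conv = np.fft.irfft(np.fft.rfft(x, L) * np.fft.rfft(h, L), L)[: 2 * n - 1]
direct = np.convolve(x, h)  # full linear convolution, length 2n - 1
assert np.allclose(fft_conv, direct)
print("factorized attention and FFT convolution match their direct forms")
```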

A Hyena Gazing Directly at the Camera Photographing It

As single-nucleotide-resolution models become ever more scalable and predictively accurate, our understanding of the enhancer, promoter and transcription factor interactions which finely orchestrate our homeostasis becomes clearer. Each note struck to form the harmony of our being, each stretch of heterochromatin rhythmically opening to allow the expression of the right gene at the right time, can now reach our awareness. In parallel to awe-filled observation, we can now let our healing ambition be informed by the computational generation of genetic language: summarize a long gene to its most probable, “purest” and shortest form of functional expression; learn from sequencing-derived differential-expression data to create a repertoire of cell-type-specific regulatory sequences; classify raw sequences by biotype or even species, solely from attended embeddings.

To know how to orchestrate regulation is to know how to heal in the event of natural imbalance.

Someone Teaching Piano to a Child

Sometimes, we may have done everything in our locus of control to bend our prognostic trajectory in the right direction, yet still not enough to transcend a current illness. Personalized medicine assisted by machine learning can help us generate the most probably useful and potent medication needed on our unique journey of healing. Nature, through spans of time we cannot grasp, has created innumerable organisms, teachers we can learn from, many of which have yet to be discovered. We simply have to allow ourselves to accord them attention, compassion and the power to be heard. On our journey, when storms of uncertainty rage upon the waves of a self-realization we traverse too rapidly and without appreciation, we may hear them reminding us, lonely, that in our fear it is best to breathe and search for help. Perhaps it will become evident that the solution to our despair was waiting in front of us all along. There is only sadness for those whom you love when you navigate by the sole strength of your arms, when you knew you could have lit the engines.

A Tall Rock Formation in the Middle of the Sea

