PERSPECTIVES
Envisioning a Post-Assembly Era
Note: This was originally conceived as a companion to the paper “Human Genome Assembly in 100 Minutes.” At the time, in 2019, some of these points were controversial; now they seem almost passé.
An indexing-based approach to assembly reduces the algorithmic complexity from quadratic scaling, in which every read is compared to every other read, to one in which each read is represented by the indexed locations of its ‘minimizer k-mers’. A very good assembly can then be constructed simply by computing on the indices rather than on the reads themselves.
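To make the idea concrete, here is a minimal sketch of a (w, k)-minimizer index in Python. This is illustrative only, not the paper's SHIMMER implementation; the function names, parameters, and toy reads are invented for the example. The useful property is that two reads sharing an exact substring of length at least w + k - 1 are guaranteed to share a minimizer, so overlap candidates surface from the index without all-vs-all comparison.

```python
from collections import defaultdict

def minimizers(seq, k=6, w=4):
    """Return the (position, k-mer) (w, k)-minimizers of seq: the
    lexicographically smallest k-mer in each window of w consecutive k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    mins = set()
    for i in range(len(kmers) - w + 1):
        j = min(range(i, i + w), key=lambda p: kmers[p])
        mins.add((j, kmers[j]))
    return mins

def index_reads(reads, k=6, w=4):
    """Map each minimizer k-mer to the set of read ids containing it."""
    idx = defaultdict(set)
    for rid, seq in enumerate(reads):
        for _, kmer in minimizers(seq, k, w):
            idx[kmer].add(rid)
    return idx

# Two toy reads overlapping by "CGTGGATTA" (9 bases = w + k - 1):
reads = ["AAAACGTGGATTA", "CGTGGATTACCCC"]
idx = index_reads(reads)
shared = {kmer: rids for kmer, rids in idx.items() if len(rids) > 1}
# shared == {"CGTGGA": {0, 1}}: the overlap is detected from the index alone
```

Production systems hash k-mers instead of comparing them lexicographically (to avoid biasing toward poly-A sequences) and layer sparser hierarchical levels on top, but the quadratic-to-index reduction is the same.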
The title is a bit provocative, since the assembly problem is not yet fully solved, but it is clear that assembly should become rapid and reliable for perfect long reads. This raises the question: what is the assembly for? As the technology advances, publishing a de novo genome as a first look will give way to more sophisticated applications, since more will be possible and more will be expected.
In the case of pure DNA storage (e.g., of synthesized signals), the assembly problem is unnecessary, since one can simply barcode the storage strands and obtain random access. Thus, it seems intuitive that there is something inherently biological about the assays that require a SHIMMER-based approach.
Thus, it makes sense to bring more of the sample preparation and biology into the problem, and to focus primarily on the tertiary analysis that de novo sequencing steps support. Initially, the focus might be on new capabilities, such as the following ideas:
Pan-human genomic analysis. This implies that we eventually measure a significant, and ultimately sufficient, fraction of the entire human germline, such that we expect a typical new sample to match some inferable combination of previously measured genomes.
Comparative, black-box analysis of phenotype. Rather than following the chain of information specified by the central dogma (e.g., by looking at expression transcripts) in the context of exploratory or validated mechanisms from the world of proteomics, we elevate the conversation to phenotype. Specifically, we conduct a de novo differential comparison of two samples that exhibit different phenotypes but approximately the same developmental and temporal stimuli. This comparison would then be the start of a systems biology analysis to investigate which of the relatively specific areas of discrepancy could be a causative factor.
Full, non-incremental genomic editing. It is typically unclear what mixed effects different edits to the genome cause. With this technology, one could make larger leaps of change and confirm the independent or combined effects of the changes, particularly in areas such as agriculture where cultural safeguards are in place.
By Asif Khalak and Jason Chin