Google wants to index your DNA

 

A new DNA search tool, akin to a Google for genetic material, shows great potential, according to its Swiss creators. In their study, they indexed 10% of the world's known DNA, RNA, and protein sequences, suggesting that the approach could feasibly scale to all biological sequence data.

The paper, posted on bioRxiv, describes MetaGraph, a computational tool developed by the team. MetaGraph organizes and compresses publicly accessible sequence data into a searchable index, much as internet search engines do with web pages and their content. The resulting indexes, available for download and through a web portal (metagraph.ethz.ch), let users search sequences comprising trillions of base pairs and billions of amino acids.

The researchers aimed to demonstrate that indexing complete sequencing repositories, such as NCBI’s Sequence Read Archive (SRA), for full-text search is practical. The primary hurdle is keeping pace with the rapidly growing volume of input data. To address it, they introduced MetaGraph, a scalable framework designed to efficiently index and analyze large collections of biological sequence data.

They have indexed various collections of real DNA and RNA sequences, including a significant portion of all publicly available whole-genome sequencing samples from the NCBI SRA. In particular, they indexed over 90% of all microbe, fungi, plant, human, and human gut metagenome samples from the SRA, along with a substantial part of the metazoan samples. Together, these alone make up 2.6 petabases in 1,903,327 read sets. In addition, they indexed a number of other diverse and biologically relevant data sets, from reference genomes to raw metagenomic reads.


The main challenge in sequence alignment is balancing search time against sensitivity. More generally, there is a trade-off between representation size and query efficiency. MetaGraph addresses this by representing input data as collections of k-mer sets (sets of length-k subsequences) stored in various succinct data structures, offering practically relevant trade-offs between index size and query performance. This flexibility allows MetaGraph to run at different scales and on different hardware, from laptops to research compute clusters to distributed cloud environments.
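
To make the idea of k-mer sets concrete, here is a minimal Python sketch of a labeled k-mer index. It is not MetaGraph's implementation: a plain dictionary stands in for the succinct data structures, and the function names, sample labels, and sequences are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every length-k substring (k-mer) of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(samples, k):
    """Map each k-mer to the set of sample labels that contain it.

    A plain dict is an assumption made for readability; it is far less
    compact than the succinct structures the article mentions.
    """
    index = defaultdict(set)
    for label, seq in samples.items():
        for kmer in kmers(seq, k):
            index[kmer].add(label)
    return index


def query(index, seq, k):
    """Return, per sample label, the fraction of the query's k-mers found in it."""
    query_kmers = list(kmers(seq, k))
    hits = defaultdict(int)
    for kmer in query_kmers:
        for label in index.get(kmer, ()):
            hits[label] += 1
    return {label: n / len(query_kmers) for label, n in hits.items()}


# Toy data (hypothetical): two short "samples" and one query sequence.
samples = {
    "sample_A": "ACGTACGTGA",
    "sample_B": "TTGACGTGAC",
}
index = build_index(samples, k=4)
result = query(index, "TACGTG", k=4)
no_match = query(index, "GGGGGG", k=4)
```

In this toy case, every k-mer of the query occurs in sample_A, while only two of its three k-mers occur in sample_B, so sample_A scores higher. A smaller k finds more matches but is less specific, which is one simple view of the sensitivity trade-off described above.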

The researchers are confident that the approaches presented in this work can be integrated into the infrastructure of large data repositories, such as ENA and NCBI, to make all sequence data stored there searchable, thereby providing essentially a “Google for DNA”.


For all other data sets, each sample was either transferred and decompressed from NCBI’s mirror on the Google Cloud Platform or, if not available there, downloaded from the ENA onto one of the team's cloud-compute servers. Each sample was then subjected to k-mer counting with KMC3 to generate the full k-mer spectrum.
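
As a rough illustration of the k-mer counting step, the sketch below tallies k-mers across a handful of reads in pure Python. It is not KMC and does not reproduce its interface or performance. Two assumptions are made here that the article does not state: k-mers are counted in canonical form (the lexicographically smaller of a k-mer and its reverse complement), and the "k-mer spectrum" is taken to mean every distinct k-mer together with its count. The reads are made up for the example.

```python
from collections import Counter

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_VALID_BASES = set("ACGT")


def canonical(kmer):
    """Return the lexicographically smaller of kmer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k):
    """Count canonical k-mers over all reads, skipping k-mers with non-ACGT bases."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= _VALID_BASES:  # e.g. skip k-mers containing N
                counts[canonical(kmer)] += 1
    return counts


# Toy reads (hypothetical).
reads = ["ACGTA", "TACGT"]
counts = count_kmers(reads, k=3)
ambiguous = count_kmers(["ACNGT"], k=3)
```

Because ACG and CGT are reverse complements of each other, they collapse into a single canonical entry, as do GTA and TAC, so the two reads above yield only two distinct k-mers.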