ETH Zurich Unveils New DNA Search Engine

ETH Zurich scientists have developed MetaGraph, a pioneering DNA search engine likened to “Google for DNA.” This innovation promises to drastically accelerate genetic research by enabling rapid and comprehensive searches of vast DNA and RNA sequence databases.

Researchers at ETH Zurich have unveiled MetaGraph, a tool designed to make searches through vast databases of DNA and RNA sequences fast and efficient.

The method, described in the journal Nature, could accelerate the identification of rare hereditary diseases and of specific mutations in tumor cells.

MetaGraph operates similarly to an internet search engine, allowing researchers to input a sequence of interest and rapidly locate where it has appeared in global databases.

“It’s a kind of Google for DNA,” Gunnar Rätsch, a professor in the Department of Computer Science and a member of the Biomedical Informatics Group at ETH Zurich, said in a news release.

The tool searches the raw data of all stored sequences directly, so researchers no longer need to download extensive datasets, a step that was time-consuming and resource-intensive.

Next-generation sequencing has proved its importance in recent years, notably by enabling the rapid decoding and monitoring of the SARS-CoV-2 genome during the COVID-19 pandemic.

However, the sheer volume of data – approximately 100 petabytes stored in databases like the American Sequence Read Archive (SRA) and the European Nucleotide Archive (ENA) – presented a substantial challenge for researchers.

MetaGraph addresses this challenge directly. It can compress data by a factor of 300 while maintaining the integrity of the information.

“Mathematically speaking, it is a huge matrix with millions of columns and trillions of rows,” Rätsch added.
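One way to picture such a matrix is a toy index in which each row is a short subsequence (a k-mer) and each column is a sample. The Python sketch below is an illustration only, not MetaGraph's actual data structure or interface. It assumes that rows correspond to k-mers and columns to samples, and the names `build_index` and `query`, the k-mer length of 4, and the sample sequences are all invented for this example.

```python
from collections import defaultdict

K = 4  # toy k-mer length (assumption; real indexes use much longer k-mers)


def kmers(seq, k=K):
    """Yield every overlapping substring of length k."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(samples, k=K):
    """Map each k-mer (a 'row') to the set of sample names (the 'columns') containing it."""
    index = defaultdict(set)
    for name, seq in samples.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


def query(index, seq, k=K, min_fraction=1.0):
    """Return names of samples containing at least min_fraction of the query's k-mers."""
    query_kmers = list(kmers(seq, k))
    if not query_kmers:
        return []
    counts = defaultdict(int)
    for kmer in query_kmers:
        # .get avoids creating empty rows for k-mers that were never indexed
        for name in index.get(kmer, ()):
            counts[name] += 1
    needed = min_fraction * len(query_kmers)
    return sorted(name for name, c in counts.items() if c >= needed)


# Hypothetical sample data, invented for illustration.
samples = {
    "sample_A": "ACGTACGTGA",
    "sample_B": "TTGACGTGAC",
    "sample_C": "GGGGCCCCAA",
}
index = build_index(samples)
hits = query(index, "ACGTGA")
print(hits)
```

Here the query is answered from the index alone, without rescanning the stored sequences, which is the basic idea behind searching without downloading the underlying datasets. A production system would store the same row-and-column relationship in a heavily compressed form rather than in Python sets.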

The new search engine not only simplifies and speeds up the process but also does so at a low cost. Larger queries with MetaGraph cost no more than $0.74 per megabase.
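To put the quoted price ceiling in perspective, the short Python sketch below converts a query length into an upper bound on cost. It assumes that cost scales linearly with query length, which the article does not state, and the query sizes are hypothetical examples.

```python
MAX_COST_PER_MEGABASE_USD = 0.74  # upper bound quoted in the article
BASES_PER_MEGABASE = 1_000_000


def max_query_cost_usd(num_bases):
    """Upper bound on query cost, assuming cost scales linearly with query length."""
    return num_bases / BASES_PER_MEGABASE * MAX_COST_PER_MEGABASE_USD


# Hypothetical query sizes chosen for illustration.
examples = [
    ("5-megabase query", 5_000_000),
    ("100-megabase query", 100_000_000),
]
for label, n in examples:
    print(f"{label}: at most ${max_query_cost_usd(n):.2f}")
```

Under that assumption, a 5-megabase query, roughly the size of a typical bacterial genome, would cost at most about $3.70.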

This affordability, coupled with the tool’s precision and efficiency, could significantly boost research on little-known pathogens and emerging diseases. It could also advance research on antibiotic resistance by helping scientists find resistance genes and beneficial bacteriophages in existing databases.

“We are pushing the limits of what is possible in order to keep the data sets as compact as possible without losing necessary information,” added André Kahles, a senior scientist in the Department of Computer Science and a member of the Biomedical Informatics Group.

First presented in 2020 and continuously improved since then, MetaGraph is already available for queries and has indexed nearly half of global sequence datasets. The researchers aim to include the remaining data by the end of the year.

Because MetaGraph is open source, it could also be of interest to pharmaceutical companies and, in the future, possibly even to private individuals.

Reflecting on the tool’s future applications, Kahles added, “In the early days, even Google didn’t know exactly what a search engine was good for. If the rapid development in DNA sequencing continues, it may become commonplace to identify your balcony plants more precisely.”

Source: ETH Zurich