Accelerating Metabolomics with GPUs: A Leap Into a Faster Era

By Tornike Onoprishvili, Jui-Hung Yuan, Kamen Petrov, Vijay Ingalalli, Lila Khederlarian, Niklas Leuchtenmüller, Sona Chandra, Aurelien Duarte, Andreas Bender, Yoann Gloaguen

What Is Mass Spectrometry Metabolomics, Anyway?

In untargeted mass spectrometry metabolomics, our goal is to identify as many small molecules (metabolites) as possible from a complex biological sample, such as a plant extract. The term “untargeted” indicates that our experimental setup and instrument settings are optimized to capture signals from a vast array of molecules, rather than a predefined few. During an experiment, the mass spectrometer fragments every detectable molecule into smaller pieces and records the resulting signal. Each molecule is thus represented by its own mass spectrum, a characteristic fingerprint in which every peak corresponds to a different fragment. Because we do not yet know which molecules we are looking for, we want to make sense of every spectrum, not just those of a predefined set of compounds.

Annotated tandem MS spectrum of Epicatechin. Epicatechin is a natural product found in various fruits and known for its anti-inflammatory effects by inhibiting the NF-κB signaling pathway (Figure adapted from Wolf, S., Schmidt, S., Müller-Hannemann, M. et al. In silico fragmentation for computer assisted identification of metabolite mass spectra. BMC Bioinformatics 11, 148 (2010). https://doi.org/10.1186/1471-2105-11-148)

Unraveling the Mystery Behind Metabolite Fingerprints

While these spectral fingerprints can tell us a lot about the composition of a sample, we usually want to know exactly which molecule they belong to. The typical approach is similar to matching a human fingerprint or DNA sample against a database. If the fingerprint matches an entry, the compound is already known; if not, you might have discovered a new molecule. Even when there isn’t an exact match, comparing the fingerprint with others in the database can provide valuable clues about its structure—kind of like finding a relative in a DNA database for an unidentified individual.

However, although spectral matching has been a tried-and-true method for over a decade, the scoring functions it relies on are notoriously slow. The main reason is combinatorial: every query spectrum has to be scored against every possible library candidate, so the number of comparisons grows with the product of the two collection sizes. Large databases therefore often require powerful compute servers, which can be a significant bottleneck.

Cutting Down the Clock: Reducing Compute Time by 1,700x

The workhorse behind these comparisons is a scoring method known as Cosine similarity. There are different flavours of Cosine similarity—like “greedy Cosine” for database matching and “modified Cosine” for finding related compounds. Traditionally, these calculations have been performed on CPUs, which makes large-scale comparisons incredibly time-consuming.
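To make the scoring idea concrete, here is a minimal NumPy sketch of a greedy Cosine score between two spectra. It is an illustration only, not the SimMS or MatchMS code: the function name greedy_cosine, the default tolerance of 0.1, and the optional shift argument (a simple stand-in for the precursor mass difference used by modified Cosine) are all assumptions made for this example.

```python
import numpy as np


def greedy_cosine(mz_a, int_a, mz_b, int_b, tolerance=0.1, shift=0.0):
    """Greedy cosine score between two centroided spectra (illustrative sketch).

    Returns (score, number_of_matched_peak_pairs).
    """
    mz_a, int_a = np.asarray(mz_a, float), np.asarray(int_a, float)
    mz_b, int_b = np.asarray(mz_b, float), np.asarray(int_b, float)

    # Two peaks are candidates for pairing when their m/z values agree within
    # the tolerance, either directly or (if shift != 0) after the mass shift.
    delta = mz_a[:, None] - mz_b[None, :]
    candidate = np.abs(delta) <= tolerance
    if shift != 0.0:
        candidate |= np.abs(delta - shift) <= tolerance

    rows, cols = np.nonzero(candidate)
    products = int_a[rows] * int_b[cols]

    # Greedy assignment: take the strongest pairs first, use each peak once.
    used_a, used_b = set(), set()
    total = 0.0
    for k in np.argsort(-products, kind="stable"):
        i, j = int(rows[k]), int(cols[k])
        if i in used_a or j in used_b:
            continue
        used_a.add(i)
        used_b.add(j)
        total += float(products[k])

    # Normalise by the intensity vector lengths so the score lies in [0, 1].
    norm = float(np.sqrt(np.sum(int_a**2)) * np.sqrt(np.sum(int_b**2)))
    score = total / norm if norm > 0 else 0.0
    return score, len(used_a)


# Made-up example spectra: (m/z values, intensities).
query = ([100.0, 150.0, 200.0], [0.2, 1.0, 0.5])
reference = ([100.05, 150.02, 300.0], [0.3, 1.0, 0.4])
score, n_matches = greedy_cosine(*query, *reference)
print(f"score={score:.3f}, matched peaks={n_matches}")
```

A single call like this is cheap. The cost comes from repeating it for every query and library pair, which is the part that lends itself to running in parallel on a GPU.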

Enter GPUs. With the rapid advances in AI and the affordability of GPU hardware, we saw a perfect opportunity to rework this process. Our new implementation, SimMS, uses the exact same algorithm as the standard CPU-based Cosine similarity, but enables it to run on a GPU—achieving speeds up to 1,700 times faster (as benchmarked against the MatchMS implementation).

In practical terms, this means a workload that once took half a year on a CPU can now be completed in roughly 2.5 hours. Imagine being able to perform 1 trillion comparisons in just 2.5 hours! This dramatic speed boost not only cuts down processing time but also slashes costs, potentially reducing your cloud computing bill by up to 99.9% when using services like AWS.
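These figures can be checked with back-of-the-envelope arithmetic. The sketch below takes "half a year" as 182.5 days and, purely as an assumption for illustration, prices a GPU hour the same as a CPU hour (price_ratio = 1.0); real cloud prices differ and would change the saving. The helper name gpu_estimate is hypothetical.

```python
def gpu_estimate(cpu_hours, speedup, price_ratio=1.0):
    """Return (gpu_hours, fractional_cost_saving) for a given speedup.

    price_ratio is the GPU hourly price divided by the CPU hourly price
    (assumed to be 1.0 here, which is a simplification).
    """
    gpu_hours = cpu_hours / speedup
    cost_saving = 1.0 - price_ratio / speedup
    return gpu_hours, cost_saving


cpu_hours = 0.5 * 365 * 24          # "half a year" of CPU time, in hours
gpu_hours, saving = gpu_estimate(cpu_hours, speedup=1700)

comparisons = 1e12                  # 1 trillion pairwise comparisons
rate = comparisons / (gpu_hours * 3600)

print(f"GPU time: {gpu_hours:.2f} h")
print(f"Cost saving at equal hourly price: {saving:.2%}")
print(f"Implied throughput: {rate:.2e} comparisons per second")
```

Under these assumptions the estimate comes out at roughly 2.6 hours, in line with the 2.5-hour figure quoted above once rounding is allowed for, and the saving lands just above 99.9%.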

Speedup over MatchMS version of modified Cosine using different GPUs

Looking Ahead: How Speed Changes the Future of Metabolomics

What might seem like a technical tweak actually has the potential to transform the field of computational metabolomics. By speeding up our analyses dramatically, we can now perform routine, repository-scale comparisons and explore new methods that were once limited by speed and scale constraints. This acceleration not only makes research more efficient but also opens up exciting possibilities for discoveries, both at Pangea Bio and in other areas such as clinical diagnostics and environmental monitoring.

Integrating Cutting-Edge Technology into Pangea Bio's Discovery Pipeline

At Pangea Bio, our focus is on turning nature's chemical diversity into innovative treatments for neurological and neuropsychiatric disorders. A critical first step in our discovery pipeline is dereplication—quickly checking if a detected compound is already known. Thanks to our new GPU-accelerated analysis, we can perform this vital step in record time. This speed enables us to efficiently rule out known molecules and concentrate on uncovering truly novel bioactive compounds. Moreover, every new discovery enriches our database, making future searches even more effective. By integrating this advanced technology into our workflow, we're not only speeding up the process but also paving the way for smarter, more efficient exploration of nature’s chemical treasures.
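As a rough illustration of the dereplication step, the sketch below takes a matrix of similarity scores (one row per measured spectrum, one column per library entry) and flags a spectrum as known when its best library match clears a threshold. The function name dereplicate, the 0.7 threshold, the compound names, and the score values are all hypothetical and chosen only for this example; they do not describe Pangea Bio's actual pipeline.

```python
import numpy as np


def dereplicate(scores, library_names, min_score=0.7):
    """Return (query_index, best_library_name_or_None, best_score) per query.

    A query counts as "known" only if its best score reaches min_score
    (an illustrative threshold, not a recommended value).
    """
    scores = np.asarray(scores, dtype=float)
    best = scores.argmax(axis=1)                       # best library hit per query
    best_scores = scores[np.arange(scores.shape[0]), best]
    results = []
    for q, (j, s) in enumerate(zip(best, best_scores)):
        name = library_names[int(j)] if s >= min_score else None
        results.append((q, name, float(s)))
    return results


# Made-up scores for two measured spectra against a three-entry library.
library = ["epicatechin", "catechin", "quercetin"]
score_matrix = [
    [0.95, 0.10, 0.30],   # strong hit: likely a known compound
    [0.20, 0.40, 0.35],   # no convincing hit: candidate for follow-up
]
hits = dereplicate(score_matrix, library)
for query_index, name, best_score in hits:
    status = name if name is not None else "no library match"
    print(query_index, status, best_score)
```

Spectra without a library match are the ones worth a closer look, and the near misses among their scores can still hint at structurally related compounds.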

For those interested in the cutting edge of this technology, be sure to check out our latest research article just published in Bioinformatics at https://doi.org/10.1093/bioinformatics/btaf081.

Stay tuned as we continue to explore how these advances can unlock deeper insights into the chemical complexity of biological systems.
