MIT AI Model Learns Yeast DNA Language to Cut Drug Costs

MIT chemical engineers trained a large language model to learn how an industrial yeast reads DNA, then used it to design genes that make protein drugs more efficiently. The approach could help cut the time and cost of bringing new biologic medicines to patients.

A new artificial intelligence model that reads DNA like a language could help make protein-based drugs and vaccines faster and cheaper to produce.

MIT chemical engineers have adapted the same kind of large language models that power chatbots to study the genetic code of an industrial yeast widely used to manufacture medicines. By learning the yeast’s preferred patterns of DNA, the model can suggest better genetic recipes for making valuable proteins, from human growth hormone to cancer-fighting antibodies.

In lab tests, those AI-designed DNA sequences helped yeast cells produce more of the target protein than sequences generated by leading commercial tools for five of six therapeutic proteins tested, the researchers report in a paper published in the Proceedings of the National Academy of Sciences.

For drug makers, that kind of boost could translate into shorter development timelines and lower manufacturing costs for biologics — complex medicines made by living cells that are often among the most expensive treatments on the market.

The goal is to bring more predictability to a process that is still surprisingly manual, according to senior author J. Christopher Love, the Raymond A. and Helen E. St. Laurent Professor of Chemical Engineering at MIT.

“Today, those steps are all done by very laborious experimental tasks,” Love, who is also a member of the Koch Institute for Integrative Cancer Research and faculty co-director of the MIT Initiative for New Manufacturing, said in a news release. “We have been looking at the question of where could we take some of the concepts that are emerging in machine learning and apply them to make different aspects of the process more reliable and simpler to predict.”

Learning yeast’s genetic “syntax”

Industrial yeasts such as Komagataella phaffii and Saccharomyces cerevisiae are the workhorses of the biopharmaceutical industry. They help produce billions of dollars’ worth of protein drugs and vaccines every year, including insulin, hepatitis B vaccines and monoclonal antibodies.

To turn yeast into a miniature factory for a new protein drug, engineers insert a gene encoding that protein into the yeast’s genome and then fine-tune the cells’ growth and production conditions. For biologic drugs, this development phase can account for a significant share of the overall cost of bringing a product to market.

A key design decision is how to write the DNA sequence for the gene. Proteins are built from 20 amino acids, but DNA spells them with three-letter “codons,” and 61 of the 64 possible codons encode amino acids. That means most amino acids can be spelled several different ways in DNA.

Different organisms favor different codons. Traditional codon optimization tools usually pick the most common codons in the host organism, on the theory that cells are better equipped to use them. But that simple strategy can backfire. If a cell keeps seeing the same codon for a particular amino acid, it can run short on the matching transfer RNA molecules needed to assemble proteins, slowing production.
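The “most common codon” strategy the article describes can be sketched in a few lines. This is a minimal illustration, not any particular commercial tool: the codon frequencies below are made-up placeholders, where real tools use measured codon-usage tables for the host organism.

```python
# Sketch of the naive "most common codon" strategy. Frequencies are
# illustrative placeholders, not real K. phaffii codon-usage data.

CODON_OPTIONS = {
    # amino acid (one-letter code) -> {codon: relative frequency in the host}
    "L": {"TTG": 0.33, "CTG": 0.16, "TTA": 0.15,
          "CTT": 0.15, "CTC": 0.11, "CTA": 0.10},
    "S": {"TCT": 0.29, "TCC": 0.20, "TCA": 0.19,
          "AGT": 0.16, "AGC": 0.09, "TCG": 0.07},
    "M": {"ATG": 1.00},  # methionine has only one codon
}

def naive_optimize(protein: str) -> str:
    """Pick the single most frequent codon for every amino acid."""
    return "".join(max(CODON_OPTIONS[aa], key=CODON_OPTIONS[aa].get)
                   for aa in protein)

print(naive_optimize("MLS"))  # -> ATGTTGTCT: every L becomes TTG, every S becomes TCT
```

The weakness is visible in the output: every leucine gets the same codon, which is exactly the pattern that can exhaust the matching transfer RNAs and slow production, and why a context-aware model can do better.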

The MIT team wanted a more nuanced approach that could capture the full context of how codons are arranged in real genes.

They turned to an encoder-decoder large language model, a type of AI that normally learns patterns in text. Instead of feeding it sentences, they trained it on the amino acid sequences and matching DNA sequences for roughly 5,000 proteins that K. phaffii naturally produces, using a public database from the National Center for Biotechnology Information.

“The model learns the syntax or the language of how these codons are used,” Love added. “It takes into account how codons are placed next to each other, and also the long-distance relationships between them.”

Once trained, the model could take the amino acid sequence of a desired protein and propose a DNA sequence for K. phaffii that should produce it efficiently.
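The framing above is the same one used in machine translation: amino acids are the “source language” and codons the “target language.” The sketch below shows only that data representation with stand-in encode/decode helpers; the actual MIT model is an encoder-decoder network trained on native K. phaffii genes, and its architecture is not reproduced here.

```python
# Hedged sketch of the seq-to-seq framing: a 20-token source vocabulary
# (amino acids) and a 64-token target vocabulary (codons). A real model
# would autoregressively predict one codon token per amino acid; here we
# only demonstrate the one-to-one token/codon mapping.

AMINO_VOCAB = list("ACDEFGHIKLMNPQRSTVWY")  # 20 standard amino acids
BASES = "ACGT"
CODON_VOCAB = [a + b + c for a in BASES for b in BASES for c in BASES]  # 64 codons

def encode_protein(protein: str) -> list[int]:
    """Turn an amino acid string into source-token ids for the encoder."""
    return [AMINO_VOCAB.index(aa) for aa in protein]

def decode_codons(token_ids: list[int]) -> str:
    """Turn decoder output token ids back into a DNA string."""
    return "".join(CODON_VOCAB[t] for t in token_ids)

src = encode_protein("MKT")  # three source tokens for a three-residue peptide
dna = decode_codons([CODON_VOCAB.index(c) for c in ("ATG", "AAG", "ACT")])
print(len(src), dna)  # -> 3 ATGAAGACT
```

Because the decoder sees the codons it has already emitted, it can account for neighboring and long-range codon context, which is what the quote above means by learning the “syntax” of codon usage.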

Beating commercial tools in head-to-head tests

To see how well their AI system worked, the researchers asked it to design codon-optimized genes for six different proteins, including human growth hormone, human serum albumin and trastuzumab, a monoclonal antibody used to treat cancer.

They also generated optimized DNA sequences for the same proteins using four commercially available codon optimization tools that represent different strategies for choosing codons.

“We made sure to cover a variety of different philosophies of doing codon optimization and benchmarked them against our approach,” said lead author Harini Narayanan, a former MIT postdoctoral researcher.

The team then inserted each version of each gene into K. phaffii cells and measured how much of the target protein the yeast produced. For five of the six proteins, the sequences from the MIT model led to the highest yields. For the remaining protein, the model’s design came in second.

“We’ve experimentally compared these approaches and showed that our approach outperforms the others,” Narayanan added.

Beyond the performance gains, Love emphasized the potential impact on how quickly new protein drugs can move from concept to production.

“Having predictive tools that consistently work well is really important to help shorten the time from having an idea to getting it into production. Taking away uncertainty ultimately saves time and money,” he said.

Discovering hidden biological rules

K. phaffii, formerly known as Pichia pastoris, is already used to make dozens of commercial products, including medicines and food ingredients such as hemoglobin. That made it a natural starting point for the MIT team.

But the researchers also wanted to know whether their approach could generalize to other species. They trained similar models on genetic data from humans, cows and other organisms. Each model produced different codon predictions, suggesting that species-specific models are needed to get the best results.

When the team probed how the yeast model was making its decisions, they found that it had picked up on real biological principles that were never explicitly programmed into it.

For example, the model learned to avoid certain repeated DNA elements that can interfere with gene expression. It also appeared to group amino acids based on chemical traits such as how they interact with water, reflecting underlying biophysical rules of protein structure.

“Not only was it learning this language, but it was also contextualizing it through aspects of biophysical and biochemical features, which gives us additional confidence that it is learning something that’s actually meaningful and not simply an optimization of the task that we gave it,” Love added.

Opening the toolbox

Researchers in Love’s lab have already started using the new model to design genes for proteins they want K. phaffii to produce. They have also released the code so other scientists can adapt it for their own work with K. phaffii or train similar models for different organisms.

In the long run, tools like this could become part of a broader AI-assisted pipeline for biologics manufacturing, helping scientists move from a protein idea on paper to a robust production process with fewer trial-and-error experiments.

Source: Massachusetts Institute of Technology