
With AI, researchers predict the location of virtually any protein within a human cell
Researchers from MIT, Harvard University, and the Broad Institute of MIT and Harvard have unveiled an AI-powered computational method capable of predicting the location of virtually any protein within a human cell. The advance, detailed in a paper published in Nature Methods, could deepen our understanding of cellular function and accelerate the diagnosis and treatment of a range of diseases.
Proteins are the workhorses of the cell, but their proper function is heavily dependent on their exact subcellular location. When proteins are misplaced, they can contribute to severe conditions like Alzheimer’s, cystic fibrosis, and cancer. The sheer complexity of the human proteome, with approximately 70,000 different proteins and their variants in a single cell, has made manual identification of their locations an immensely costly and time-consuming endeavor. Traditional methods allow scientists to test only a handful of proteins per experiment.
Existing large datasets, such as the Human Protein Atlas, which meticulously catalogs over 13,000 proteins across more than 40 cell lines, have only scratched the surface, exploring roughly 0.25 percent of all possible protein-cell line pairings. Recognizing this monumental challenge, the researchers developed a novel computational approach, dubbed PUPS (Prediction of Unseen Proteins’ Subcellular Location), designed to efficiently navigate the remaining uncharted territory.
What sets PUPS apart is its unprecedented ability to predict the location of any protein in any human cell line, even those that have never been previously tested. Furthermore, unlike many AI-based methods that provide an averaged estimate across cell types, PUPS localizes a protein at the single-cell level. This granular precision could be crucial for pinpointing protein locations in specific disease cells, such as cancer cells after treatment.
“You could do these protein-localization experiments on a computer without having to touch any lab bench, hopefully saving yourself months of effort,” explains Yitong Tseo, a graduate student in MIT’s Computational and Systems Biology program and co-lead author of the research. “While you would still need to verify the prediction, this technique could act like an initial screening of what to test for experimentally.”
PUPS operates through a sophisticated two-part model. The first component utilizes a protein language model that captures the localization-determining properties of a protein, including its 3D structure derived from its amino acid sequence. The second part incorporates an image inpainting model, a type of computer vision model that analyzes three stained images of a cell (for the nucleus, microtubules, and endoplasmic reticulum) to glean vital information about the cell’s state, type, and individual characteristics.
These two models collaborate, joining their representations to predict the protein’s subcellular placement. The output is an intuitive highlighted image of the cell, indicating the model’s predicted protein location. This innovation holds immense potential for helping researchers and clinicians more efficiently diagnose diseases, identify potential drug targets, and deepen biological understanding of how protein localization underpins complex cellular processes.
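To make the two-part design concrete, here is a minimal, illustrative sketch in PyTorch of how a sequence branch and an image branch might be fused into a per-pixel localization map. This is not the authors' implementation: every class name, layer choice, and dimension below is an assumption made for illustration.

```python
# A minimal sketch (not the authors' code) of PUPS's two-part design:
# a sequence branch standing in for the protein language model, an image
# branch standing in for the inpainting model's view of the three stained
# channels (nucleus, microtubules, endoplasmic reticulum), and a decoder
# that fuses the two representations into a per-pixel localization map.
import torch
import torch.nn as nn


class ProteinSequenceEncoder(nn.Module):
    """Stand-in for a protein language model: embeds amino acids and
    pools a small transformer's outputs into one sequence-level vector."""

    def __init__(self, vocab_size: int = 21, dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len) integer-coded amino acids
        return self.encoder(self.embed(tokens)).mean(dim=1)  # (batch, dim)


class CellImageEncoder(nn.Module):
    """Stand-in for the image branch: a small CNN over the three stained
    channels that summarizes the individual cell's state."""

    def __init__(self, dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, 3, H, W) -> (batch, dim, H/8, W/8)
        return self.conv(images)


class LocalizationModel(nn.Module):
    """Fuses the two representations and decodes a 'highlighted image':
    one predicted protein-localization channel per input cell."""

    def __init__(self, dim: int = 128):
        super().__init__()
        self.seq_encoder = ProteinSequenceEncoder(dim=dim)
        self.img_encoder = CellImageEncoder(dim=dim)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(2 * dim, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, tokens: torch.Tensor, images: torch.Tensor) -> torch.Tensor:
        seq_repr = self.seq_encoder(tokens)            # (B, dim)
        img_repr = self.img_encoder(images)            # (B, dim, h, w)
        # Broadcast the sequence vector across the image grid and fuse.
        b, _, h, w = img_repr.shape
        seq_map = seq_repr[:, :, None, None].expand(b, -1, h, w)
        fused = torch.cat([img_repr, seq_map], dim=1)
        return self.decoder(fused)                     # (B, 1, H, W)
```

In a sketch like this, the single-channel output map plays the role of the highlighted image described above; it could be thresholded or overlaid on the input stains for inspection.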
Xinyi Zhang, a co-lead author and graduate student in EECS and the Eric and Wendy Schmidt Center at the Broad Institute, highlights PUPS’s unique generalization capability: “Most other methods usually require you to have a stain of the protein first, so you’ve already seen it in your training data. Our approach is unique in that it can generalize across proteins and cell lines at the same time.” This means PUPS can even account for localization changes caused by unique protein mutations not present in existing databases.
The researchers employed clever training strategies to ensure robust performance. Alongside the primary image-inpainting task, they assigned the model a secondary task: explicitly naming the cellular compartment where the protein localizes. This dual objective strengthened the model's overall understanding of cell compartments. Training PUPS simultaneously on many proteins and cell lines also helped it build a general understanding of protein localization patterns within cell images, even letting it discern how different parts of a protein's sequence contribute to its overall position.
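As a rough picture of that dual-objective training, the sketch below adds an auxiliary compartment-classification loss on top of the primary inpainting loss, reusing the hypothetical LocalizationModel from the earlier sketch. The number of compartments, the classification head, the 128-dimensional sequence representation, and the 0.1 loss weight are all assumptions for illustration, not values from the paper.

```python
# A minimal sketch (assumptions throughout) of the dual-task training
# signal described above: the primary image-inpainting objective plus an
# auxiliary head that explicitly names the compartment. `model` is the
# hypothetical two-branch LocalizationModel sketched earlier.
import torch
import torch.nn as nn

NUM_COMPARTMENTS = 12  # e.g., nucleus, cytosol, ER, ... (illustrative count)

# Auxiliary classification head; its parameters would also need to be
# passed to the optimizer during training.
compartment_head = nn.Linear(128, NUM_COMPARTMENTS)


def training_step(model, tokens, images, target_map, compartment_label):
    # Primary task: inpaint the protein-localization channel.
    pred_map = model(tokens, images)                  # (B, 1, H, W)
    inpaint_loss = nn.functional.mse_loss(pred_map, target_map)

    # Secondary task: classify the compartment from the sequence
    # representation, sharpening the model's notion of compartments.
    seq_repr = model.seq_encoder(tokens)              # (B, 128)
    logits = compartment_head(seq_repr)
    cls_loss = nn.functional.cross_entropy(logits, compartment_label)

    return inpaint_loss + 0.1 * cls_loss              # weighted sum (assumed)
```

Weighting the auxiliary term low, as in this sketch, keeps inpainting as the primary objective while still forcing the shared representation to encode compartment identity.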
Laboratory experiments validated PUPS’s accuracy in predicting subcellular locations for new proteins in previously unseen cell lines. Compared to a baseline AI method, PUPS consistently demonstrated lower prediction error across tested proteins.
Looking ahead, the team aims to further enhance PUPS to understand protein-protein interactions and make predictions for multiple proteins within a single cell. The long-term vision is to extend PUPS’s capabilities to predict protein localization within living human tissues, moving beyond cultured cells.
This research received vital support from the Eric and Wendy Schmidt Center at the Broad Institute, the National Institutes of Health, the National Science Foundation, the Burroughs Wellcome Fund, the Searle Scholars Foundation, the Harvard Stem Cell Institute, the Merkin Institute, the Office of Naval Research, and the Department of Energy.