Unlocking Nature Through AI | AlphaFold.

The computational prediction of the 3D structure of a protein can be considered one of the main scientific breakthroughs of the 21st century. Proteins are essential molecules for life. They play crucial roles in everything from enzymatic reactions to structural support, from transport (such as hemoglobin and transferrin) to storage, and from immune defense to cell signaling and regulation (such as insulin and growth hormone).

The function of a protein is highly dependent on its structure and shape, and understanding protein folding is crucial for drug discovery, understanding diseases caused by protein misfolding, and designing vaccines against viral infections.

In the 20th century, scientists sought to understand how proteins fold into their functional shapes. In the early 1950s, Linus Pauling, Robert Corey, and Herman Branson theoretically predicted two secondary structures of proteins: the alpha helix and the beta sheet. They used knowledge of chemical bond lengths, angles, and hydrogen bonding patterns to predict these structures, and their models were later confirmed experimentally. In 1958, the full three-dimensional structure of a protein was determined for the first time, when John Kendrew and his colleagues used X-ray crystallography to solve the structure of myoglobin, a globular protein found in muscle cells. Max Perutz, who pioneered the method, solved the structure of hemoglobin shortly afterward. This marked the beginning of experimental protein structure determination.

Since then, scientists have determined protein structures experimentally using X-ray crystallography, NMR spectroscopy, and later cryo-electron microscopy (cryo-EM). In 1971, the Protein Data Bank (PDB) was established as a public repository for experimentally solved protein structures. Initially, it contained just 7 structures. By 2016, it held roughly 120,000. That number sounds large, but compared with the enormous scale of life, it is a tiny fraction: a single bacterium such as E. coli has roughly 4,000 to 5,000 different proteins. Moreover, the experimental methods are expensive and slow, and it took nearly 60 years to solve those 120,000 structures. This gap is what made computational prediction necessary.

In 1994, Dr. John Moult, a computational biologist at the University of Maryland, launched CASP, the Critical Assessment of protein Structure Prediction. The program objectively evaluates how accurately scientists can predict the 3D structure of proteins using computational methods. It is held every two years, and this is how it works: first, experimentalists (usually crystallographers or cryo-EM researchers) submit proteins whose structures have been solved but not yet publicly released. Then, participating teams receive the amino acid sequences of those proteins and must predict the 3D structures using their computational methods. Finally, independent assessors compare the computational predictions with the experimental results.

Accuracy is measured using metrics such as GDT-TS (Global Distance Test, Total Score), RMSD (root-mean-square deviation), and TM-score. GDT-TS ranges from 0 to 100, with higher scores meaning a closer match to the experimental structure. The results are then analyzed, and the top-performing methods are announced. Before 2018, the top GDT-TS scores for free modeling targets (proteins with no close structural template) were between 40 and 50. A GDT-TS of around 90 is generally regarded as comparable to experimental accuracy, so pre-2018 scores were well below this standard.
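To make these metrics more concrete, here is a minimal Python sketch of RMSD and a simplified GDT-TS. It assumes the predicted and experimental coordinates (one point per residue, for example the C-alpha atom) have already been superimposed. The real GDT procedure searches over many superpositions to maximize each count, which this sketch does not attempt. The function names and the toy coordinates are illustrative only, not part of any CASP tooling.

```python
import math

def rmsd(pred, ref):
    """Root-mean-square deviation between two equal-length lists of (x, y, z) points."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(p, r) ** 2 for p, r in zip(pred, ref))
    return math.sqrt(squared / len(pred))

def gdt_ts(pred, ref, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT-TS: mean percentage of residues within each distance cutoff.

    Assumption: pred and ref are already superimposed; no superposition search is done.
    """
    if len(pred) != len(ref) or not pred:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    dists = [math.dist(p, r) for p, r in zip(pred, ref)]
    fractions = [sum(d <= c for d in dists) / len(dists) for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)

# Toy example (made-up coordinates, in angstroms): four residues along a line,
# with the prediction displaced by 0.5, 1.5, 3.0 and 10.0 angstroms respectively.
ref = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
pred = [(0.0, 0.5, 0.0), (3.8, 1.5, 0.0), (7.6, 3.0, 0.0), (11.4, 10.0, 0.0)]

print(f"RMSD   = {rmsd(pred, ref):.2f}")
print(f"GDT-TS = {gdt_ts(pred, ref):.2f}")
```

In this toy case a single badly placed residue dominates the RMSD, while GDT-TS still credits the residues that are close. That robustness to local outliers is one reason CASP reports GDT-TS alongside RMSD.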

In 2018, scientists at Google DeepMind developed a model called AlphaFold 1, an artificial intelligence system for predicting the 3D structure of proteins from their amino acid sequences. The model predicts inter-residue distances and torsion angles from sequence data and then uses these predictions to assemble the protein’s 3D structure.
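As a rough illustration of what "inter-residue distances" means, the following Python sketch computes a pairwise distance map from known residue coordinates and discretizes it into bins. This is only the kind of target representation a distance-predicting model works with, not AlphaFold 1's network or its actual settings. The bin range, the number of bins, and the toy coordinates are assumptions chosen for readability.

```python
import math

def distance_map(coords):
    """Pairwise Euclidean distances between residue positions (symmetric, zero diagonal)."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def to_bins(dmap, d_min=2.0, d_max=22.0, n_bins=10):
    """Discretize each distance into an integer bin index.

    Assumption: the range and bin count are illustrative defaults.
    Distances outside [d_min, d_max) are clamped to the first or last bin.
    """
    width = (d_max - d_min) / n_bins

    def bin_index(d):
        idx = int((d - d_min) // width)
        return max(0, min(n_bins - 1, idx))

    return [[bin_index(d) for d in row] for row in dmap]

# Toy example (made-up coordinates, in angstroms): one point per residue.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (30.0, 0.0, 0.0)]
dmap = distance_map(coords)
bins = to_bins(dmap)

for row in bins:
    print(row)
```

A model of this kind learns to predict such binned distances from sequence alone, and a separate step then searches for 3D coordinates that agree with the predicted distances.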

AlphaFold 1 achieved the highest scores in the 2018 CASP13 competition, with GDT-TS scores around 60 for free modeling targets. Although this was below the gold standard of 90, the result was remarkable, as it demonstrated the potential of AI to solve complex problems in biology and nature.

At the next CASP, CASP14 (2020), Google DeepMind presented its second, groundbreaking model, AlphaFold 2, which delivered a significant improvement and predictions with near-experimental accuracy. Unlike AlphaFold 1's modular system, AlphaFold 2 used an end-to-end deep learning approach. Its key components, the Evoformer and the structure module, process sequence and pair representations to predict 3D protein structures. AlphaFold 2 achieved a median GDT-TS of about 92 across all targets and about 87 on the hardest free modeling targets, reaching a level widely regarded as comparable to experimental accuracy.

AlphaFold became open source in July 2021, allowing the scientific community to freely access its code, models, and predicted protein structures. In doing so, AlphaFold has helped democratize science and inspired a new era of AI in research.

AlphaFold has reduced the time required to obtain a protein structure from months or years to just minutes or a few hours. This acceleration has significantly advanced biological studies, disease research, and drug discovery. Before AlphaFold, approximately 170,000 protein structures had been experimentally determined over a period of about 50 years. With AlphaFold, there are now around 200 million predicted protein structures, roughly 1,000 times more than previously available.

In 2024, half of the Nobel Prize in Chemistry was awarded to David Baker for his work in computational protein design, and the other half jointly to Demis Hassabis and John Jumper of Google DeepMind for their contributions to protein structure prediction with AlphaFold.

In May 2024, Google DeepMind released AlphaFold 3, a model that expands the scope beyond just proteins. It can predict 3D structures not only of standalone proteins, but also of complexes involving proteins, nucleic acids (DNA and RNA), and ligands.

AI is increasingly being used across scientific domains to advance civilization and human consciousness. It is now becoming a co-scientist in many areas of research.