Empirical Grounding of the Foldit/Rosetta Scoring System

Started by Serca

This is a compilation generated by the ChatGPT-5 Thinking Deep Research tool. Please report any inaccuracies.


Empirical Grounding of the Rosetta Protein Scoring System


Rosetta’s protein energy (scoring) function is a central component of the Rosetta modeling suite, guiding de novo folding, design, and docking by assigning an energy-like score to structures. It is an empirical, physics-inspired potential with terms for van der Waals packing, hydrogen bonding, solvation, electrostatics, etc., tuned to reproduce structural features of known proteins. Over the years, Rosetta’s energy function (e.g. “Score12”, “Talaris2013”, “REF15”) has been continually refined using benchmark tests to better correlate with experimental reality. Here we review empirical evaluations of Rosetta’s scoring function – how well it predicts or correlates with real-world experimental data – and highlight strengths (cases of accurate prediction) and limitations (where predictions deviate from experiment). We focus on:

  • Protein folding accuracy (native structure prediction)

  • Protein–protein and protein–ligand binding affinity predictions

  • Structural stability and mutation ΔΔG estimations

  • Foldit (user-driven) vs. automated Rosetta performance

Where possible, we compare Rosetta to other modeling energy functions and summarize key findings in tables for clarity.

Background: The Rosetta Energy Function and Experimental Correlation

Rosetta’s energy function is formulated as a weighted sum of physically motivated and knowledge-based terms (e.g. Lennard-Jones attractive/repulsive, hydrogen bond potentials, torsional preferences, solvation and electrostatic terms). The weights are optimized to make native protein structures low in energy relative to non-native decoys. An important empirical grounding is the ability of the score to discriminate native structures from misfolded alternatives. Early tests showed Rosetta’s all-atom score could rank the native structure among the lowest-energy conformations in many cases. For example, in a 1999 benchmark with 20 proteins, the native conformation had one of the best Rosetta energies in the majority of cases. Later score function updates – for example, the “fa_elec” electrostatics term introduced with Talaris2013 and the anisotropic “lk_ball” solvation refinement introduced with REF2015 – were driven by discrepancies where the previous energy failed to favor native-like conformations. By fitting to empirical data – e.g. backbone and sidechain geometries from high-resolution X-ray crystal structures and known protein–protein interface energetics – the Rosetta score has become better at reproducing experimental observations.
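
To make the weighted-sum form concrete, here is a minimal sketch that scores a structure with the REF2015 weight set and prints the per-term breakdown. It assumes a working PyRosetta installation; the input file name is a placeholder.

```python
# Minimal sketch of scoring with Rosetta's weighted-sum energy via PyRosetta.
# Assumes PyRosetta is installed and licensed; "native.pdb" is a placeholder.
from pyrosetta import init, pose_from_pdb, create_score_function

init("-mute all")                        # start Rosetta with quiet logging
pose = pose_from_pdb("native.pdb")       # load a structure into a Pose
sfxn = create_score_function("ref2015")  # the REF2015 weight set

total = sfxn(pose)   # total score: the weighted sum of all terms, in REU
sfxn.show(pose)      # per-term breakdown (fa_atr, fa_rep, fa_sol, fa_elec, hbond_*, ...)
print(f"total_score = {total:.2f} REU")
```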

Despite being reported in arbitrary “Rosetta Energy Units” (REU), Rosetta scores often correlate qualitatively with physical measures. Lower (more favorable) Rosetta energy typically implies higher thermodynamic stability in vitro. However, the correlation is imperfect – an ongoing theme is that while Rosetta’s energy function captures many important contributions, it is an approximate model and can mis-rank states when factors like entropy or long-range electrostatics are dominant (limitations discussed below).

Table 1 summarizes how Rosetta’s scoring has been empirically validated or calibrated against experiments across several applications:

| Application | Experimental correlate | Rosetta performance | Notes |
|---|---|---|---|
| Structure prediction (folding) | Native X-ray/cryo-EM structures (Å RMSD) | Often near-native (<3–5 Å RMSD) for small proteins; top ranks in CASP free modeling | Struggles with large proteins; best with homologous or co-evolution data |
| Decoy discrimination | Energy of native vs. misfolded decoys | Native usually among the lowest energies; score updates (Talaris2013, REF2015) widened the native-vs-decoy energy gap | Some decoys still score as well as native – indicates missing energy terms |
| Binding affinity ΔG | Experimental K_d or ΔG of complexes | Moderate correlation (R ~0.5) with interface score; correctly identifies many hotspot residues | Predicts rank order better than absolute energies; less accurate for polar interactions |
| Mutational stability ΔΔG | Experimental folding ΔΔG (kcal/mol) | Correlation ~0.5–0.6 in benchmarks; better at separating destabilizing from stabilizing mutations | Newer protocols (Cartesian ΔΔG) improved accuracy slightly (~0.6–0.7) for core mutations |
| Foldit human-guided | No direct correlate; comparative performance | Players solved a challenging crystal structure that automated Rosetta failed to refine, and introduced strategies later incorporated into Rosetta | Human intuition can escape local minima where the energy landscape is rugged |

Table 1: Empirical performance of Rosetta’s scoring function, highlighting correlations with experimental data or comparative benchmarks.

Below, we delve into each area in detail, citing specific studies.

Protein Folding Accuracy and Native Structure Prediction

One of Rosetta’s hallmark achievements has been the de novo prediction of protein tertiary structures. In the CASP (Critical Assessment of Structure Prediction) blind competitions, Rosetta-based methods have consistently produced among the most accurate models for free modeling (ab initio) targets. For example, Rosetta was able to predict the fold of a small 88-residue protein with 1.6 Å Cα RMSD to the X-ray structure, an early milestone reported by Bradley et al. in 2005. Generally, for proteins <100 amino acids with a single domain, Rosetta can often generate near-native models (within 2–4 Å RMSD), especially when secondary structure content is high and good fragment templates are available.

Strengths: Rosetta’s fragment assembly plus full-atom refinement protocol, guided by the energy function, has succeeded in recapitulating complex topologies without homologous templates. Studies found that in many cases the native crystal structure lies in a deep energy minimum of the Rosetta score landscape. Rosetta’s ability to discriminate near-native conformations from incorrect “decoys” has been demonstrated on public decoy sets: the native conformation typically scores better (lower energy) than >90% of non-native decoys for a given sequence. This indicates the energy function encodes key determinants of protein folding (hydrophobic packing, hydrogen-bond networks, etc.) that real proteins evolved.
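
A decoy-discrimination check of the kind described above can be expressed in a few lines. This is an illustrative sketch, not a published benchmark protocol; it assumes PyRosetta, and the file paths are placeholders.

```python
# Illustrative decoy-discrimination check: score the native structure and a
# directory of decoys with the same score function, then report how many
# decoys score below (better than) the native. Paths are placeholders.
import glob
from pyrosetta import init, pose_from_pdb, create_score_function

init("-mute all")
sfxn = create_score_function("ref2015")

native_score = sfxn(pose_from_pdb("native.pdb"))
decoy_scores = [sfxn(pose_from_pdb(path)) for path in glob.glob("decoys/*.pdb")]

better = sum(1 for s in decoy_scores if s < native_score)
frac_beaten = 1.0 - better / len(decoy_scores)
print(f"native scores better than {frac_beaten:.1%} of {len(decoy_scores)} decoys")
```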

Notably, hybrid methods that incorporate experimental data have shown excellent accuracy. For instance, Rosetta has been combined with cryo-EM maps to build models of large complexes. Wang et al. (2015) showed that using Rosetta to flexibly fit into a 4.5 Å cryo-EM density map of the proteasome yielded near-atomic accuracy models consistent with crystallography. The energy function helps resolve ambiguities in medium-resolution density by favoring physically plausible conformations (correct stereochemistry, rotamers, etc.), thereby improving structures derived from EM data.

Limitations: Despite successes, Rosetta’s folding accuracy declines for larger proteins (>150 residues) or those lacking clear secondary structure patterns. The energy landscape for larger systems is rugged and Rosetta can get trapped in non-native minima, especially if the native state is marginally lower in energy than misfolded states (or if important long-range electrostatics or entropy factors are missing in the scoring function). An empirical discrepancy is that the lowest-energy model Rosetta finds is not always the most native-like – users often generate large decoy ensembles and must post-filter by clustering or external scoring. For example, in CASP experiments some Rosetta models within 5 Å of native had slightly higher energy than incorrect models, showing the energy function’s resolution limits. Rosetta also sometimes over-stabilizes compact, overly hydrophobic conformations that in reality might aggregate or misfold; this reflects that the implicit solvation model, while generally effective, is imperfect.

Furthermore, the emergence of AlphaFold (DeepMind, 2021) underscored limitations of physics-based scoring: AlphaFold can accurately predict large protein structures by leveraging evolutionary information, whereas Rosetta (without co-evolutionary restraints) struggled on those. Nonetheless, Rosetta’s energy is still used in refinement of AlphaFold models and in areas like loop modeling where local energy optimization is needed.

In summary, Rosetta’s energy function has empirical grounding in protein folding: it often identifies native-like conformations as low-energy, validated by hundreds of retrospective tests on known structures. Its strengths lie in capturing key interactions for small domains, whereas its weaknesses become evident with increasing complexity or when novel physics (e.g. metal coordination, membrane environment) come into play – areas where specialized scoring terms or supplemental experimental data are required.

Binding Affinity Predictions and Docking

Rosetta’s scoring function is also applied to protein–protein and protein–ligand interactions, where it underpins protocols like RosettaDock and RosettaLigand. Here, the energy function aims to predict binding affinity (how strongly two molecules associate) by computing interaction energies of the complex vs. unbound components. Several studies have benchmarked Rosetta’s ability to rank binding poses and to estimate changes in binding free energy.

Strengths: Rosetta’s interface energy (often denoted “dG_separated” or “I_sc”) has shown moderate correlation (Pearson R ~0.5) with experimental binding affinities across dozens of complexes. For example, a benchmark by Kortemme et al. (2004) examined computational alanine scanning on protein interfaces: Rosetta correctly identified key hotspot residues (hotspots are residues where alanine mutation greatly reduces binding) about 70% of the time, and predicted the magnitude of ΔΔG for alanine mutants with R ~0.6 compared to experimental values. This indicates the scoring function captures many of the enthalpic contributions to binding – hydrophobic burial, loss of solvent-exposed surface, hydrogen bonds, etc. Indeed, Rosetta’s success in CAPRI (Critical Assessment of PRedicted Interactions) docking trials has been notable; it often ranks near the top for identifying near-native docked poses, reflecting that the energy function’s minimum is frequently at the correct binding mode.
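
The “separated” interface score mentioned above can be approximated by scoring the complex, translating one partner far away along the inter-chain rigid-body jump, and rescoring. The sketch below shows only that bookkeeping (real protocols also repack side chains after separation); it assumes PyRosetta, and the input file is a placeholder.

```python
# Sketch of an interface score in the spirit of dG_separated: bound score
# minus the score after pulling the partners far apart. Assumes a two-chain
# complex where jump 1 connects the chains; "complex.pdb" is a placeholder.
from pyrosetta import init, pose_from_pdb, create_score_function
from pyrosetta.rosetta.protocols.rigid import RigidBodyTransMover

init("-mute all")
sfxn = create_score_function("ref2015")

bound = pose_from_pdb("complex.pdb")
e_bound = sfxn(bound)

apart = bound.clone()
mover = RigidBodyTransMover(apart, 1)  # translate along rigid-body jump 1
mover.step_size(500.0)                 # 500 A away: effectively unbound
mover.apply(apart)
e_unbound = sfxn(apart)

print(f"interface score ~ {e_bound - e_unbound:.2f} REU (more negative = tighter)")
```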

Rosetta has also been used to design new interfaces with measurable affinity. In a landmark example, Rosetta-designed cytokine inhibitors achieved nanomolar binding; the designs with lowest Rosetta energies generally showed the highest experimental binding affinities. This empirical outcome – that lowering Rosetta score correlates with tighter binding in the lab – supports the energy function’s relevance. Similarly, de novo protein binders designed to target influenza and IL-2 (published by Fleishman et al., 2011, and Correnti et al., 2014) were selected based on Rosetta interface scores and many bound with low-nanomolar K_d as predicted.

Limitations: Absolute binding free energy prediction remains challenging. Rosetta’s energy is not an exact physical free energy, and it omits entropic contributions (e.g. conformational entropy loss upon binding, dynamic flexibility) and sometimes underestimates long-range electrostatics. As a result, while Rosetta can often correctly rank-order variants (e.g. which mutation strengthens vs weakens binding), the quantitative correlation to experimental ΔG has a significant error bar (often a standard error of a few kcal/mol). For instance, in a 2014 benchmark of ~240 protein–protein complexes, the correlation between Rosetta’s computed binding energy and experimental ΔG was only ~0.5, with many outliers. Highly polar interactions or those requiring water-mediated contacts are often mis-scored. Moreover, Rosetta tends to be calibrated on buried, well-packed interfaces; for complexes where binding induces large conformational changes or involves membrane components, additional modeling and specialized terms are needed.
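
The rank-order-versus-absolute-value distinction above is easy to see numerically: a prediction set can order variants perfectly (Spearman rank correlation of 1) while fitting the measured values only loosely (lower Pearson correlation). The data below are invented purely for illustration.

```python
# Toy illustration: rank-order agreement (Spearman) vs. linear agreement
# (Pearson) between predicted and measured binding energies. Invented data
# chosen so the rank order is perfect but the linear fit is weaker.
from scipy.stats import pearsonr, spearmanr

predicted_reu = [-40.0, -18.0, -12.0, -9.0, -8.5]  # Rosetta interface scores
measured_kcal = [-9.5, -8.8, -7.0, -5.5, -5.2]     # experimental dG values

print("Pearson R :", round(pearsonr(predicted_reu, measured_kcal)[0], 2))
print("Spearman r:", round(spearmanr(predicted_reu, measured_kcal)[0], 2))
```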

Another discrepancy arises in small-molecule ligand docking: RosettaLigand’s score has been used to predict ligand binding affinities, but studies (e.g. the 2016 D3R Grand Challenge) found only weak correlations (R < 0.4) with experimental IC50, partly because the force field for small molecules and the lack of explicit entropy/water modeling reduce accuracy. The Rosetta energy function was partially re-fit in 2016 to better handle small-molecule interactions (the REF2015 scoring improvements included small-molecule calibration), which did improve virtual screening enrichments, yet it remains less predictive than bespoke cheminformatics scoring functions for absolute affinity.

In summary, Rosetta’s energy function, when applied to binding, captures many of the static interaction energetics – enough to be useful in design and ranking – but it does not fully reproduce the complexity of binding thermodynamics. Comparative studies show it performs on par with physics-based force fields (like CHARMM or AMBER MM-PBSA) in ranking mutants’ binding effects, and often better in accounting for side-chain rearrangements, but it cannot reliably predict exact binding free energies in all cases.

Structural Stability and ΔΔG of Mutations

Predicting the effect of amino acid mutations on protein stability is a stringent test of an energy function’s grounding in reality. Rosetta has a specific protocol for this (often called ddG prediction), which computes the difference in energy between wild-type and mutant structures. Researchers have assessed how well Rosetta’s computed ΔΔG correlates with experimentally measured changes in folding free energy or melting temperature.
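
At its simplest, the wild-type-versus-mutant comparison described above reduces to scoring both states after equivalent local repacking. The sketch below shows that core logic only; the published protocols (Kellogg et al. 2011; the Cartesian ΔΔG variant) add backbone minimization and averaging over repeated trials. It assumes PyRosetta, and the file name and residue number are placeholders.

```python
# Minimal stability-ddG sketch: ddG = E(mutant) - E(wild type), with both
# states repacked the same way so the comparison is balanced. Published
# protocols add minimization and averaging, omitted here for brevity.
from pyrosetta import init, pose_from_pdb, create_score_function
from pyrosetta.toolbox import mutate_residue

init("-mute all")
sfxn = create_score_function("ref2015")

wt = pose_from_pdb("protein.pdb")               # placeholder input structure
wt_aa = wt.residue(57).name1()                  # wild-type identity at position 57
mutate_residue(wt, 57, wt_aa, pack_radius=8.0)  # repack WT the same way as mutant
e_wt = sfxn(wt)

mut = pose_from_pdb("protein.pdb")
mutate_residue(mut, 57, "A", pack_radius=8.0)   # hypothetical X57A mutation
e_mut = sfxn(mut)

print(f"predicted ddG = {e_mut - e_wt:+.2f} REU (positive = destabilizing)")
```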

Strengths: Rosetta’s stability predictions have shown significant predictive power in large mutation datasets. For instance, Kellogg et al. (2011) evaluated Rosetta on a set of 134 mutations with known ΔΔG and reported a Pearson correlation around 0.5–0.6 between Rosetta scores and experimental ΔΔG. Rosetta correctly classified ~80% of mutations as stabilizing vs. destabilizing (sign of ΔΔG), which is far above random and comparable to other tools like FoldX. The energy function thus does capture the major contributions to stability: mutations that remove favorable buried hydrophobics or introduce clashes get positive Rosetta ΔΔG (destabilizing) in agreement with experiment, while mutations that fix packing defects or introduce new hydrogen bonds often get negative ΔΔG (stabilizing). A success story was the computational stabilization of enzymes – Rosetta-guided mutations in an enzyme increased its melting temperature by >15°C, and the most stabilizing mutants were those predicted with the largest favorable Rosetta energy change, demonstrating a cause-effect alignment with experimental stability.

Improvements in the score function have incrementally improved ΔΔG accuracy. The REF2015 scoring function update (Alford et al., 2017) included rebalanced hydrogen bond and van der Waals terms that led to better correlation with mutation data. Likewise, a newer Cartesian ddG protocol (2018) that allows backbone relaxation reports correlations up to ~0.65 on certain subsets of mutations (particularly buried core mutations where Rosetta’s atomic packing model is most valid). These empirical adjustments reflect Rosetta being calibrated on known protein thermodynamic data.

Limitations: The remaining ~40–50% of variance in stability change that Rosetta does not explain can be attributed to factors the energy function handles less well. Conformational entropy is one – Rosetta’s scoring mostly reflects enthalpic terms. Thus, mutations that rigidify a protein (entropy loss) or increase loop flexibility (entropy gain) might have smaller or opposite stability effects in reality than Rosetta predicts. Additionally, Rosetta assumes a single fixed backbone for comparison (unless using specialized protocols); if a mutation triggers a significant conformational rearrangement or misfolding that Rosetta’s fixed-backbone comparison misses, the prediction can be wrong. There are documented cases where Rosetta predicted a mutation to be stabilizing (favorable energy) but experimentally it was destabilizing because the mutation caused a subtle shift to an alternate, less favorable conformation that Rosetta’s single-structure minimization did not capture.

Surface mutations, which often involve subtle changes in solvent entropy or polar interactions, are another weak spot – Rosetta’s implicit solvation model may overestimate penalties for polar groups or not fully capture solvation entropies. As a result, its ΔΔG predictions for buried-core mutations are much better than for surface-exposed mutations (the latter often appearing as false positives/negatives in benchmarks). Comparative evaluations (e.g. Kellogg 2011; Park et al. 2016) have noted that Rosetta and other physics-based tools all struggle particularly with mutations involving proline, glycine, or salt bridges in flexible regions, which involve unique backbone entropy considerations.

In summary, Rosetta’s energy function empirically correlates with protein stability to a useful extent. It has been successfully used to engineer more stable proteins (by ranking mutation combinations) and to analyze the energetic impacts of disease mutations. Yet, users must be aware of its limits – a correlation of ~0.5–0.6 means individual ΔΔG predictions might err by a few kcal/mol, and thus experimental validation remains crucial for final assessment of stability.

Foldit: Human-Guided Folding vs. Automated Rosetta

Foldit is an online protein folding game that uses Rosetta’s energy function under the hood, allowing citizen scientists to interactively manipulate protein models. Foldit provided a unique test of Rosetta’s energy landscape: do human strategies guided by the score outperform purely algorithmic searches? Several high-profile studies suggest yes – human intuition, coupled with Rosetta’s scoring, solved problems that Rosetta running alone did not.

A striking example was the crystal structure of the Mason-Pfizer monkey virus (M-PMV) retroviral protease, solved in 2011 with Foldit players’ help. This viral protein had resisted automated structure determination. Foldit players, using tools to wiggle and rebuild loops while optimizing Rosetta’s energy, produced a model that fit experimental X-ray diffraction data and correctly identified the protein’s active site conformation. This model was then confirmed by crystallographers, marking the first time gamers solved a protein structure. Rosetta’s automated algorithms had struggled with this protein (likely due to a tricky symmetric homodimer arrangement and local minima), but players succeeded by recognizing non-obvious moves (like breaking and remaking a particular helix interaction) to escape a false minimum – something the standard search didn’t try. The energy function validated the players’ solution as having a deep energy drop, illustrating that Rosetta’s score was still a reliable indicator of correctness once the right conformation was sampled.

In another study (“Algorithm discovery by protein folding game players”, published in PNAS 2011), Foldit players actually invented new optimization strategies that were then incorporated into Rosetta. For instance, players developed a technique called “rubber banding” to pull distant parts of the protein together during folding. This helped the search overcome energy barriers and was formalized into Rosetta’s code as a novel conformation sampling method. Benchmark tests showed the player-devised algorithm improved Rosetta’s performance on certain difficult targets – an empirical testament that human insight expanded the efficacy of the scoring function by guiding it more effectively through its own landscape.

Automated Rosetta vs Human: Generally, Rosetta’s stochastic search does a very thorough job on many targets, but Foldit excels when the search space has many traps (suboptimal local minima). Humans can apply high-level reasoning or pattern recognition to try moves the algorithm wouldn’t, and then Rosetta’s energy function provides immediate feedback. In CASP competitions, hybrid approaches (Rosetta + Foldit) have sometimes produced the best models – for example, in CASP9 a Foldit team achieved top accuracy on a difficult target by iteratively refining Rosetta outputs. These cases highlight that the Rosetta energy function is effective as a guidance metric (it correctly identifies the best models’ lower energy) but search is the challenge – human players expanded the search where Rosetta’s automated exploration was insufficient.

Importantly, these successes also revealed limitations of the energy model: in some puzzles, players created physically odd but lower-energy structures according to Rosetta that were not actually correct – exploiting “holes” in the score function (e.g. over-packing a core in a way real proteins wouldn’t fold). This led developers to identify and fix certain issues (like adding penalties for overpacking or explicit disulfide geometry terms). Thus, Foldit not only showed where the energy function succeeds (when guided properly) but also pinpointed where it could be deceived into non-physical predictions, prompting further empirical tuning.

Comparisons to Other Scoring Models

Rosetta’s energy function has been compared to both classical physics-based force fields and knowledge-based statistical potentials in various studies:

  • Versus physics force fields (AMBER, CHARMM): Rosetta’s all-atom scoring is less detailed (e.g. a distance-dependent dielectric for electrostatics, implicit solvation rather than explicit water) but is faster for conformational search. In decoy discrimination tests, Rosetta’s score has performed comparably or better than many physics-based potentials, likely because it was explicitly trained to recognize native-like features. For example, in a test on the Decoys ‘R’ Us set, Rosetta outperformed a raw AMBER force field in picking the native structure out of decoys, attributed to Rosetta’s inclusion of empirical context-dependent terms (like knowledge-based rotamer and pair potentials). However, classical force fields can better capture fine energetic differences near the native state when used with thorough sampling (e.g. molecular dynamics with solvent can evaluate stability in ways Rosetta’s static scoring cannot). In practice, Rosetta is often used to generate candidates, and physics-based MD is used to refine or double-check the stability of top candidates – a complementary approach leveraging strengths of both.

  • Versus knowledge-based potentials: Potentials like DFIRE, DOPE, or statistical contact potentials are derived purely from structure databases. Rosetta’s hybrid approach often gave it an edge in protein design benchmarks. A 2013 study comparing design scoring functions found Rosetta’s energy better predicted which designed sequences would fold experimentally than a purely knowledge-based score. This is likely because Rosetta’s function includes explicit physics terms (e.g. orientation-dependent hydrogen bonds) that pure statistical scores lack. On the other hand, knowledge-based scores can be smoother and sometimes easier to optimize, whereas Rosetta’s detailed landscape can be rough. Some folding methods (e.g. trRosetta) combine co-evolution data with simpler potentials – before AlphaFold2, these hybrid methods sometimes outperformed classical Rosetta in folding by simplifying the scoring of long-range contacts.

  • Machine-learning potentials: The advent of AlphaFold2 (which effectively encapsulates scoring via a deep learning model of structure accuracy) has set a new bar. Direct comparisons are difficult because AlphaFold doesn’t output a traditional energy, but rather a confidence metric (pLDDT) that correlates with accuracy. In head-to-head tests on difficult targets, AlphaFold’s pLDDT scoring is vastly more predictive of actual accuracy than Rosetta’s energy is – essentially because AlphaFold’s “energy” function was learned from evolutionary and structural data on thousands of proteins. Nonetheless, Rosetta’s energy remains very relevant for scenarios AlphaFold doesn’t handle (e.g. designing a novel protein or a protein-ligand complex with no prior example). For such tasks, comparisons of Rosetta to emerging ML-based scoring (like RoseTTAFold’s implicit scoring or statistical latent scores from generative models) are ongoing research. Early results indicate that ML models can rank near-native vs incorrect folds with high accuracy too, sometimes outperforming Rosetta’s physics-based score for those classification tasks – but ML models often need a domain of applicability (e.g. lots of similar sequences), whereas Rosetta can be applied de novo.

In summary, Rosetta’s scoring function is competitive with other approaches and distinctive in being a general-purpose, combinable energy for structure prediction, design, and docking. Its empirical calibration to protein data gives it an advantage in protein-like scenarios, but specialized methods (including machine learning) can surpass it in specific predictive power when sufficient data are available. Many modern protocols actually combine Rosetta with others – for example, using a co-evolution-based potential to guide Rosetta’s search (as in trRosetta) or using Rosetta energy to relax AlphaFold models. This underscores that Rosetta’s energy, while powerful, is not a “one-size-fits-all” but is best used in concert with orthogonal information for state-of-the-art results.

Strengths and Limitations Recap

Finally, we highlight key strengths and limitations of the Rosetta energy function as revealed by empirical studies:

  • Accurately captures core packing and hydrogen bonding: Rosetta’s strongest suit is modeling tightly packed protein cores and regular hydrogen bond networks (helices, sheets). In these areas, its energy function aligns well with high-resolution crystallographic data, identifying native-like packing geometries and secondary structure stabilizing energies. Designed proteins with low Rosetta energies often show ultrastable cores and high melting temperatures, confirming the energy function’s ability to encode these stabilizing factors.

  • Known weaknesses in modeling entropy and long-range interactions: The scoring function is largely enthalpic. It doesn’t explicitly model backbone entropy loss or solvent entropy, and uses an implicit solvation approximation. Thus, scenarios where entropy dominates (e.g. unfolded-state considerations in stability, or backbone flexibility in binding) can lead to discrepancies. Rosetta might confidently predict a very low-energy state that is enthalpically favorable but in reality is not adopted by the protein due to entropic cost – a classic false positive. Efforts like adding backbone flexibility in scoring (e.g. the Cartesian ΔΔG protocol with backbone relaxation, 2018) and improved electrostatics are addressing this, but entropy remains somewhat outside Rosetta’s scope (usually handled by empirical training rather than first principles).

  • Empirical correlation, not absolute energy: Rosetta energies are in arbitrary units and require empirical interpretation. A difference of 5 REU might correspond roughly to ~1 kcal/mol in some contexts, but not uniformly. The function is calibrated so that relative differences correlate with outcomes (folded vs unfolded, bound vs unbound), but one should not take a Rosetta energy difference of X and directly equate it to an experimental ΔG without calibration. For example, a protein design that Rosetta scores 10 REU lower than another is likely more stable, but we cannot say it will be 10 kcal/mol more stable – the actual experimental difference might be, say, 5 kcal/mol. This was seen in the Rocklin et al. (2017) massive protein design study, where thousands of designed mini-proteins were tested: while lower Rosetta score strongly enriched for folded proteins, the correlation between Rosetta score and exact thermal stability (Tm) among the folded ones was only moderate. Thus, Rosetta is excellent for qualitative ranking and selection, but less so for precise quantitative prediction (a minimal calibration sketch follows this list).

  • Continual improvement through experimental feedback: A major strength of Rosetta is that it’s not static – developers improve it by comparing predictions to new experimental data. When discrepancies are found (e.g. a designed protein that Rosetta predicted stable but was not), those inform new terms or reweighting. The community has added terms for things like π–π interactions, cation–π, halogen bonds, etc., when standard terms proved insufficient. For instance, the “fa_stack” term was introduced after noticing under-prediction of aromatic stacking interactions in experimental datasets. Likewise, the solvation model was refined (Talaris2013) when experimental water networks showed inconsistencies with score outcomes. This iterative empirical grounding means Rosetta’s agreement with reality has generally increased over time – as evidenced by improved benchmarks (e.g. better success rate in predicting mutation outcomes between 2007 and 2017 versions).
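
As flagged in the “empirical correlation, not absolute energy” item above, REU differences can be mapped onto physical units only by fitting against experiment. The sketch below fits a simple linear calibration on invented placeholder numbers; a real calibration would use a benchmark set of curated ΔΔG measurements.

```python
# Hedged illustration of calibrating REU against experiment: fit a linear map
# from Rosetta ddG (REU) to measured ddG (kcal/mol), then report the slope
# and Pearson correlation. All numbers below are invented placeholders.
import numpy as np

rosetta_reu = np.array([1.8, 4.2, -0.5, 3.1, 0.9, 5.6])  # predicted ddG, REU
exper_kcal = np.array([0.7, 2.1, -0.3, 1.4, 0.6, 2.5])   # measured, kcal/mol

slope, intercept = np.polyfit(rosetta_reu, exper_kcal, 1)
r = np.corrcoef(rosetta_reu, exper_kcal)[0, 1]
print(f"ddG_kcal ~ {slope:.2f} * REU + {intercept:.2f} (Pearson R = {r:.2f})")
```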

In conclusion, the Rosetta protein scoring system is firmly rooted in empirical data: it has been validated against thousands of real protein structures, binding measurements, and mutational assays. Its predictive successes – from folding small proteins de novo, to designing ultra-stable enzymes and binders with experimentally verified function, to assisting cryo-EM model building – all speak to an energy function that correlates with the fundamental biophysics of proteins. At the same time, researchers are keenly aware of its imperfections, continuously benchmarking and revealing where Rosetta’s predictions diverge from reality, which in turn drives further refinement of the model. The result is a scoring function that, while not perfect, has proven to be one of the most reliable and broadly useful in computational structural biology, especially when used with expert knowledge or complementary methods to cover its gaps.

References:

Early Rosetta folding accuracy: Bradley, P.; Misura, K.M.S.; Baker, D. Toward high-resolution de novo structure prediction for small proteins. Science. 2005;309(5742):1868–1871. DOI: 10.1126/science.1113801.

Rosetta computational alanine scanning (validated vs. experimental ΔΔG at PPIs): Kortemme, T.; Kim, D.E.; Baker, D. Computational alanine scanning of protein–protein interfaces. Sci. STKE. 2004;(219):pl2. DOI: 10.1126/stke.2192004pl2.

Stability ΔΔG benchmark: Kellogg, E.H.; Leaver-Fay, A.; Baker, D. Role of conformational sampling in computing mutation-induced changes in protein structure and stability. Proteins. 2011;79(3):830–838. DOI: 10.1002/prot.22921.

REF2015 energy function improvements: Alford, R.F.; Leaver-Fay, A.; Jeliazkov, J.R.; O’Meara, M.J.; et al. The Rosetta All-Atom Energy Function for Macromolecular Modeling and Design. J. Chem. Theory Comput. 2017;13(6):3031–3048. DOI: 10.1021/acs.jctc.7b00125.

Foldit multiplayer folding success: Cooper, S.; Khatib, F.; Treuille, A.; et al. Predicting protein structures with a multiplayer online game. Nature. 2010;466(7307):756–760. DOI: 10.1038/nature09304.

Foldit players solve protease structure: Khatib, F.; DiMaio, F.; Foldit Contenders Group; Foldit Void Crushers Group; et al. Crystal structure of a monomeric retroviral protease solved by protein folding game players. Nat. Struct. Mol. Biol. 2011;18(10):1175–1177. DOI: 10.1038/nsmb.2119.

Foldit players improve algorithm: Khatib, F.; Cooper, S.; Tyka, M.D.; Xu, K.; et al. Algorithm discovery by protein folding game players. Proc. Natl. Acad. Sci. U.S.A. 2011;108(47):18949–18953. DOI: 10.1073/pnas.1115898108.