The problem of protein design

Started August 24, 2018 by bkoep

bkoep Staff Lv 1

August 24, 2018

This is the first of a three-part blog post. In the first part, we’re going to review the concept of energy landscapes, which some of you may already be familiar with. In the second part, we’ll discuss how a concept from physics, called a partition function, can help us think about energy landscapes. In the last part, we’ll propose a way that we might use these concepts of energy landscapes and partition functions to improve protein design in Foldit.

The energy landscape

There’s a problem with the way we currently design proteins in Foldit—and not just in Foldit, but also in Rosetta. In fact, it’s a problem in any protein design strategy that optimizes the absolute energy of the design. This strategy is the premise of a Foldit design puzzle. The Foldit score measures the absolute energy of a solution (with a negative multiplier), so that when players compete to find solutions with the highest score, they are actually competing to find solutions with the lowest absolute energy.

However, the success of a protein design (i.e. whether or not the protein folds) does not depend only on the absolute energy of the design. Rather it depends on the protein’s energy landscape. The energy landscape is a concept we use to think about all the possible ways that a string of amino acids can fold. As any Foldit player knows, there are a lot of different ways to fold up a string of amino acids, and they all have different energies (or Foldit scores). We can imagine the energy landscape as a surface where every (x,y) coordinate represents a different fold, or state, and the height of the surface (the z-coordinate) represents the energy of that state. In some places there will be hills, which represent states with a high energy (low Foldit score), and in other places there will be valleys, where folds have a low energy (high Foldit score).

Conceptual illustration of a protein energy landscape, from Dill, K.A. and MacCallum, J.L. (2012)

One of the reasons we like the analogy of energy landscapes is that we intuitively understand how things tend to “prefer” low points in the landscape. If you place an object randomly on the energy landscape, it will tend to slide downhill, from a high-energy state to a low-energy state. If we consider the effect of thermal motion that is constantly jostling the object around (imagine a Mexican jumping bean that randomly jumps around the landscape), then the object will explore all the different valleys of the energy landscape. Nevertheless, the Mexican jumping bean will spend the most time in the deepest valleys of the landscape.

A protein behaves the same way in its energy landscape. At room temperature, there is a considerable amount of thermal motion that allows the protein to explore its energy landscape, although the protein will spend the most time in the states with lowest energy. Every amino acid sequence has a different energy landscape, with different valleys in different places. When you mutate amino acids in a Foldit puzzle to find higher scores for your design, what you’re really doing is looking for an energy landscape where your design is in a deeper valley. However, the Foldit score only tells you about the energy of your designed folded state—or the “depth” of your desired valley. What we’re not considering in Foldit is the rest of the landscape, and whether there might be other low-energy “decoy states”—other deep valleys for your protein to explore.
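To make the jumping-bean picture concrete, here is a minimal sketch of a random walker on a toy landscape. It is not how Foldit or Rosetta work: the landscapes are made-up numbers, the states sit on a small ring, and the hop rule is the standard Metropolis acceptance criterion, which we are assuming here as a simple stand-in for thermal motion. The names `metropolis_occupancy`, `single_valley` and `with_decoy` are hypothetical.

```python
import math
import random

def metropolis_occupancy(energies, kT=1.0, steps=200_000, seed=0):
    """Let a walker hop between neighboring states and report the
    fraction of steps it spends in each state."""
    rng = random.Random(seed)
    n = len(energies)
    state = rng.randrange(n)
    visits = [0] * n
    for _ in range(steps):
        # Propose a hop to the left or right neighbor (the ends wrap around).
        proposal = (state + rng.choice((-1, 1))) % n
        delta = energies[proposal] - energies[state]
        # Downhill hops are always accepted; uphill hops only sometimes.
        if delta <= 0 or rng.random() < math.exp(-delta / kT):
            state = proposal
        visits[state] += 1
    return [v / steps for v in visits]

# Toy landscapes (arbitrary energy units, invented for illustration).
single_valley = [0.0, -1.0, -5.0, -1.0, 0.0, -1.0, -2.0, -1.0]  # one deep valley at index 2
with_decoy    = [0.0, -1.0, -5.0, -1.0, 0.0, -1.0, -5.0, -1.0]  # equally deep valley at index 6

occ_single = metropolis_occupancy(single_valley)
occ_decoy = metropolis_occupancy(with_decoy)

print("single valley:", [round(f, 2) for f in occ_single])
print("with decoy:   ", [round(f, 2) for f in occ_decoy])
```

In the first landscape the walker should spend most of its time at index 2. In the second, the valley at index 2 is exactly as deep as before, but the walker now splits its time between the two valleys.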

This is a difficult problem to solve because the energy landscape for a protein is vast. It’s difficult to account for the decoy states because we don’t know what they might look like. We don’t know where to search in the energy landscape for other low-energy valleys, and the landscape is too big to search exhaustively.

The search for decoys

As many of you are probably aware, a lot of the recent De-novo Freestyle prediction puzzles have targeted Foldit player-designed proteins. The purpose of these puzzles is to look for low-energy decoy states, or alternative valleys in the energy landscape. We already run Foldit designs through Rosetta@home to look for decoy states—and for the most part, Rosetta@home seems to do a pretty good job. But occasionally Foldit players find solutions that Rosetta@home misses.

In the following example we're going to pick on fiendish_ghoul, because this energy landscape problem is clearly illustrated by two of their designs, shown below:

The protein on the left is a design originally from Puzzle 1331; the protein on the right is a design from Puzzle 1239. Beneath each cartoon protein structure is a scatter plot with the results from corresponding De-novo Freestyle puzzles that we posted using the sequence of each design. Each black point represents a solution, plotted with respect to its RMSD to the folded state (x-axis) and its energy (y-axis). Together these points give us a profile of the energy landscape for each protein. We see that the design on the left has a “funnelled” landscape, such that the lowest-energy solutions are those close to the folded state (RMSD close to zero) and solutions very different from the folded state (large RMSD) all have higher energies. In the design on the right, however, Foldit players identified a number of decoy states that are very different from the folded state (large RMSD), and have energy just as low as the folded state. These decoy states (marked with colored circles in the scatter plot) appear as “valleys” in the energy landscape of the protein.
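Reading a plot like this can be sketched in a few lines of code. The example below is only an illustration: the cutoffs and the two lists of (RMSD, energy) points are invented, not the real puzzle data, and `find_decoys` is a hypothetical helper rather than anything in Foldit or Rosetta.

```python
def find_decoys(solutions, near_cutoff=2.0, far_cutoff=5.0, energy_window=2.0):
    """Return solutions that are far from the designed state but have an
    energy within `energy_window` of the best near-design solution.

    `solutions` is a list of (rmsd_to_design, energy) pairs.
    All three cutoffs are illustrative assumptions.
    """
    near_energies = [e for r, e in solutions if r <= near_cutoff]
    if not near_energies:
        raise ValueError("no solutions near the designed state")
    reference = min(near_energies)  # depth of the designed valley
    return [(r, e) for r, e in solutions
            if r >= far_cutoff and e <= reference + energy_window]

# Made-up points: (RMSD in angstroms, energy in arbitrary units).
funnelled = [(0.5, -100.0), (1.2, -97.0), (4.0, -80.0), (8.0, -60.0), (12.0, -55.0)]
ambiguous = [(0.6, -100.0), (1.5, -96.0), (7.0, -99.0),
             (10.0, -100.5), (10.2, -98.5), (13.0, -70.0)]

print(find_decoys(funnelled))  # no far-away solution competes with the design
print(find_decoys(ambiguous))  # three far-away solutions are just as low in energy
```

A funnelled landscape gives an empty list, while a landscape with competing valleys returns the low-energy solutions at large RMSD.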

The cartoon structures of these decoy states are shown below using the same rainbow coloring as above, with the N-terminus of the protein colored blue, and the C-terminus of the protein colored red:

In each of the decoy structures, all of the α-helices and β-strands are there, but it appears there is some ambiguity about where the helices should go. According to the solutions from the De-novo Freestyle puzzle, the three α-helices can fold in different arrangements around the central β-sheet, and all of these arrangements have similar energies. Since all of these states have similar energy, the protein will not have a strong preference for any single one of them.

Both of fiendish_ghoul's proteins were designed by optimizing their absolute energy, but the protein on the right has a problematic energy landscape. If we made these proteins in the lab, we would expect the protein on the left to be well-folded, and to spend most of its time in the designed state, since it appears to be the only deep valley in the landscape. However, we would expect the protein on the right to be poorly-folded, and to spend its time sampling all the different decoy states discovered by Foldit players.

Check back on Monday for the next blog post, where we’ll discuss these energy landscapes in more detail!

Edit: Read more in Part 2 and Part 3 of this blog series.

Bruno Kestemont Lv 1

August 29, 2018

Very clear and perfect illustrations, thanks!

What is the puzzle number of the de novo for Puzzle 1239? 1511?

To be sure I understood well:
-the left funnel (blue design) is straight and vertical with little horizontal variation, like a deep canyon, which makes the fold very stable at all temperatures? (and the related design starts at the top with hand design in the first days and ends at the bottom with refinement on the last day)
-the orange funnel looks like a deep open valley: it has the same low energy as the canyon, but there the "river" can change its shape too easily, so the design is less stable (and less likely to have a stable function).
-the green design seems more stable than the orange one, but from energy minus 85 it could jump to the very different orange one, which would make its function change too much.

What if you had a vertical "line" isolated on the right? Would it mean that the fold would be better than the original fiendish_ghoul one?

========
btw: did you select fiendish_ghoul's design because it folded well in the laboratory AND no other player could have an idea of the original fold? (blind folding except for fiendish_ghoul)

bkoep Staff Lv 1

August 30, 2018

The energy landscape on the left is from 1511: Unsolved De-novo Freestyle 129; the energy landscape on the right is from 1494: Unsolved De-novo Freestyle 84.

In these discussions, we are not concerned at all with the shape of the valleys in the energy landscape (e.g. how "broad" the valley is or how "steep" its sides are). To think about the energy landscape and partition function of a protein, we only need to know the depth of each valley.

In any case, you probably wouldn't want to make any conclusions about valley shape based on the plots above, where we've projected the entire energy landscape onto a single dimension (i.e. RMSD to the folded state, on the x-axis). Two states which appear close on this axis may not actually be very similar.

For example, looking at the energy landscape on the right, you might be tempted to say that the orange state and the green state are pretty similar to one another, since they both have an RMSD of about 10 Å relative to the folded state. However, we see in the boxes below that their structures are actually very different: the RMSD between the orange and green states is about 7 Å.

To properly visualize the energy landscape of a protein, you would want a dimension for each φ and ψ torsion in the protein backbone. That's 130 dimensions (!) for the 65-residue protein on the right. This is why we say a protein's energy landscape is vast. Because every bond in the protein backbone can rotate independently of all the others, we would need an axis for each torsion to fully define the protein's energy landscape (and even then we're ignoring all the rotatable bonds in the sidechains).
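The arithmetic behind that number, and behind the claim that the landscape is too big to search exhaustively, can be sketched as follows. The choice of three allowed values per torsion is purely an illustrative assumption (real torsions vary continuously), and both function names are hypothetical.

```python
def backbone_torsion_count(n_residues):
    """One phi and one psi torsion per residue, as counted above."""
    return 2 * n_residues

def grid_states(n_residues, bins_per_torsion=3):
    """Number of backbone states if every torsion could take only
    `bins_per_torsion` discrete values (an assumed, very coarse grid)."""
    return bins_per_torsion ** backbone_torsion_count(n_residues)

dims = backbone_torsion_count(65)
states = grid_states(65)

print(dims)              # 130
print(len(str(states)))  # 63 digits, so on the order of 10^62 states
```

Even with this very coarse grid, and still ignoring the sidechains, there are far too many states to check one by one.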

GetOffMyLawn Lv 1

September 02, 2018

I would use vector summation. Assign each of the torsions a unique vector, distributed evenly in "altitude" and "azimuth" (sorry, astronomy guy) and assign a magnitude to each vector that maps to the value of the torsion. [Don't think too hard about it, your brain will just melt :-)] Finally, add up all the vectors, and voila, you get a unique point in 3D space (or even just around a circle, actually). You might also then assign a certain color to each axis to help see it (or a dot for a 2D circle), but I think just drawing the summed vector (from the origin) would be clearest, at least up to a certain density of displayed results. Of course, you must generate a rotatable / zoomable 3D object to properly view the results. (a little digital bling always makes things more fun.)

Useful? Don't ask me, this was my dog's idea. woof.
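For anyone who wants to try the vector-summation idea above, here is one rough way it might look. Everything here is an assumption for illustration: the directions are spread over a sphere with a Fibonacci spiral, the magnitude is the torsion angle rescaled from [-180, 180] degrees to [0, 1], and the two example backbones use only rough textbook helix and strand angles. One caveat: because 130 numbers are squeezed into 3, different sets of torsions can land on the same point, so this is a summary view rather than a unique fingerprint.

```python
import math

def sphere_directions(n):
    """Return n unit vectors spread roughly evenly over a sphere
    (Fibonacci spiral; one possible choice among many)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    directions = []
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n      # "altitude", from near +1 down to near -1
        radius = math.sqrt(1.0 - z * z)
        theta = i * golden_angle           # "azimuth"
        directions.append((radius * math.cos(theta), radius * math.sin(theta), z))
    return directions

def project_torsions(torsions_deg):
    """Sum one scaled direction per torsion to get a single 3D point."""
    x = y = z = 0.0
    for angle, (dx, dy, dz) in zip(torsions_deg, sphere_directions(len(torsions_deg))):
        magnitude = (angle + 180.0) / 360.0  # assumed mapping of angle to length
        x += magnitude * dx
        y += magnitude * dy
        z += magnitude * dz
    return (x, y, z)

# Rough (phi, psi) pairs repeated for 65 residues, i.e. 130 torsions each.
helix_like = [-60.0, -45.0] * 65
strand_like = [-120.0, 130.0] * 65

helix_point = project_torsions(helix_like)
strand_point = project_torsions(strand_like)
print(helix_point)
print(strand_point)
```

Each backbone becomes one point that could be drawn in a rotatable 3D plot, as suggested.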