VHL puzzle series paper preprint released!

Started by rmoretti

rmoretti Staff Lv 1

We're happy to announce that the VHL puzzle series results have been written up and are now available as a preprint on ChemRxiv. Please do note that ChemRxiv is a preprint server, not a scientific journal, so the paper is a draft and has not been peer reviewed yet. We're looking into journals for the official publication, but we wanted to give the Foldit community a chance to see the paper (and provide feedback) as soon as we could.

Ligand 10 was the starting point; the "bound diastereomer" is the molecule that was found in the determined crystal structure. Panel a shows the experimentally determined structure of the tested compound. Panel b shows the comparison with the starting structure. Panel c shows the comparison with the as-designed molecule in the context of the protein. Panel d shows the comparison of the designed molecule with the starting molecule.

A shoutout to Nicm25, who made the design on which the successful molecule was based. There were a few modifications to the design to ease synthesis, but the core idea of the molecule came from a Foldit player.

While the strength of binding of this compound is not as good as we might have hoped, we're anticipating that with the publication of the structure of this new compound other researchers can take the core idea and further refine it. (The starting molecule was the end result of an extensive series of optimizations.) The experimental structure of the ligand-protein complex has been submitted to the Protein DataBank, and should eventually be released as entry 8P0F.

Thanks once again to our collaborators at Boehringer Ingelheim (BI), who helped to evaluate the compounds to test, and who handled the laborious process of synthesizing, testing and determining the crystal structures for these compounds.

If you have any comments or suggestions about the paper, please feel free to send me (rmoretti) a PM about it. We're interested in incorporating any improvements prior to final publication.

jeff101 Lv 1

I tried downloading the 40 MB SDF file for "All Player Designed Compounds" from the link you listed above: https://doi.org/10.26434/chemrxiv-2023-lczzd

When I clicked the download icon, Microsoft Edge tried to open the 40 MB SDF file as if it were an ordinary ASCII text file, but the file seemed too large for Edge to load completely. Is there another way to make the 40 MB SDF file available to us, perhaps as a compressed version? I was able to download the other files without problems.

Thanks for posting this paper for us to preview it.

rosie4loop Lv 1

I was able to get the 40 MB file via Firefox. After clicking the link, use "Save page as…" to download the file.

In the browser only 21 MB loaded when I copied it into a text file, but the "Save page as…" option gives the whole file.

Bruno Kestemont Lv 1

Thanks for the preview. Hopefully you can publish it.
What I found interesting: "The completion of one full design cycle is a proof of concept for the Drugit approach and highlights the potential of involving citizen scientists in early drug discovery"

-but it seems to give 1 good idea per thousands of ideas, which seems close to random (luck)
-this can show the potential of crowd research (expanding the potential number of ideas)
-but on the other hand, a lot of expert evaluation was needed to identify this: there might be room for improvement in the evaluation phase (automating or enhancing the Foldit algorithm)
-this is discussed in the first part of the paper (improving the "filters")
-I found the statistics on players interesting (S2 & other figures in the supplementary material)

As a player, I'd like to know the following:
-who are the players behind the 19 selected ligands?
-how is it possible that some players submitted 100+ ideas? Is it via "share with scientists" (over several rounds, that makes about 10 shares per round), or else because you automatically captured intermediate results?
-I don't fully understand S4. Non-binding ligands have a wider range of solutions but also the highest scores (consistent with what we see in-game).
-Do these scores include the filter bonus?
-The in-game scoring function seems to favour binding ligands; is that desirable? Is a non-binding ligand also a good candidate drug?
-The expert "playing during his free time" designed good candidates, non-binding; was he/she purposely looking for "non-interesting, non-binding" high-scoring solutions in order to challenge the Drugit algorithm?
-You seem to be concerned with the ligand itself. However, very different scores are seen in-game for the same ligand. Slightly different positions can give huge differences in scores. What is the role of good positioning of the ligand for science?
-What is the role of Libraries in this context? Libraries are "perfect" molecules from this paper's perspective, aren't they?

Maybe interesting to mention in the paper:
-Foldit evolved from discovering 3D structures (circa 2012), but with AlphaFold (AI) this is less useful today;
-at that time, ideas from "recipes" converged into the Rosetta algorithm (it's not clear how the recipes influenced Rosetta afterwards)
-then electron density puzzles based on crystallography gave interesting results, but it looks like players' skills are less useful for this purpose today; why?
-then players found original "ideas" for de novo proteins; however, with players influencing each other, and due to the scoring system, ideas converged to the same kinds of proteins (3 helices, surfing dogs, etc.)
-then players were asked to find binding proteins (I don't remember if any publication resulted from this)
-lately, you seem to look for "out of the box" ideas from players concerning ligands
-how do you see the future of players' contributions?

Another concern: groups versus blind soloists.
-the best solution came from a pure "blind" soloist (see his/her homepage)
-long ago, the best-scoring solutions always came from groups and evolvers
-evolvers are less encouraged today with the new scoring rule
-particularly for de novo designs (including binding proteins & small molecules), it's obvious that the "collective intelligence" of groups could lead to convergence on a (possibly non-original, "locked-in") solution. For small molecules, it's even relatively easy to visually copy a good-scoring ligand (seen on the wiki "results" page, or via in-group sharing), even if the position of the ligand in the protein still gives a lot of very different scores for the same ligand
-even within a group, we tend to work "mainly blind", because only the best-scoring solutions are in practice used by other players; moreover, for ligand puzzles, the best-scoring solutions are not always shared with the group (because we are still competing as soloists). The same goes for favourite recipes (there is no "absolute best" recipe; all players use their own preferred ones in personal ways). Thus, even if groups favour some collective intelligence and "convergence of ideas", there remains a good balance of "out of the box" thinking (from beginners to experienced players).

Thanks for reading

Nicm25 Lv 1

Thanks for the preview, I checked that preprint.

  • (unnamed section on page 14) I can't find an explanation that refers to Figure 1(c).
    Should it be Figure 1(b)? I'm not sure; a small erratum.
  • (Analysis of compound source, page 8) It's written that I participated in 2020, but more precisely I have more than 10 years of experience with Foldit.

I was inactive for a while until the COVID season at that time, and somehow my user data already includes an unused 'ncm25_OLD' account.
It was previously also used to test recipes and other functions in devprev; ignore it, because otherwise I would lose my standing as a pure soloist.

To everyone: the content in my profile is for the community among players; I don't expect it to influence the Foldit game rules at all, that's their policy.
In the interest of fairness, I myself did not want to be published under my player name either (not that I refused; ideally "anonymous but identifiable").
That in itself will lead to someone confirming my methods, which is fine for investigating methods and finding out what works better, but we would not be soloists if we simply mass-produced solutions by copying methods; I would welcome doing so outside of Foldit and games.
I would also like to reiterate that I have no formal medicinal chemistry experience myself, so even if you want to adopt my methods, I recommend verifying them carefully; know that I am only working from experience, nothing is foolproof, and I cannot guarantee accuracy.
I am starting to apply this to my own projects that have nothing to do with Foldit, blind-style: I forget everything I know (or pretend it never happened) and don't use it at all, to conform with patents, copyrights, and other licenses.
What I'm saying is, I just don't want to run into a situation where someone forces changes to the way other players do this.
All right, that's not something average players need to worry about.
Thanks for reading.

rosie4loop Lv 1

I'm not involved in this paper; I hadn't started playing Foldit yet last year. But maybe I'll try to briefly answer some questions here based on my knowledge, since they're frequently asked by many people who start doing structure-based drug design.

but it seems to give 1 good idea/thousands of ideas, which seems to be close to random

  • That's science! It's about serendipity, not luck! That's what my seniors always tell me when I encounter failures.
  • Practically in virtual screening trials we evaluate millions or billions of compounds then select tens to hundreds of them based on score or other criteria for experimental validation. It's possible that none of them are actual binders.

What is the role of Libraries in this context

  • Generally they're "realistic" molecules. For example, entries from the Enamine REAL database can be purchased directly from vendors at a fixed price. No need to do the synthesis in that case, so it's easier to do the validation.
  • See the original blogpost from the developers for details.

Slightly different positions can give huge differences in scores. What is the role of good positionning the ligand for science?

  • Theoretically the ligand has the highest probability to be found in the best position, so the interaction here is important for a strong and specific binding. Not the only factor that affects the binding, but one of the key factors.

(Edit: one of the explanations was missing when I first pasted this here)

rosie4loop Lv 1

Question on figure S3:
Why use the 2Fo-Fc electron density map instead of an OMIT map to show the ligand density? The 2Fo-Fc map is known to suffer from model bias.
What is the contour level of the map?

rosie4loop Lv 1

An additional note after checking the "all players design/VHL all compounds" SDF file. Since the 160K Foldit scores in round 5 are due to a glitch, maybe it would be better to remove the round 5 entries for publication?

Although the artistic spacestations are cool.
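For what it's worth, dropping those records is straightforward, since SDF entries are plain-text blocks separated by `$$$$` lines. A Python sketch; the property tag name `Round` is a guess, so check the actual tags used in the file:

```python
def filter_sdf(text, tag="Round", exclude="5"):
    """Keep only SDF records whose <tag> property does not equal `exclude`.

    SDF records are text blocks separated by "$$$$" lines. The tag name
    "Round" is an assumption -- check the actual property tags in the file.
    """
    kept = []
    for rec in text.split("$$$$\n"):
        if not rec.strip():
            continue  # trailing empty chunk after the last $$$$
        lines = rec.splitlines()
        value = None
        for i, line in enumerate(lines):
            if line.strip().startswith(f"> <{tag}>"):
                # the property value sits on the line after the tag header
                value = lines[i + 1].strip() if i + 1 < len(lines) else None
                break
        if value != exclude:
            kept.append(rec)
    return "$$$$\n".join(kept) + "$$$$\n"
```

A toolkit like RDKit would do the same job with proper chemistry-aware parsing; this is just the minimal text-level version.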

rmoretti Staff Lv 1

@"Bruno Kestemont"

but it seems to give 1 good idea/thousands of ideas, which seems to be close to random (luck)

This is indeed a limitation of the entire process. An N=1 successful result doesn't necessarily give us a great confidence interval on the typically achievable success rate.

However, the BI collaborators think the results are useful for their "idea generation" potential. The core of the successful ligand is something they typically wouldn't consider, so it gives a novel starting point for further exploration. (This actually matches typical drug development pipelines, where in the early stages you scrounge around for a by-chance compound that gives a small amount of activity, and then spend a bunch of effort refining that compound into something that's a better drug. From that perspective, 1/19 compounds is a much better success rate than the typical high-throughput screen conventionally used.)

how is it possible that some players submitted 100+ ideas

If you play online, the client periodically sends the structure you're working on (or rather your best structure in the past X minutes) to the server. For the VHL puzzles, we pooled all the structures people sent to the server: top scoring for the round, share-with-scientist and the automatic "in progress" uploads.

I don't fully understand S4

This was just to show how the ligands do in redocking – that is, can Rosetta predict which compounds will be successful? The upshot was that it's not particularly predictive, at least for the compounds which have made it through to the end selection.

The scoring in S4 is specifically the RosettaLigand predicted binding energy. It looks just at the protein-ligand interface energy and does not contain any Foldit filter bonuses.

is a non-binding ligand also a good candidate drug

No. The ligands have to bind to the protein in order to function. Compounds that don't bind aren't useful, except perhaps if you can figure out why they don't bind and fix them such that they do.

The expert "playing during his free time" designed good candidates, non binding; was he/her purposely looking for "non interesting non binding" high scoring solutions in order to chalenge the Drugit algorithm?

I don't think so. My understanding is that he was playing the game in earnest, like other Foldit players. It's just that when designing he was designing with the "standard" medchem ideals in mind, such that it was much more likely that the compounds would be picked on the backend by the expert medicinal chemists. I'm not sure if that necessarily means that the compounds he designed were "better" than what other people designed (they did not work, after all), or if they simply matched the preconceptions of what a compound "should" be and were more likely to be selected on that basis.

What is the role of good positionning the ligand for science?

A good binding ligand has to be well positioned in order to bind - but it has to do that in the test tube. However, what the molecule does in the computer may or may not have anything to do with what it does in the test tube. We hope that our computational models are good enough to be predictive of what the compound does in the test tube, but at the end of the day it's what happens experimentally that matters, not what happens computationally. The computational scores are just our best-effort prediction of what is actually going to matter. (In fact, the BI scientists doing the compound selection didn't use Foldit scores to select molecules. The Foldit scores were important in guiding which molecules were generated by the players, but once the molecules were generated we used other evaluation metrics to actually select which ones were tested.)

What is the role of Libraries in this context ?

The VHL puzzles didn't use the Compound Library. All the compounds that were tested were custom synthesized, without use of compound libraries.

Thanks for the other suggestions. We'll keep them in mind when editing for submission.