
So, unsurprisingly, it appears that applying a transformer to multiple sequence alignments extracts somewhat more spatial information about proteins than we had previously been able to squeeze out.

It's pretty clear at this point that the work led to a large improvement in psp scores, but there's literally nothing else groundbreaking about it; I don't mean that in a bad way, except to criticize all the breathless press about applications and pharma.



Well, it did give groundbreaking results; it is weird to see people dismiss it as "not groundbreaking enough".


It was a nice improvement. That's fine. But it's ultimately just statistical modelling based on deep evolutionary information. It only works for homology modelling; it doesn't actually solve the larger protein structure prediction problem. Therefore it's not groundbreaking, but a significant improvement.


I respectfully disagree. AlphaFold 2 demonstrated almost perfect performance for a multitude of proteins for which no meaningful templates were available -- hence, it was not doing homology modelling as it is generally understood, but ab initio protein structure prediction.

What I would support is that AlphaFold 2 does not solve the protein folding problem: how a protein folds, as opposed to what it folds into.


> it was not doing homology modelling as it is generally understood, but ab initio protein structure prediction.

Maybe according to the current definition of the term, which has drifted over the years. Homology modeling and "ab initio" structure prediction have been drifting toward each other for a long time. These days, the categories are separated by an (essentially arbitrary) sequence identity threshold. If you have a protein sequence with high homology to some other protein with a structure, then you're homology modeling. If you have no matches at all, you're doing "ab initio". In the middle, you have a gray area where you can mix the approaches and call it whatever you like.
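To make the threshold idea concrete, here is a minimal sketch of how a pairwise alignment might be sorted into those three buckets. The function names and the 30% / 15% cutoffs are illustrative assumptions, not values used by CASP or by any particular tool; the point is only that the category boundaries are a choice of numbers.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where both sequences have a residue.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def modelling_category(identity: float, high: float = 30.0, low: float = 15.0) -> str:
    """Label a target by identity to its best hit. Cutoffs are assumed, not standard."""
    if identity >= high:
        return "homology modelling"
    if identity < low:
        return "ab initio"
    return "gray area"


identity = percent_identity("ACDE", "ACDF")
print(identity, modelling_category(identity))
```

Moving `high` or `low` by a few points reclassifies targets without changing anything about the underlying problem, which is roughly what "essentially arbitrary" means here.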

This is not a pedantic point. If your method requires homology -- however distant and fragmented -- in order to work, then you're always limited to the knowledge in the database. Maybe we've sampled enough of protein space to get the major folds, but certainly, the databases don't have enough information to get the small details right.

I have never been a huge believer in the idea that we can go directly from protein sequence to protein structure simply using a mathematical model of physics, but that is the original meaning of "ab initio structure prediction", and if you could do it, it would be far more valuable than AlphaFold. At risk of making a trivially nerd-snipable metaphor, it's kind of like the difference between Google Translate and a theoretical model of human intelligence that understands concepts and can generate language. The latter is obviously immensely more capable than the former.


If CASP is calling methods that use any sequence similarity (the grey area) 'ab initio', that's disingenuous and intellectually dishonest.

Ab initio means from nothing, and at most, you're allowed to have physically inspired force fields, not sequence similarity to known structures. I put a lot of effort into improving the state of the art in that area, but ultimately concluded it made more sense to concentrate experimental structural determination in the area where it was most useful: proteins that had unknown folds or no known homology (see https://scholar.google.com/citations?view_op=view_citation&h... for some previous work I did in this area).


> If CASP is calling methods that use any sequence similarity (the grey area) 'ab initio', that's disingenuous and intellectually dishonest.

The category is given the name, not the methods. People can use any method they like to solve the structures. The organizers are not zealots.

The ab initio portion of CASP consists of proteins that the organizers know have low sequence identity to anything in the existing databases. They represent proteins that are "difficult" to solve using what any practitioner might call homology modeling. That doesn't mean that you can't use a method that takes into account the biological databases -- and essentially all of the good methods do!

For example, the Rosetta method has competed in both the homology modeling and the ab initio categories for many years. They mix a bit of both -- using homology models to get the fold, and fragment insertion to model the floppy bits.

I haven't paid close attention to CASP in a long time, but I assume the competitor list still has tons of entries from people who cling tightly to the purist vision of ab initio modeling. They don't tend to do very well.


OK, be aware the person you're correcting has: competed in CASP (on a competitor team with Sali), and published papers with Baker on Rosetta methods (my paper is cited in the most RoseTTA paper).

"They mix a bit of both -- using homology models to get the fold, and fragment insertion to model the floppy bits."

That's the best description of what I believe AF2 is doing, but AF2 is being marketed as not depending on any sequence similarity.

If the CASP folks really are saying "if you have 20% sequence identity and use the structure from that alignment it's ab initio"... that's really just totally misleading.

Of course, even ab initio methods are parameterized on biological information; for example, I used AMBER to do MD simulations and many of the force field terms were determined using spectroscopic data from fragments of biological models. That, however, is ab initio, because nothing even as large as a single amino acid is parameterized.

I'm not saying there's anything wrong with homology modelling, or that the purist vision of ab initio is right. For practical purposes, exploiting subtle structural information through sequence alignment is a very nice way to save enormous amounts of computer time.


> OK, be aware the person you're correcting has: competed in CASP (on a competitor team with Sali), and published papers with Baker on Rosetta methods (my paper is cited in the most RoseTTA paper).

OK, great. Me too. I'm not saying anything controversial here. Right from the top of the "ab initio" tab on predictioncenter.org:

"Modeling proteins with no or marginal similarity to existing structures (ab initio, new fold, non-template or free modeling) is the most challenging task in tertiary structure prediction."


I think the more important question to resolve here is: did AlphaFold change anything with respect to structure prediction that enabled them to make accurate predictions in the complete absence of sequence similarity to proteins with known structure?

My understanding is no, they did the equivalent of template modelling, which uses sequence/structure relationships (that are more subtle than the ones you get from homology modelling).

I'm less interested in reconciling my internal mental model of PSP with CASP's than I am in understanding whether AF2 is somehow able to get all the necessary structural constraints through coevolution of amino acid pairs, entirely without some (direct or indirect) learned relationship based on sequence similarity to known structures (be it even short fragments like helices).
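For readers unfamiliar with what "coevolution of amino acid pairs" looks like as a signal, here is a toy sketch using mutual information between two MSA columns, a classic covariation measure from older contact-prediction work. This is not what AF2 computes; the function names and the four-sequence alignment are made up for illustration, and real methods correct for phylogeny and sampling bias.

```python
import math
from collections import Counter


def column(msa: list[str], i: int) -> list[str]:
    """Residues found at alignment position i across all sequences."""
    return [seq[i] for seq in msa]


def mutual_information(msa: list[str], i: int, j: int) -> float:
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    col_i, col_j = column(msa, i), column(msa, j)
    count_i = Counter(col_i)
    count_j = Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        # Compare the observed joint frequency with the product of the marginals.
        mi += p_ab * math.log2(p_ab / ((count_i[a] / n) * (count_j[b] / n)))
    return mi


# Hypothetical alignment: columns 0 and 1 change together, column 2 is conserved.
toy_msa = ["AKL", "AKL", "GEL", "GEL"]
print(mutual_information(toy_msa, 0, 1))
print(mutual_information(toy_msa, 0, 2))
```

Columns that mutate together score high and are candidates for spatial contact; note that nothing in this calculation touches a known structure, which is exactly the distinction being asked about.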

If they really did do that, and nobody did it before, that's great and I will happily promote the DM work, as it supports what I said when I did CASP: ML and MD will eventually win, although in a way that exploits the rich sequence evolutionary information we have, rather than predominantly by having an accurate force field and good sampling methods.


How could they do ab initio? They depend on multiple sequence alignments.

If I'm mistaken about this then I'll happily take back what I said, but there's no way that AF2 could work without MSAs; therefore, it is not ab initio.

Ah, OK, I checked the paper again. They're working on the "template" category, which means there is structure-sequence information... maybe the CASP organizers consider this ab initio? The paper never mentions anything about ab initio predictions. Is that what you're saying, that template methods are ab initio?


Just in case there is any confusion: there is a difference between available sequences (~300 million in standard protein sequence repositories) and structures (~170k structures in the PDB, perhaps ~120k of which are structurally non-redundant). A large number of CASP14 targets had no available templates; in fact, many of them represented previously unseen topologies. However, all of them had some (in most cases, many) available sequences.

The commonly accepted definition of homology modelling implies using a known structure ("template") as a scaffold to model the protein's topology. Since there are many CASP14 targets without appropriate templates, AlphaFold 2 simply cannot "just do homology modelling".

I do take the point that the correct term is "free modelling" (it does not have, or does not use, any good structure as a template), and not "ab initio modelling" (it uses physics to fold the protein), though. A deep enough MSA is generally a requirement.


Again, it's entirely possible I missed some very subtle point in AF2's system, but my understanding is that each target AF2 predicted had an underlying structural template covering the majority of the domain and the mapping was established through the MSA.

I.e., any MSAs would always include alignments to known protein structures. Are you saying their MSAs don't include alignments to known protein structures?

(The reason I'm asking all this is that if I'm mistaken, then AF2 did do something "interesting", but everything in the paper says that everything they did is template based. If they are just folding proteins using MSAs without alignments to protein structures, that's far more interesting. I don't think they did that.)

edit: I've now reread the paper again, and I believe their claim of making predictions where there is no structural homology is incorrect from a technical perspective. I've communicated this to both the CASP organizers (whom I know) and DeepMind.


Yes: they predict structures using MSAs, without alignments to known protein structures in a majority of the cases.


OK if that's truly accurate, then they did make a significant accomplishment. However, I'm 99% certain (from reading the paper) that they actually do have alignments to structures, but the similarity is very low.

It would help if you could point to one of the alignments they made that has no underlying structure (even a template fragment) support.

I reread the methods section, https://static-content.springer.com/esm/art%3A10.1038%2Fs415...

They train jointly on the results of genetic search and template search. Can you show an example of a prediction made using only genetic search and not template search? Those templates are FASTAs made from PDB files, which, while not homology modelling, is definitely not "ab initio".


I've been in communication with several different teams and leaders at CASP and I've confirmed that this does appear to be the case.

I'm going to be a bit skeptical but if that's the case, then it really is a significant improvement. Glad to see that with just the idea, the academic community was able to reach near parity in a short time, demonstrating there was nothing unique to DM except their huge amount of compute, storage, and talent, and this would have happened in the next CASP anyway.


It's perfectly reasonable to describe a very large improvement as groundbreaking.


It seems amazing to me what the transformer can learn to SOTA levels, not just language but also images, video, code, math and proteins. Replacing so much handmade neural architecture with just one thing that does it all, that was an amazing step forward.


Just for my reference, what percentage of known but unfolded proteins (a wild guess is good enough) would you consider to be ab initio? How many don't have parts in any database?



