Just gonna leave this here. https://www.biorxiv.org/content/10.1101/2024.02.29.5...

jhbadger · on March 14, 2024

I think you are missing what the Evo project is trying to do: create a new prokaryotic genome through a generative model. This would work like the earlier hand-made synthetic genomes such as Synthia (Gibson et al., 2010).

In such a system you would take an existing bacterial cell and replace its genome with the newly synthesized version. The proteins and other molecules from the existing cell would remain (before eventually being replaced) and serve to "boot" the new genome.

pfisherman · on March 15, 2024

Sounds cool, but how do you define success for something like that? I can copy a prokaryotic genome, mutated at some non-zero rate, and it would probably be viable. Is that synthetic enough to count? Are they going for a minimal genome?

How about something more useful, lucrative, and easier to define success for, like engineering a morphine synthesis pathway into E. coli?

Imo, if you are talking about synthetic biology, then their training data is insufficient. Synthetic bio explores a lot of design space that is far outside of anything you would see in nature. There the secret sauce would not be in the generative pretraining, but in the RL. Unfortunately, bio experiments are noisy, slow, and expensive, so good luck getting enough data before the heat death of the universe.
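To make the "copy a genome with some non-zero mutation rate" baseline concrete, here is a minimal sketch. The names `mutate` and `identity` and the toy reference sequence are illustrative assumptions; nothing here comes from the Evo paper, and the code says nothing about whether the mutated sequence would actually be viable.

```python
import random

BASES = "ACGT"


def mutate(genome: str, rate: float, seed: int = 0) -> str:
    """Return a copy of `genome` with independent point substitutions.

    Each A/C/G/T position is replaced by a different base with
    probability `rate`. Other characters (e.g. N) are left untouched.
    """
    rng = random.Random(seed)  # seeded so the toy example is reproducible
    out = []
    for base in genome:
        if base in BASES and rng.random() < rate:
            # Substitute with one of the three other bases.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def identity(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Toy stand-in for a real genome (assumption: real input would be read from FASTA).
reference = "ATGGCTAGCTAGGATCCGATTACA" * 50
variant = mutate(reference, rate=0.01)
print(f"identity to reference: {identity(reference, variant):.3f}")
```

A sequence like `variant` is technically "new", which is why a success criterion based on novelty alone would be easy to game; some functional or distance-from-nature measure would presumably be needed on top.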

nextos · on March 14, 2024

It's an interesting endeavor, but there are some obvious safety concerns.

Within prokaryotes, there is a lot of horizontal gene transfer. What if some of the synthetic sequences get into other organisms and spread?

UniverseHacker · on March 14, 2024

Those genes would have to confer an evolutionary advantage or they would immediately be discarded/selected against. The chances of that happening are nil… we’re not going to come up with something more useful to bacteria than billions of years of natural selection. Synthetic biology with organisms produced to generate small molecules useful for humans is widely practiced but has the opposite problem: all of the engineering changes to the cells are constantly being selected against, and revert on their own.

samuell · on March 14, 2024

I tend to agree (the cell being in control and all the 4D interactions and epigenetic mechanisms etc.), but out of curiosity, what else would you say we need?

COGlory · on March 14, 2024

For starters, chemical environment modeling. But also cells differentiate, so in any system you need to understand the differentiation, and how those differentiated cells will change the environment of other cells, based on the environment they encounter.

That's not to say you can't glean a ton from DNA, but there are some external inputs we may simply never know enough about to incorporate into the model. Ultimately DNA IS all you need...if you have perfect environmental information.

pfisherman · on March 14, 2024

The article I posted shows what is working better - the Olga Troyanskaya / David Kelley style models. There was another one (Kundaje group?) recently that used Hi-C data.

t_serpico · on March 14, 2024

https://onlinelibrary.wiley.com/doi/10.1002/bies.201300153 tl;dr: metabolism is all you need.

while potentially interesting work, it is very shortsighted and premature to say this is a "GPT" moment in biology. ML people in bio need to think hard not only about what they are doing, but why they are doing it (other than this is cool and will lead to a nice Nature publication). Their basic premise (learning from DNA is the next grand challenge in biology) is shaky. Imo, the grand challenge in biology is determining what the grand challenge is, and that is a deep scientific/philosophical question.

dekhn · on March 14, 2024

most of the examples in that paper (a single paper) show that DNA is nearly all you need, with the rest being RNA.

pfisherman · on March 14, 2024

RNA is an obvious example. The examples and benchmarks they give in the paper are not the straw men the DNA LLMs are beating the stuffing out of.

Also CRE activity is highly cell-type-specific. This article is a pretty awesome demonstration of model-guided design of cell-type-specific cis-regulatory elements.

https://www.biorxiv.org/content/10.1101/2023.08.08.552077v1

An LLM would not be able to do this because DNA itself contains no contextual information about cell type: every cell has a copy of the full genome. Epigenetic tracks, however, contain a lot of information germane to the cellular context, e.g. which parts of the genome are being transcribed.

dekhn · on March 14, 2024

but epigenetics is just DNA. it's state information stored directly in the DNA, or in directly attached machinery. from the perspective of learned models, those are just other features.

But realistically, the right source for transcription is the RNA in the cell, not the epigenetics. Nearly all cell type profiling is based on RNA. It's far easier and more reliable to interrogate the transcriptome than to try to gain info from epigenetic states.

pfisherman · on March 14, 2024

Epigenetics is not just DNA; think of it more like the (hidden) state of DNA. Histone modifications, open chromatin, and other epigenetic readouts are like emissions / indicators of the hidden state.

The relationship is like that between the words in a book and the page that is actively being read. I know that is a hackneyed analogy, but the coffee is wearing off :)

dekhn · on March 14, 2024

Those are all readable using standard DNA sequencing techniques, so again, it's just state attributes of the DNA.

(I've worked in genomics for 30+ years. I'm not just spitballing here).

pfisherman · on March 15, 2024

DNA libraries are read out using sequencing techniques. But I doubt anyone would say they are just measuring DNA. It’s kind of like saying a luciferase-based assay is just measuring the intensity of light.

dekhn · on March 15, 2024

yes, that is the terminology we use.