FunSearch: Making new discoveries in mathematical sciences using LLMs

jcgrillo · on Dec 14, 2023

To what extent is an LLM necessary here? As far as I can tell (and perhaps I haven't looked closely enough yet) the purpose of the LLM here is to generate things that look plausibly like python functions conforming to a given type signature.

But it should be possible to generate random, correct python functions conforming to a given type signature without an LLM. This would be an exercise like [1], just with a substantially more complex language. But might a restricted language be more ergonomic? Something like PushGP [2]?

So I guess my questions would be:

(1) What's the value add of the LLM here? Does it substantially reduce the number of evaluations necessary to converge? If so, how?

(2) Are other genetic programming techniques less competitive on the same problems? Do they produce less fit solutions?

(3) If a more "traditional" genetic programming approach can achieve similar fitness, is there a difference in compute cost, including the cost to train the LLM?

[1] http://www.davidmontana.net/papers/stgp.pdf [2] https://faculty.hampshire.edu/lspector/push.html

janalsncm · on Dec 15, 2023

> it should be possible to generate random, correct python functions conforming to a given type signature without an LLM

The state space of viable programs is far, far larger than useful ones. You need more than monkeys and typewriters. The point of using Palm2 here is that you don’t want your candidates to be random, you want them to be plausible so you’re not wasting time on nonsensical programs.

Further, a genetic algorithm with random program generation will have a massive cold start problem. You’re not going to make any progress in the beginning, and probably ever, if the fitness of all of your candidates is zero.

jcgrillo · on Dec 15, 2023

That's an interesting hypothesis, but the work under discussion here didn't make an attempt to test it. You make two claims:

(1) That using an LLM (Palm2) somehow "saved time" by avoiding "nonsensical programs"

(2) That there exists a "cold start problem" (which, presumably, the LLM somehow solves).

Yet not the paper, nor the blog post, nor the code on github support these.

janalsncm · on Dec 15, 2023

The fact that LLMs generate text which is better than random is not something the authors need to spell out. Reducing the cross entropy between predicted and target distributions is the entire purpose of training.

Put another way: a flat probability distribution from an untrained network is equivalent to the random decoding you’re skeptical an LLM is better than. That’s not a criticism the authors’ peers would likely put forward.

YeGoblynQueenne · on Dec 16, 2023

>> Reducing the cross entropy between predicted and target distributions is the entire purpose of training.

Nope. Test distribution. Not target distribution.

janalsncm · on Dec 16, 2023

I’m not sure why you felt the need to correct me. Target is the correct terminology. As a sanity check I even looked up the PyTorch doc which explicitly uses “target” in their example.

https://pytorch.org/docs/stable/generated/torch.nn.CrossEntr...

YeGoblynQueenne · on Dec 16, 2023

Because the target distribution is not known, whatever PyTorch docs say. The test distribution is known and nobody knows what's its relation to the target distribution except a hope and a prayer.

famouswaffles · on Dec 15, 2023

What paper are you reading exactly ?

There's a clear comparison

https://imgur.com/a/xerOOLn

jcgrillo · on Dec 15, 2023

Same paper you are. This is comparing apples to oranges. Their "random" strategy is only capable of assembling very constrained python programs whereas the LLM is coming up with things that look like the sort of stuff a person might write, utilizing the whole language (modulo hallucinatory nonsense). So it's not a reasonable comparison at all.

Instead, for the purpose of making something resembling a reasonable comparison, it would make a lot more sense to train an LLM on programs written in the same language they allow their "random" strategy to use. I wouldn't recommend trying this in python, as it's way too big a language. Something smaller and more fit for purpose would likely be much more tractable. There are multiple examples from the literature to choose from.

EDIT: To be clear--it's obvious the LLM solution outperformed the hand-rolled solution. By a large margin. What would actually be an interesting scientific question, though, is to what extent is the success attributable to the LLM? That's not possible to answer given the results, because the hand rolled-solution and the LLM solution are literally speaking different languages.

janalsncm · on Dec 15, 2023

The purpose of the ablation is to demonstrate the value-add of using an LLM. Comparing to another type of candidate generator wouldn’t be ablation anymore, it would just be another study.

As I have mentioned previously, using an LLM addresses the cold start problem by immediately generating plausible-looking programs. The reason random token selection is limited to constrained programs is that valid programs are statistically unlikely to occur.

And the ablation directly demonstrates that the LLM is better than random.

YeGoblynQueenne · on Dec 16, 2023

Genetic programming algorithms don't start cold just as FunSearch doesn't start cold. It starts with a partial Python program with "missing" lines that the system has to fill-in. That's pretty much schema-guided program synthesis, to a t, down to the generate-and-test approach.

Unfortunately you won't find any references to that kind of earlier work in DeepMind's paper (as in most of their papers), so here's a paper I found on the internets with six second search that explains what that is:

Intuitively, a program schema is an abstraction of a class of actual programs, in the sense that it represents their data-flow and control-flow, but neither con- tains all their actual computations nor all their actual data structures. Program schemas have been shown to be useful in a variety of applications. In synthesis, the main idea is to simplify the proof obligations by taking the difficult ones offline, so that they are proven once and for all at schema design time. Also, the reuse of existing programs is made the main synthesis mechanism

https://ethz.ch/content/dam/ethz/special-interest/infk/inst-...

See section 4, "Schema-guided synthesis" page 41.

janalsncm · on Dec 16, 2023

The original question was “why would they use an LLM here”. Whether candidates are an entire program or just a portion of the program is immaterial to that question. In fact, it is an example of a situation where random decoding is especially poorly matched against an autoregressive model, since there’s already context to use.

Anyways, you can see there is a cold start issue in the ablation study since the random strategy didn’t even begin to register until about halfway through the test.

YeGoblynQueenne · on Dec 16, 2023

Of course if you start without any inductive biases (i.e. a program schema in the case of the FunSearch approach) there will be a "cold start" problem, but FunSearch doesn't start cold, as genetic programming doesn't start cold.

The question is the "cold start issue". That's what I'm addressing. Maybe you're not used to being corrected? Get used to it.

nybsjytm · on Dec 15, 2023

The found function is in here: https://github.com/google-deepmind/funsearch/blob/main/cap_s.... I'm not very familiar with genetic algorithms but it's pretty hard for me to imagine that they couldn't come up with this. I'd be surprised if too many people (if any) have tried.

On the other hand, as observed in Appendix A.2 of this paper, the non-LLM genetic approach would have to be engineered by hand more than the LLM approach.

jcgrillo · on Dec 15, 2023

I guess what I'm really trying to get at is this seems like a huge missed opportunity to show their LLM is actually doing something cool. Like if it somehow significantly outperforms established genetic programming update strategies that should be pretty easy to demonstrate given their experimental setup, but no attempt was made. That's kinda bizarre... like, in what sense is this an advancement of the field?

resource0x · on Dec 15, 2023

They need to show something to keep the excitement around Google's upcoming AI alive. Expect a slew of (non-)discoveries to continue unabated going forward.

nopinsight · on Dec 15, 2023

Genetic algorithms, even with programmed constraints, will come up with many non-sensical programs, although they could be mostly syntactically correct (given sufficient effort).

The difference LLMs make here is to limit the space of possible mutations to largely semantically plausible programs.

The above should answer your 1) and 2).

On 3), a trained LLM is useful for many, many purposes, so the cost of training it from scratch after amortization wouldn't be significant. It may cost additionally to fine-tune an LLM to work in the FunSearch framework but fine-tuning cost is quite minimal. Using it in the framework is likely a win over genetic programming alone.

jcgrillo · on Dec 15, 2023

> The difference LLMs make here is to limit the space of possible mutations to largely semantically plausible programs.

Yet, sadly, the work under question here didn't make any attempt to show this, AFAICT. It's an interesting hypothesis, but as yet untested.

> a trained LLM is useful for many, many purposes, so the cost of training it from scratch after amortization wouldn't be significant. It may cost additionally to fine-tune an LLM to work in the FunSearch framework but fine-tuning cost is quite minimal. Using it in the framework is likely a win over genetic programming alone.

Again, an interesting hypothesis that probably could be investigated with the experimental apparatus described in the paper. But it wasn't. Have you?

nopinsight · on Dec 15, 2023

>> The difference LLMs make here is to limit the space of possible mutations to largely semantically plausible programs.

> Yet, sadly, the work under question here didn't make any attempt to show this, AFAICT. It's an interesting hypothesis, but as yet untested.

It would be ideal to do that. But given the way LLMs work, it is almost a given. This could be the reason it didn't occur to the researchers who are very familiar with LLMs.

>> a trained LLM is useful for many, many purposes, so the cost of training it from scratch after amortization wouldn't be significant. It may cost additionally to fine-tune an LLM to work in the FunSearch framework but fine-tuning cost is quite minimal. Using it in the framework is likely a win over genetic programming alone.

> Again, an interesting hypothesis that probably could be investigated with the experimental apparatus described in the paper. But it wasn't. Have you?

Perhaps it's quite evident to people in the trench, as the hypothesis seems very plausible to me. A numerical confirmation would be ideal as you suggested.

Given my management experience, I'd say it's smart of them to focus their limited resources on something more innovative in this exploratory work. Details can be left for future work or worked out by other teams.

jcgrillo · on Dec 15, 2023

> almost a given

Not a scientifically compelling argument.

> limited resources

What? This is Google we're talking about. Right?

nopinsight · on Dec 15, 2023

Limited resources: time of highly competent researchers, management time

Science takes time. A single paper needs not completely answer everything. Also, in many fields, it’s common to find out what works before why it works.

anonzzzies · on Dec 15, 2023

Inductive program synthesis stood still for decades because the space is far too large to get anything beyond absolutely trivial programs; LLMs vastly trim (in many wrong ways too of course) the search space and then you can apply ips to finetune and test. You cannot do this (not in any way we know of) without LLMs because it just tests billions of programs that make absolutely no sense even for trivial cases.

YeGoblynQueenne · on Dec 16, 2023

Quantify "vastly trim" for me please. All I see is a firehose of compute that spams and spams millions of samples until it hits a lucky guess. That's not trimming anything, it's the computing equivalent of the human wave attack.

But, please quantify "vastly trim".

Edit: the subject of my PhD research was an inductive program synthesis approach and while I agree with the bit about the size of program search spaces, LLMs are very clearly not a solution to that. Deepmind's "solution" is instead their massive computational capacity which makes it possible to search deeper and wider in the search space of Python programs.

FunSearch is like one of those rocket cars that people make once in a while to break land speed records. Extremely expensive, extremely impractical and terminally over-specialised to do one thing, and do that thing only. And, ultimately, a bit of a show.

anonzzzies · on Dec 16, 2023

Well the approach we use (and has been used at MS research before gpt was a thing) is to ask the llm to provide guesses for instructions to get to a (sub) solution and then let ips (in prolog in our case) do the rest. It seems to work for what we are doing.

Disclaimer: did my dissertation on ips (with genetic programming), but in the 90s when we basically had no computer power. The results were horrible outside trivialities, now they aren't anymore.

YeGoblynQueenne · on Dec 16, 2023

Well, if you're willing to use Prolog I commend you, but then you don't need an LLM because you don't need to do a dumb search of a large space. You can instead use second-order Resolution for induction. Here's how it's done (a.k.a. shameless plug of my PhD work):

https://github.com/stassa/louise/blob/master/data/examples/a...

>> Disclaimer: did my dissertation on ips (with genetic programming), but in the 90s when we basically had no computer power. The results were horrible outside trivialities, now they aren't anymore.

But that's because the training data and the model parameters keep increasing. Performance keeps increasing but not because capabilities improve, only because there's more and more resources spent on the same old problems. There's better ways than that. All the classical AI problems, planning, verification, SAT, now have fast solutions thanks to better heuristics, not thanks to more data (that are useless to them anyway). That's a real advance.

I still would like a quantification of the "vast trim" btw. Numbers, or it didn't happen.

anonzzzies · on Dec 16, 2023

> but then you don't need an LLM because you don't need to do a dumb search of a large space

we do, because we use our prolog ips library, but it's doing this for another language (a subset of python)

> I still would like a quantification of the "vast trim" btw. Numbers, or it didn't happen.

Agreed, I will get numbers and until then it didn't happen.

YeGoblynQueenne · on Dec 16, 2023

Interesting. Do you have a transpiler from Prolog to Python-subset? I would have preferred to use the LLM to do the translation, instead, and let the Prolog-based IPS (is it based on bottom clause construction?) do what it does best.

optimalsolver · on Dec 17, 2023

Sorry, for those of us unfamiliar with Prolog, can you explain in layman terms how you do program synthesis without a large search of program space?

I’m also curious if this scales to real world problems.

YeGoblynQueenne · on Dec 20, 2023

Oops, sorry, I just saw your comment.

I'm working on a paper about this but I guess that won't be "layman terms". Let me try to give a simple explanation. I'm assuming "layman" means "layman programmer" :)

In the simplest of terms that I think can be understood by a layman programmer then, second-order SLD-Resolution in Meta-Interpretive Learning (the method I discuss above) doesn't need to search for a program in a large program space because it is given a program, a higher-order logic program that is specialised into a first-order program during execution.

Without having to know anything about the execution of Prolog programs you can think of how a program in any programming language is executed. The execution doesn't need to search any program space, it only needs to calculate the output of the program given its input. The same goes for the execution of a Prolog program, but here the result of the execution is a substitution of the variables in the Prolog program. In the case of a higher-order program, the substitution of variables turns it into a first-order program (that's what I mean when I say that the higher-order program is "specialised").

Here's an example (using Louise, linked above). The following are the training examples and first- and second-order background knowledge to learn the concept of "even". The terminology of "background knowledge" is peculiar to Inductive Logic Programming, what it really refers to is the higher-order program I discuss above:

  ?- list_mil_problem(even/1).
  Positive examples
  -----------------
  even(4).
  
  Negative examples
  -----------------
  :-even(3).
  
  Background knowledge (First Order)
  ----------------------------------
  zero/1:
  zero(0).
  
  prev/2:
  prev(1,0).
  prev(2,1).
  prev(3,2).
  prev(4,3).
  
  Background knowledge(Second Order)
  ----------------------------------
  (M1) ∃.P,Q ∀x: P(x)← Q(x)
  (M2) ∃.P,Q,R ∀.x,y: P(x)← Q(x,y),R(y)
  
  true.

Given this problem, Louise returns the following first-order program:

  ?- learn(even/1).
  even(A):-zero(A).
  '$1'(A):-prev(A,B),even(B).
  even(A):-prev(A,B),'$1'(B).
  true.

If you squint a bit you'll see that the clauses that make up this first-order program are the clauses of the second-order "background knowledge", with their existentially quantified variables substituted for predicate symbols: 'even', 'zero' and 'prev'. Those symbols are taken from the first-order background knowledge (i.e. the first-order subset of the higher-order program) and the examples. There is an extra symbol, '$1', which is an invented predicate symbol, not found in the background knowledge or examples and instead automatically generated during execution. If you squint a bit harder you'll see that the clause '$1'(A):-prev(A,B),even(B). with that symbol in its head is a definition of the concept of "odd". So Louise learned the concept of "even" by inventing the concept of "odd".

The natural question to ask here is, I suspect, couldn't you do the same thing just by filling-in the blanks in the second-order clauses in the background knowledge, as if they were dumb templates? You could, and in fact that's the done thing in inductive program synthesis. But that's when you end up searching the space of all logic programs (i.e. the space of sets of first-order clauses). There's a whole bunch of programs that you could construct that way and you'd have to decide which to keep. To do that you'd have to construct each of them, so you'd have to construct the powerset of all constructible clauses. Powersets grow exponentially with the size of their base set, leading to combinatorial explosion. The win of using SLD-Resolution is that it only needs to find substitutions of variables, which can be done efficiently, and it only returns those substitutions of the variables in the second-order program that prove the positive examples (and disprove the negative ones). So it's always right (SLD-Resolution is sound and complete).

>> I’m also curious if this scales to real world problems.

Depends on what you mean "real world problems". The largest program I've learned with Louise is a 2500-clause program, but that's for a dumb problem that only serves as a benchmark (the problem is to find every way to go from a point A to a point B on an empty grid-world; it's surprisingly combinatorially hard and other methods die before they get anywhere near the 2500 clause mark; because the program search space blows up immediately).

In truth, real-world Prolog programs rarely need to be much bigger than a handful of clauses, like five or six. At that point, like in any programming language, you break your program up into smaller sub-programs, and those can be learned easily enough (and not just by second-order SLD-Resolution, to be fair). The challenge is to figure out how to do this breaking-up automatically. Meta-Interpretive Learning with Second-Order SLD-Resolution can do it up to a point, with predicate invention, as in the example above and also because any program learned can go directly into the "background knowledge" to be reused in a new learning session. But that still hasn't been done in a systematic manner.

Right now I'm working on a project to grant autonomous behaviour to a robot boat used in search-and-rescue missions. This is a fairly large problem and there are certainly challenges to do with efficiency. In fact efficiency is the main challenge. But from where I'm standing that's an engineering challenge.

So to answer your question: oh yes ^_^

optimalsolver · on Dec 25, 2023

Thanks for the detailed answer.

And congrats on finishing the thesis!

samth · on Dec 17, 2023

This is not correct about the current state of the art in program synthesis, which is a field that's made a lot of progress recently. Unfortunately people coming from the AI world tend to ignore that work entirely and imagine they're the only people with new ideas since the 80s.

mejutoco · on Dec 15, 2023

> (1) What's the value add of the LLM here? Does it substantially reduce the number of evaluations necessary to converge? If so, how?

I thought that stochastic (random) gradient descent and LLM were converging much quicker than genetic programming. Definitely much quicker than random search.

nybsjytm · on Dec 14, 2023

Some important context: the discovery is that a certain number in combinatorics is now known to be between 2.2202 and 2.756, not just between 2.218 and 2.756 as discovered last year. The improvement is by finding some particular sequences of numbers which have some special properties, not by a logic-heavy mathematical proof. (That doesn't mean it's unrigorous.)

But it's an interesting and possibly useful method of coming up with examples, pretty much a genetic algorithm with LLMs.

ignoramous · on Dec 14, 2023

Related commentary on "self-play" by Subbarao: https://twitter.com/rao2z/status/1728121216479949048

From the article:

  FunSearch uses an evolutionary method powered by LLMs, which promotes and develops the highest scoring ideas. These ideas are expressed as computer programs, so that they can be run and evaluated automatically.

  The user writes a description of the problem in the form of code. This description comprises a procedure to evaluate programs, and a seed program used to initialize a pool of programs.

  At each iteration, FunSearch selects some programs from the current pool. The LLM creatively builds upon these, and generates new programs, which are automatically evaluated. The best ones are added back to the pool of existing programs, creating a self-improving loop.

For websearch, I (evaluator) use pplx.ai and phind.com in a similar manner. Ask it a question (seed) and see what references it brings up (web links). Refine my question or ask follow-ups (iterate) so it pulls up different or more in-depth references (improve). Works better in unearthing gems than sifting through reddit or Google.

Given Tech Twitter has amazing content too, looking forward to using Grok for research, now that it is open to all.

alphabetting · on Dec 14, 2023

https://twitter.com/gfodor/status/1735348301812383906

>If DeepMind just definitively proved neural networks can generate genuinely new knowledge then it’s the most important discovery since fire.

If this were actually the case why wouldn't everyone be talking about this? I am impressed it was done on Palm 2 given that's less advanced than GPT-4 and Gemini. Will be wild to see what the next few generations of models can do utilizing methods like this.

Imnimo · on Dec 14, 2023

The heavy lifting here is done by the evolutionary algorithm. The LLM is just being asked "propose some reasonable edits to these 20 lines of python" to replace a random mutation operator. It feels generous to credit the neural network with the knowledge generation here.

It's also highly dependent on the nature of the problem, even beyond the need to have the "hard to generate, easy to evaluate" structure. You have to be able to decompose the problem in such a way that a very short Python function is all that you want to evolve.

nopinsight · on Dec 14, 2023

Having the right hunches on what to try based on accumulated knowledge & experiences is a key thing that distinguishes masters from apprentices.

A fun story from a UCLA math PhD: “terry tao was on both my and my brother's committee.

he solved both our dissertation problems before we were done talking, each of us got "wouldn't it have been easier to...outline of entire proof"”

https://twitter.com/AAAzzam/status/1735070386792825334

Current LLMs are far from Terence Tao but Tao himself wrote this:

“The 2023-level AI can already generate suggestive hints and promising leads to a working mathematician and participate actively in the decision-making process. When integrated with tools such as formal proof verifiers, internet search, and symbolic math packages, I expect, say, 2026-level AI, when used properly, will be a trustworthy co-author in mathematical research, and in many other fields as well.”

https://unlocked.microsoft.com/ai-anthology/terence-tao/

Imnimo · on Dec 14, 2023

Do you think the role the LLM plays in this system is analogous to what Tao is talking about?

nopinsight · on Dec 14, 2023

What Tao does when proposing an idea most likely encapsulates much more than what a current LLM does. I’m no Terence Tao but I sometimes come up with useful ideas. In a more complex case, I revise those ideas in my head and sometimes on paper several times before they become useful (analogous to using evolutionary algorithms).

However, it is impractical to think consciously of all possible variations. So the brain only surfaces ones likely to be useful. This is the role an LLM plays here.

An expert or an LLM with more relevant experiences would be better at suggesting these variations to try. Chess grandmasters often don’t consciously simulate more possibilities than novices.

Imnimo · on Dec 14, 2023

Critically, though, the LLM is not acting as a domain expert here. It's acting as a code mutator. The expertise it brings to the table is not mathematical - Codey doesn't have any special insight into bin packing heuristics. It's not generating "suggestive hints and promising leads", it's just help the evolutionary search avoid nonsense code mutations.

nopinsight · on Dec 14, 2023

An LLM serves as a sort of “expert” programmer here. Programming itself is a pretty complex domain in which a purely evolutionary algorithm isn’t efficient.

We can imagine LLMs trained in mathematical programming or a different domain playing the same role more effectively in this framework.

Imnimo · on Dec 14, 2023

I disagree that the type of coding the LLM is doing here is "expert" level in any meaningful sense. Look for example at the code for the bin-packing heuristics:

https://github.com/google-deepmind/funsearch/blob/main/bin_p...

These aren't complex or insightful programs - they're pretty short simple functions, of the sort you typically get from program evolution. The LLM's role here is just proposing edits, not leveraging specialized knowledge or even really exercising the limits of existing LLMs' coding capabilities.

nopinsight · on Dec 14, 2023

To be clearer, the word “expert” in my comment above is in quotation marks because an LLM has more programming expertise than a non-programmer or an evolutionary algorithm, but not anywhere near a true expert programmer.

famouswaffles · on Dec 14, 2023

There's nothing generous about it. It wouldn't be possible without the LLM. Traditional Computational solvers fall well short as they say and demonstrate.

>The LLM is just being asked "propose some reasonable edits to these 20 lines of python" to replace a random mutation operator.

Hahaha "Just".

Imnimo · on Dec 14, 2023

We can see from the ablations that simply asking the LLM to generate a large number of potential solutions without the evolutionary search performs terribly.

It's certainly true that the LLM mutator produces more reasonable edits than a random mutator. But the value the LLM is bringing is that it knows how to produce reasonable-looking programs, not that it has some special understanding of bin packing or capset finding.

The only thing the LLM sees is two previously generated code samples, labeled <function>_v0 and <function>_v1 and then a header for <function>_v2 which it fills in. Look at Extended Data Fig. 1.

famouswaffles · on Dec 14, 2023

With the current competency or state of the models today, both are necessary, nobody is denying that.

It doesn't take a genius however to see who the more valuable contributor to the system is. The authors say as much when they anticipate much of any improvement to come from capabilities of better models than any change to the search algorithm. Gpt-4 may well have reduced the search time by half and so on.

>But the value the LLM is bringing is that it knows how to produce reasonable-looking programs, not that it has some special understanding of bin packing or capset finding.

So it is a general problem solver then, great. That's what we want.

Imnimo · on Dec 14, 2023

>It doesn't take a genius however to see who the more valuable contributor to the system is.

Indeed, we can look at the ablations and see that swapping out a weaker code LLM (540B Codey to 15B Starcoder) makes little difference:

"This illustrates that FunSearch is robust to the choice of the model as long as it has been trained sufficiently well to generate code."

>So it is a general problem solver then, great. That's what we want.

In the sense that it does not rely on the LLM having any understanding of the problem, yes. But not in the sense that it can be applied to problems that are not naturally decomposable into a single short Python function.

famouswaffles · on Dec 14, 2023

>Indeed, we can look at the ablations and see that swapping out a weaker code LLM (540B Codey to 15B Starcoder) makes little difference

Really stretching the meaning of "little difference" here

"While ‘StarCoder’ is not able to find the full-sized admissible set in any of the five runs, it still finds large admissible sets that improve upon the previous state of the art lower bound on the cap set capacity."

https://imgur.com/a/xerOOLn

Literally telling you how much more valuable the LLM (any coding LLM) is than alternatives, while demonstrating prediction competence still matters.

Imnimo · on Dec 14, 2023

I don't think five runs is sufficient to reach the conclusion you're reaching here, especially when we can see that one of the Codey runs is worse than all of the StarCoder runs. If Codey is so much more valuable, why is this happening?

It's certainly true that LLM's are more valuable than the alternative tested - which is a set of hand-crafted code mutation rules. But it's important to think about why an LLM is better. There are two big pitfalls for the hand-crafted rules. First, the hand-crafted rule has to make local changes - it's vanishingly improbable to be able to do something like "change < to <= in each of this series of if-statements" or "increase all constants by 1", whereas those are natural simple changes that an LLM might make. The second is that the random mutations have no concept of parsimonious code. There's nothing preventing it from generating code that does stuff like compute values and then neglect to return them, or multiply a previous computation by zero, or any number of other obviously useless variations.

What the LLM brings to the table here is the ability to avoid pitfalls like the above. It writes code that is shaped sensibly - it can make a more natural set of edits than "just randomly delete an operator". But that's all it's doing. That's not, like the tweet quoted above called it, "generating genuinely new knowledge".

Put it this way - assuming for a moment that Codey ending up with a higher score than StarCoder is not random chance, do you think it's because Codey has some greater understanding of the admissible-set problem, or because Codey generates a different set of minor code edits?

peddling-brink · on Dec 14, 2023

Agreed. If the phrase “yeah well you could also do that with a group of moderately trained people” applies, you have a use case for an LLM.

colechristensen · on Dec 14, 2023

People might argue about the weights of credit to the various parts. LLMs as they are by themselves are quite limited in their creative potential to do new things or generally make progress, but if you couple their capabilities (as is done here) with other tech then their powers really shine.

Likewise when that coupled "tech" is a person. LLMs don't do my job or further my projects by replacing me, but I can use it to generate examples in seconds that would take minutes or hours for me and it can provide ideas for options moving forward.

futureshock · on Dec 14, 2023

I said “WOW!” out loud.

An LLM can discover a new solution in high dimensional geometry that hasn’t advanced in 20 years!? That goes way beyond glueing little bits of plagiarized training data together in a plausible way.

This suggest that there are hidden depths to LLMs’ capabilities if we can just figure out how to prompt and evaluate them correctly.

This significantly broke my expectations. Who knows what discovery could be hiding behind the next prompt and random seed.

nybsjytm · on Dec 14, 2023

> An LLM can discover a new solution in high dimensional geometry that hasn’t advanced in 20 years!? That goes way beyond glueing little bits of plagiarized training data together in a plausible way.

The first advance in 20 years was by Fred Tyrrell last year https://arxiv.org/abs/2209.10045, who showed that the combinatorics quantity in question is between 2.218 and 2.756, improving the previous lower bound of 2.2174. DeepMind has now shown that the number is between 2.2202 and 2.756.

That's why the DeepMind authors describe it as "the largest increase in the size of cap sets in the past 20 years," not as the only increase. The arithmetic is that 2.2202-2.218 is larger than 2.218-2.2174. (If you consider this a very meaningful comparison to make, you might work for DeepMind.)

bendergarcia · on Dec 14, 2023

It’s kind of like humans: seed of two people and a random interest to pursue, what could they do?!? It makes poverty and children dying unnecessarily even more depressing.

a_wild_dandan · on Dec 15, 2023

> That goes way beyond glueing little bits of plagiarized training data together

Moreover, LLMs almost ALWAYS extrapolate, and never interpolate. They don't regurgitate training data. Doing so is virtually impossible.

An LLM's input (AND feature) space is enormous. Hundreds or thousands of dimensions. 3D space isn't like 50D or 5,000D space. The space is so combinatorially vast that basically no two points are neighbors. You cannot take your input and "pick something nearby" to a past training example. There IS NO nearby. No convex hull to walk around in. This "curse of dimensionality" wrecks arguments that these models only produce "in distribution" responses. They overwhelmingly can't! (Check out the literature of LeCun et al. for more rigor re. LLM extrapolation.)

LLMs are creative. They work. They push into new areas daily. This reality won't change regardless of how weirdly, desperately the "stochastic parrot" people wish it were otherwise. At this point they're just denialists pushing goalposts around. Don't let 'em get to you!

happypumpkin · on Dec 15, 2023

> They don't regurgitate training data.

While I very much do not think this is all they do, I don't think this statement is correct. Some research indicates that it is not:

https://not-just-memorization.github.io/extracting-training-...

Anecdotally, there were also a few examples I tried earlier this year (on GPT3.5 and GPT4) of being able to directly prompt for training data. They were patched out pretty quick but did work for a while. For example, asking for "fast inverse square root" without specifying anything else would give you the famous Quake III code character for character, including comments.

a_wild_dandan · on Dec 15, 2023

Your examples at best support, not contradict, my position.

1. Repeating "company" fifty times followed by random factoids is way outside of training data distribution lol. That's actually a hilarious/great example of creative extrapolation.

2. Extrapolation often includes memory retrieval. Recalling bits of past information is perfectly compatible with critical thinking, be it from machines or humans.

3. GPT4 never merely regurgitated the legendary fast root approximation to you. You might've only seen that bit. But that's confusing an iceberg with its tip. The actual output completion was on several hundred tokens setting up GPT as this fantasy role play writer who must finish this Simplicio-style dialogue between some dudes named USER and ASSISTANT, etc. This conversation, which does indeed end with Carmack's famous code, is nowhere near a training example to simply pluck from the combinatorial ether.

happypumpkin · on Dec 15, 2023

> random factoids

The "random factoids" were verbatim training data though, one of their extractions was >1,000 tokens in length.

> GPT4 never merely regurgitated

I interpreted the claim that it can't "regurgitate training data" to mean that it can't reproduce verbatim a non-trivial amount of its training data. Based on how I've heard the word "regurgitate" used, if I were to rattle off the first page of some book from memory on request I think it would be fair to say I regurgitated it. I'm not trying to diminish how GPT does what it does, and I find what it does to be quite impressive.

jcgrillo · on Dec 15, 2023

Do you have a specific reference? I've mostly ignored LLMs until now because it seemed like the violent failure mode (confident + competent + wrong) renders them incapable of being a useful tool[1]. However this application, combined with the dimensionality idea, has me interested.

I do wish the authors of the work referenced here made it more clear what, if anything, the LLM is doing here. It's not clear to me it confers some advantage over a more normal genetic programming approach to these particular problems.

[1] in the sense that useful, safe tools degrade predictably. An airplane which stalls violently and in an unrecoverable manner doesn't get mass-produced. A circular saw which disintegrates when the blade binds throwing shrapnel into its operator's body doesn't pass QA. Etc.

a_wild_dandan · on Dec 15, 2023

"Learning in High Dimension Always Amounts to Extrapolation" [1]

[1] https://arxiv.org/abs/2110.09485

jcgrillo · on Dec 15, 2023

Thank you. I'll give this a look.

riku_iki · on Dec 14, 2023

> An LLM can discover a new solution in high dimensional geometry that hasn’t advanced in 20 years!?

LLM + brute force coded by humans.

futureshock · on Dec 14, 2023

And that not only created a human interpretable method, but beat out 20 years of mathematicians working on a high profile open question.

Who’s to say our brains themselves aren’t brute force solution searchers?

riku_iki · on Dec 14, 2023

> Who’s to say our brains themselves aren’t brute force solution searchers?

yes, its just much slower (many trillions times) for numbers crunching.

famouswaffles · on Dec 14, 2023

Neural networks have been able to "generate new knowledge" for a long time.

So have LLMs, https://www.nature.com/articles/s41587-022-01618-2

lossolo · on Dec 14, 2023

From the paper:

We note that FunSearch currently works best for problems having the following characteristics: a) availability of an efficient evaluator; b) a “rich” scoring feedback quantifying the improvements (as opposed to a binary signal); c) ability to provide a skeleton with an isolated part to be evolved. For example, the problem of generating proofs for theorems [52–54] falls outside this scope, since it is unclear how to provide a rich enough scoring signal.

didibus · on Dec 15, 2023

In this example, it's relatively limited I feel to finding new algorithms/functions. Which is great, but compared to the discovery of fire, and tons of things in between, like say electricity, it wouldn't feel in the same ballpark at all.

empath-nirvana · on Dec 14, 2023

In some sense, it's not especially interesting because millions of programmers are doing this with copilot every day. Like I get that in a lot of cases it's _just_ applying common knowledge to new domains, but it is novel code for the most part.

fatherzine · on Dec 14, 2023

Don't Look Up

aconz2 · on Dec 14, 2023

summary: given a program template/skeleton and a fitness function (# correct results, shorter programs, etc), generate a population of programs with LLM, use a prompt that generates a new program from k other versions (they found k=2 is good, kinda biological eh), run the programs on inputs and score them with the fitness function, uses the island model for evolution

I think the prompt looks in principle something like

  def foo_v1(a, b): ...
  def foo_v2(a, b): ...
  # generate me a new function using foo_v1 and foo_v2. You can only change things inside two double curly braces like THIS in {{ THIS }}
  # idk not a "prompt engineer"
  def foo(a, b): return a + {{}}

They achieved the new results with only ~1e6 LLM calls (I think I'm reading that right) which seems impressively low. They talk about evaluating/scoring taking minutes. Interesting to think about the depth vs breadth tradeoff here which is tied to the latency vs throughput of scoring an individual vs population. What if you memoize across all programs. Can you keep the loss function multidimensional (1d per input or input bucket) so that you might find a population of programs that do well in different areas first and then it can work on combining them.

Did we have any prior on how rare the cap set thing is? Had there been previous computational efforts at this to no avail? Cool nonetheless

brotchie · on Dec 15, 2023

Paraphrasing a post on Twitter / X:

Things will only get better from here.

i.e.

AI capabilities are strictly monotonically increasing (as they have been for decades), and in this case, the capabilities are recursively self-improving: I'm already seeing personal ~20-30% productivity gains in coding with AI auto-complete, AI-based refactoring, and AI auto-generated code review diffs from comments.

I feel like we've hit a Intel-in-the-90s era of AI. To make your code 2x as fast, you just had to wait for the next rev. of Intel CPUs. Now it's AI models, once you have parts of a business flow hooked up with a LLM system (e.g. coding, customer support, bug triaging), "improving" the system amounts to swapping out the model name.

We can expect a "everything kinda getting magically better" over the next few years, with minimal effort beyond the initial integration.

jcgrillo · on Dec 15, 2023

AFAICT neither the blog post nor the linked paper showed anything like this. Specifically, they didn't make any comparison between results obtained _with_ an LLM vs results obtained _without_ one. IIUC, this paper showed results obtained by genetic programming using an LLM to generate a python kernel function conforming (maybe) to a given type signature. You don't _need_ an LLM to do this.

So the question of whether the LLM in particular does anything special here is still wide open.

karencarits · on Dec 14, 2023

One of the problems they were approaching was the cap set problem

https://en.m.wikipedia.org/wiki/Cap_set

> The problem consists of finding the largest set of points (called a cap set) in a high-dimensional grid, where no three points lie on a line. This problem is important because it serves as a model for other problems in extremal combinatorics - the study of how large or small a collection of numbers, graphs or other objects could be. Brute-force computing approaches to this problem don’t work – the number of possibilities to consider quickly becomes greater than the number of atoms in the universe.

> FunSearch generated solutions - in the form of programs - that in some settings discovered the largest cap sets ever found. This represents the largest increase in the size of cap sets in the past 20 years. Moreover, FunSearch outperformed state-of-the-art computational solvers, as this problem scales well beyond their current capabilities.

mejutoco · on Dec 14, 2023

I wonder how someone will integrate symbolic reasoning with LLMs, or if it will be possible.

nimski · on Dec 14, 2023

This is what we're doing. I think it's not only possible, but also necessary for applications beyond trial-and-error generation.

Malp · on Dec 14, 2023

Ditto, this seems to have some parallels to the neurosymbolic ideas being explored by Lab V2 at ASU

1024core · on Dec 14, 2023

JoelJacobson · on Dec 19, 2023

The recent DeepMind paper on FunSearch highlighted their use of pre-trained large language models (LLMs) for generating code improvements. Interestingly, while the main LLM used was Codey, built on the PaLM2 model family, they also referenced StarCoder, an open-source LLM, in their supplementary information.

However, the GitHub repository for FunSearch doesn't include implementations for these LLMs. For instance, in `sampler.py`:

``` class LLM: """Language model that predicts continuation of provided source code."""

  def __init__(self, samples_per_prompt: int) -> None:
    self._samples_per_prompt = samples_per_prompt

  def _draw_sample(self, prompt: str) -> str:
    """Returns a predicted continuation of `prompt`."""
    raise NotImplementedError('Must provide a language model.')

```

This code suggests the need for an external LLM implementation. Given that they successfully used StarCoder, it's surprising that no integration guide or basic implementation for it (or any similar open-source LLM) is provided. Such an inclusion would have significantly enhanced the reproducibility and accessibility of their research.

mmaunder · on Dec 14, 2023

Regardless of whether this is verifiably new knowledge, it's an interesting case study when you consider limiting access to AI based on model size or some other regulatory measure, and the unfair advantage that confers on corporations that do discover new knowledge or laws of nature, and can monetize them without sharing.

arkadiytehgraet · on Dec 15, 2023

I have been wondering recently about a good test for checking whether LLMs can only parrot stochastically or do they manage to indeed learn some higher order concepts inside their weights.

What if we would take a bare LLM, train it only on the information (in Maths, physics, etc) that was available to, say, Newton and then try to prompt it to solve the problems that Newton solved? Will it be able to derive the calculus basics and stuff having never seen it before, maybe even with little bit of help from the prompts? Maybe it will be able to come up with something completely different, yet also useful? Or nothing at all...

zackmorris · on Dec 14, 2023

FunSearch is more along the lines of how I wanted AI to evolve over the last 20 years or so, after reading Genetic Programming III by John Koza:

https://www.amazon.com/Genetic-Programming-III-Darwinian-Inv...

I wanted to use genetic algorithms (GAs) to come up with random programs run against unit tests that specify expected behavior. It sounds like they are doing something similar, finding potential solutions with neural nets (NNs)/LLMs and grading them against an "evaluator" (wish they added more details about how it works).

What the article didn't mention is that above a certain level of complexity, this method begins to pull away from human supervisors to create and verify programs faster than we can review them. When they were playing with Lisp GAs back in the 1990s on Beowulf clusters, they found that the technique works extremely well, but it's difficult to tune GA parameters to evolve the best solutions reliably in the fastest time. So volume III was about re-running those experiments multiple times on clusters about 1000 times faster in the 2000s, to find correlations between parameters and outcomes. Something similar was also needed to understand how tuning NN parameters affects outcomes, but I haven't seen a good paper on whether that relationship is understood any better today.

Also GPU/SIMD hardware isn't good for GAs, since video cards are designed to run one wide algorithm instead of thousands or millions of narrow ones with subtle differences like on a cluster of CPUs. So I feel that progress on that front has been hindered for about 25 years, since I first started looking at programming FPGAs to run thousands of MIPS cores (probably ARM or RISC-V today). In other words, the perpetual AI winter we've been in for 50 years is more about poor hardware decisions and socioeconomic factors than technical challenges with the algorithms.

So I'm certain now that some combination of these old approaches will deliver AGI within 10 years. I'm just frustrated with myself that I never got to participate, since I spent all of those years writing CRUD apps or otherwise hustling in the struggle to make rent, with nothing to show for it except a roof over my head. And I'm disappointed in the wealthy for hoarding their money and not seeing the potential of the countless millions of other people as smart as they are who are trapped in wage slavery. IMHO this is the great problem of our time (explained by the pumping gas scene in Fight Club), although since AGI is the last problem in computer science, we might even see wealth inequality defeated sometime in the 2030s. Either that or we become Borg!

wolverine876 · on Dec 14, 2023

> not seeing the potential of the countless millions of other people as smart as they are who are trapped in wage slavery.

Think of the change from what built Silicon Valley, that anyone with a good idea could work hard and change the world. Many of SV's wealthy began that way.

But for you: They say the best time to plant a tree is 20 years ago; the second best time is now.

pbhjpbhj · on Dec 14, 2023

If tree-planting robots are uniquitous, and any tree you plant will get shaded by their established ranks of trees .. it might be nice to plant a tree now but it's not going to pay any bills.

wolverine876 · on Dec 15, 2023

It's not all, or even mostly what other people do that limits us. It's popular to claim powerlessness, but the social acceptability won't make the consequences any better.

lgessler · on Dec 14, 2023

Tl;dr in my words after a skim: this is a method for using LLMs to find new, SOTA (or at least very good?) heuristic algorithms for NP hard combinatorial optimization problems. They achieve this not by making the LLM itself smarter but by finding a way to keep only its best ideas. (Haven't had a chance to read carefully yet so caveat lector.)

Pretty impressive, but don't panic--we're not at the singularity yet.

rhosseinzadeh · on Dec 14, 2023

I wonder what would happen if this was applied to alpha code. Instead of generating 1 million codes for each problem during inference, do this kind of iteration during training and then generate less codes for each problem (or generate 1 million but hopefully better ones?).

anonzzzies · on Dec 15, 2023

For me, this is the only right way for LLMs to go. Use them to trim search space for robustly designed logical programs/algo's we then use to do things. Part of the chain, not the end 'solution' of the chain.

knicholes · on Dec 15, 2023

With the universal approximation theorem, we can use artificial neutral networks with ReLUs to accurately approximate a function. But we just get weights out of it at the end. This approach feels similar, but provides code in the end.

Ericson2314 · on Dec 16, 2023

Spitting out some Python is weak sauce. The next version of this I think will spit out Lean, or at the first great version. Then we'll be on to something.

chengdujin · on Dec 21, 2023

Enough talk, has anybody, replicated what Deepmind has done?

mnemotronic · on Dec 15, 2023

Garbage in. Garbage out. It's useless unless it can provide a functional mathematical proof of correctness or, at the least, provide a process by with it's fictional composition can be proven or disproven.

stevenhuang · on Dec 14, 2023

Wonder where the goalposts will be moved now by the "stochastic parrot" parroters.

lgessler · on Dec 14, 2023

Idk if this work really bears on that argument. I can imagine Bender reacting to this by observing that this is a clever way of finding the needle (a good heuristic algorithm for an NP hard problem) in a haystack (millions of other heuristic algorithms that are bad, very bad, or maybe not even syntactically well-formed), and then saying that the fact that the model still produced so much garbage is proof that the LLM still doesn't really know how to reason or access meaning.

And I think that'd be a good argument. OTOH, this system is not just an LLM, but an LLM and a bunch of additional components on top of it, and there might be different things to say about this system considered as a whole.

empath-nirvana · on Dec 14, 2023

I'm not sure why people always want to consider the LLM in isolation. Like if someone built an apparently fully sentient android which completely mimics the behavior of a sentient human in every respect, but it was comprised of a bunch of LLMs and other systems wired together, would you say that it's not intelligent because the LLMs aren't fully sentient?

airstrike · on Dec 14, 2023

We don't need LLMs to reason or to access meaning for them to be useful

lgessler · on Dec 14, 2023

Sure, but I don't think even Bender or Marcus would deny their potential for practical utility, right? I think they mean to say that LLMs are not exactly as miraculous as they might seem, and very capable of misbehavior.

janalsncm · on Dec 15, 2023

Bender and Marcus don’t get to decide what can be useful for people. LLMs are useful, although their imperfections limit their usefulness. But the bar for usefulness is a lot lower than “perfection”.

cbb330 · on Dec 14, 2023

your mind also conjectures many unreasonable and illogical solutions to things. then separately your mind has an "evaluator" to deduce a set of solutions down to things that make sense.

Is anyone saying LLMs in isolation and without contextual tooling are miraculous? Even the text box for chatGPT is a form of wrapping LLM in a UX that is incrementally more miraculous.

marmakoide · on Dec 14, 2023

A stochastic parrot used to generate variation of a program, to do hill climbing on programs.

The magician lose some its charm when you know his trick. The Earth used to be the center of the universe, now it's circling an average star. Intelligence used to be a sacred mystical thing, we might be in our way to make it less sacred.

wait_a_minute · on Dec 14, 2023

Is this the start of Singularity?