Hacker News | new | past | comments | ask | show | jobs | submit | login

Seems like it is indeed the new SOTA model, with significantly better scores than o3, Gemini, and Claude in Humanity's Last Exam, GPQA, AIME25, HMMT25, USAMO 2025, LiveCodeBench, and ARC-AGI 1 and 2.

Specialized coding model coming "in a few weeks". I notice they didn't talk about coding performance very much today.



Agreed. I noticed a quick flyby of a bad “reasoning smell” in the baseball World Series simulation, though - it looks like it pulled some numbers from Polymarket, reasoned for a long time, and then came back with the Polymarket number for the Dodgers, presented as its own. It was a really fast run-through, so I may be wrong, but it reminds me that it’s useful to have skeptics on the safety teams of these frontier models.

That said, these are HUGE improvements. Provided we don’t have benchmark contamination, this should be a very popular daily driver.

On coding - 256k context is the only real bit of bad news. I would guess their v7 model will have longer context, especially if it’s better at video. Either way, I’m looking forward to trying it.


Either they overtook other LLMs simply by using more compute (which is plausible, as they have a lot of GPUs), or, I'm willing to bet, there is benchmark contamination. I don't think their engineering team came up with any better techniques than those used in training other LLMs, and Elon has a history of making deceptive announcements.


How do you explain Grok 4 achieving new SOTA on ARC-AGI-2, nearly doubling the previous commercial SOTA?

https://x.com/arcprize/status/1943168950763950555


They could still have trained the model in such a way as to focus on benchmarks, e.g. training on more examples of ARC style questions.

What I've noticed when testing previous versions of Grok: on paper they were better at benchmarks, but in practice the responses were always worse than Sonnet's and Gemini's, despite the higher scores.

Occasionally I test Grok to see if it could become my daily driver but it's never produced better answers than Claude or Gemini for me, regardless of what their marketing shows.


> They could still have trained the model in such a way as to focus on benchmarks, e.g. training on more examples of ARC style questions

That's kind of the idea behind ARC-AGI. Training on available ARC benchmarks does not generalize. Unless it does... in which case, mission accomplished.


It still seems possible to put the effort into building up an ARC-style dataset, and that would game the test. The ARC questions I saw were not on some completely unknown topic; they were generally hard versions of existing problems in well-known domains. I'm not super familiar with this area in general, though, so I'd be curious if I'm wrong.


ARC-AGI isn't question- or knowledge-based, though, but "Infer the pattern and apply it to a new example you haven't seen before." The problems are meant to be easy for humans but hard for ML models, like a next-level CAPTCHA.

They have walked back the initial notion that success on the test requires, or demonstrates, the emergence of AGI. But the general idea remains, which is that no amount of pretraining on the publicly-available problems will help solve the specific problems in the (theoretically-undisclosed) test set unless the model is exhibiting genuine human-like intelligence.
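To make the "infer the pattern, apply it to a new example" framing concrete, here is a toy sketch of the shape of an ARC-style task. The grids, the "mirror each row" rule, and every name below are invented for illustration; real ARC tasks are larger colored grids with far less obvious rules.

```python
# Toy ARC-style task: two demonstration pairs exhibit a hidden rule
# ("mirror each row left-right"); the solver must infer it and apply
# it to a held-out test grid. Grids are lists of rows of small ints.

def reflect_lr(grid):
    """The hidden rule for this toy task: reverse every row."""
    return [row[::-1] for row in grid]

demonstrations = [
    ([[1, 0], [0, 2]], [[0, 1], [2, 0]]),
    ([[3, 3, 0], [0, 0, 4]], [[0, 3, 3], [4, 0, 0]]),
]

# A candidate rule must explain every demonstration pair...
assert all(reflect_lr(inp) == out for inp, out in demonstrations)

# ...and is then scored on a test input whose answer was never shown.
test_input = [[5, 0, 0], [0, 6, 0]]
print(reflect_lr(test_input))  # [[0, 0, 5], [0, 6, 0]]
```

The point of the benchmark's design is that memorizing the public demonstrations doesn't help: each hidden task has its own rule, so the solver has to do the inference step fresh every time.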

Getting almost 16% on ARC-AGI-2 is pretty interesting. I wish somebody else had done it, though.


I’ve seen some of the problems before, like https://o3-failed-arc-agi.vercel.app/

It is not hard to build datasets that contain these types of problems, and I would expect LLMs to generalize well from them. I don’t see how this is really any different from any other type of problem LLMs are good at, given that they have the dataset to study.

I get that they keep the test updated with secret problems, but I don’t see why companies couldn’t game this just by investing in building their own datasets, even if it means paying teams of smart people to generate them.


The other question is whether enough examples of this type of task are helpful and generalizable in some way. If so, why wouldn't you integrate such a dataset into an LLM's training pipeline?


I use Grok with repomix to review my code, and it tends to give decent answers; it is a bit better at giving actual actionable issues with code examples than, say, Gemini 2.5 Pro.

But the lack of a CLI tool like codex, claude code or gemini-cli is preventing it from being a daily driver. Launching a browser and having to manually upload repomixed content is just blech.

With gemini I can just go `gemini -p "@repomix-output.xml review this code..."`


Well try it again and report back.


As I said, either by benchmark contamination (the benchmark is semi-private, and could have been obtained by people at other companies whose models have been benchmarked) or by having more compute.


I still don't understand why people point to this chart as meaning anything. Cost per task is a fairly arbitrary x-axis and in no way represents any sort of timescale. I would love to be told how they didn't underprice their model and give it an arbitrary amount of time to work.


Anecdotally, output in my tests is pretty good. It's at least competitive with SOTA from other providers right now.


I wish the coding models were available in coding agents. Haven't seen them anywhere.


Grok 4 is now available in Cursor.


I just tried it, it was very slow like Gemini.

But I really liked the few responses it gave me: highly technical language. Not the flowery stuff you find in ChatGPT or Gemini, but much more verbose and thorough than Claude.


I like that Grok doesn't kiss my ass like Gemini and ChatGPT keep doing with their "excellent idea!" -crap.


Interesting, I have the latest update and I don't see it in the models list.


I had to go to add more models, and then it was available. So far, it is able to do some things that other models were not previously able to do.


You have to go to the settings and view more models and select it from the drop-down list.


Plenty like Aider and Cline can connect to pretty much any model with an API.
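For example, Aider documents an OpenAI-compatible mode that can be pointed at a third-party endpoint. A rough sketch of wiring it up, assuming xAI exposes an OpenAI-compatible API at api.x.ai and that `grok-4` is the served model name (both assumptions here, not verified):

```shell
# Assumption: xAI serves an OpenAI-compatible API; the model name may differ.
export OPENAI_API_BASE=https://api.x.ai/v1
export OPENAI_API_KEY=your-xai-key    # hypothetical placeholder, not a real key
# The openai/ prefix tells Aider to use the custom base URL above.
aider --model openai/grok-4
```

The same env-var pattern works for most OpenAI-compatible clients, which is why these tools can "connect to pretty much any model" without per-provider support.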


Even if one does not have a positive view of Elon Musk, Grok catching up to the big three (Google, OpenAI, Anthropic) is incredible. They are now at approximately the same level.


[flagged]


Well we have GPT-5 and Gemini 3 in the wings so it wouldn't be surprising if it is SOTA for a few days.


Yup, this will probably trigger the next wave of releases; someone had to go first.


xAI, with OpenAI just a few weeks ahead of them, was the first to get a cluster up of sufficient size to train a GPT-5-class model. xAI released this as fast as they could; it hasn't been sitting on a shelf for months, and neither has GPT-5.


> Seems like it is indeed the new SOTA model, with significantly better scores than o3

It has been demonstrated for quite some time that censoring models results in drastically reduced scores. Sure, maybe prevent it from telling someone how to build a bomb, but we've seen Grok 3 routinely side with progressive views despite having access to the worst of humanity (and its sponsor).


Wait, are you implying that Grok 3 is "censored" because it aligns with "progressive" views?


I think they're implying that Grok is smarter because it's less censored, and then separately noting that it still tends to be fairly progressive despite the lack of censorship (when it's not larping as Hitler) even though it was presumably trained on the worst humanity has to offer.

Man, that sentence would have been incomprehensible just a couple years ago.


That's what I was going for.


As has been the case in almost all US social media companies until the last year. They were all heavily biased and censored towards left-leaning views.



