It depends on the eval, but I think it's fair to say it's close. Here are the AGIEval results organized into a table with averages (I also added the new Hermes Llama 2 13B model): https://docs.google.com/spreadsheets/d/1kT4or6b0Fedd-W_jMwYp...
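(For reference, a minimal sketch of how per-model averages like those in the sheet could be computed; the category names and scores below are hypothetical placeholders, not the actual sheet values:)

    import pandas as pd

    # Hypothetical AGIEval-style scores; the real numbers live in the linked sheet.
    scores = pd.DataFrame(
        {
            "sat-math": [0.35, 0.42],
            "sat-en": [0.60, 0.58],
            "lsat-lr": [0.45, 0.40],
        },
        index=["hermes-llama2-13b", "chatgpt"],
    )

    # Row-wise mean across benchmark categories gives each model's average.
    scores["average"] = scores.mean(axis=1)
    print(scores.sort_values("average", ascending=False))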
It beats out ChatGPT in every category except SAT-Math. We definitely need harder benchmarks.
So far, there's BIG-Bench Hard (https://github.com/suzgunmirac/BIG-Bench-Hard) and, just published, the Advanced Reasoning Benchmark (https://arb.duckai.org/).