LLaMA2 Chat 70B outperformed ChatGPT

Jackson__ · on July 27, 2023

*When asked by GPT4 to compare the outputs.

I'm a staunch believer that it would be foolish to rely on GPT4 for quality comparisons, and it has been mind boggling to see so many people do it and treat it as perfect proof of anything.

It would be slightly more understandable if there was a study to see how human and gpt4 preferences compare, but I'm unaware of any such thing.

letmevoteplease · on July 27, 2023

There is one: "The agreement between GPT-4 and humans reaches 85%, which is even higher than the agreement among humans (81%). This means GPT-4’s judgments closely align with the majority of humans. We also show that GPT-4’s judgments may help humans make better judgments. During our data collection, when a human’s choice deviated from GPT-4, we presented GPT-4’s judgments to humans and ask if they are reasonable. Despite different views, humans deemed GPT-4’s judgments reasonable in 75% of cases and are even willing to change their choices in 34% of cases."[1]

[1] https://arxiv.org/abs/2306.05685

sdenton4 · on July 27, 2023

That's helpful!

I've done a lot of work in audio synthesis, which is notoriously difficult measure. The gold-standard is human ratings of audio quality, but it is tough to design good tests (easy to fatigue raters) and the iteration time waiting for results is quite long.

Instead, there's now some projects which use neural networks trained on human ratings to predict audio quality, such as ViSQoL: https://github.com/google/visqol

This opens up fast iteration - scores going up generally corresponds to higher quality - followed by human testing at major milestones (eg, releasing a paper/model). VISQOL has a harder time comparing 'unrelated' models, IMO - ends up being not so great for comparison of different techniques, but excellent for measuring incremental improvement or catching regressions.

But, in the end, yes - you can use NN's to measure the quality of other NN's, so long as you're careful about it and make use of human raters from time to time as well.

The problem of test data getting into the training data seems to be an especially pernicious issue with LLM's, which isn't really arising in the audio synthesis space.

blackkettle · on July 27, 2023

I’ve started doing this with ASR hypotheses from colloquial spontaneous speech. It tends to have similar issues. Lots of shady human ground truth especially where addresses, alphanumeric sequences, repairs and repetitions and other essentially non read speech are concerned. The very large Whisper models are consistent in their transcription style and highly reliable as long as you pick strongly represented languages. And ChatGPT can do a very good job at comparing the linguistic coherence of hypotheses from multiple recognizers. Together these models can annotate, analyze and ingest far more data more consistently than human annotators at this point (at least in the best covered languages). We haven’t quite realized this as a community yet though, because the standard datasets we use for evaluation contain all these human inconsistencies. Wild times.

2c2c2c · on July 27, 2023

just curious, are there any open models doing the opposite of audio synthesis? As in able to generate the stems for a song?

sdenton4 · on July 28, 2023

Ha, I was working on neutral voice compression, which consists of an encoder which creates what the kids are calling tokens these days, and a decoder which synthesizes speech from the tokens.

AudioLM puts a language model on top of the compression tokens, and thus can generate speech or other audio. There's piles of recent papers pushing that approach into music generation. Mulan is a name that comes to mind.

Or maybe your interested in audio separation to get at the isolated instruments? There's lots of great work on that, as well. Like MixIT, which is an unsupervised audio separation system.

edude03 · on July 27, 2023

https://github.com/facebookresearch/demucs

bottlepalm · on July 27, 2023

It's funny how ChatGPT really does give you the most balanced, middle of the road answers. It feels like a distillation of all human knowledge and sentiments. I use it constantly to get advice on plans, architectures, thoughts, etc.. to get an idea of pretty much what the average person would think. It often points out things I've overlooked which I'll improve my design with and go back and forth with ChatGPT until we're both in agreement.

I even read a classic book the other day and had a great discussion with ChatGPT about moral relativism, the different schools of thought and how it fit into philosophy as a whole. For students this technology is incredible, I wish I had it for all my classes.

Even sometimes comments I'll make on here or Reddit I'll pass through ChatGPT first to see if I made any mistakes in my logic.

beepbooptheory · on July 27, 2023

Would put slight caution around asking it anything more than "what should I read next about this?"

For whatever reason*, it is particularly bad at discussing philosophy I find. When I was grading philosophy 101, I would have probably given it a passing grade against the overall curve, but that's about it. Philosophy is a discipline of careful, sometimes jargoney, and always very couched assertions that can be easily misunderstood and appropriated. This is probably its greatest weakness, and in many ways this weakness is the progenitor of philosophy itself in the Western world, with Plato at the start (i.e. with the figure of the sophist, the paradox of a false wisdom).

- Maybe one reason: there is a huge amount of, lets say, "armchair philosophy" on the internet, compared to other disciplines. Many blogposts and tiny manifestos of people really excited by some out of context quote from Spinoza or whatever. And you start to really feel this part of the dataset when you ask about philosophy. Many strange takes and misunderstandings.

bottlepalm · on July 27, 2023

I asked it where moral relativism fit in with philosophy and it came back with this

    Philosophy
        Ethics/Moral Philosophy
            Meta-Ethics: The study of moral thought, language, and properties
                Moral Realism: Belief that there are objective moral facts
                Moral Anti-Realism: Denial of the existence of objective moral facts
                    Moral Relativism: The belief that moral judgments are true or false only relative to some particular standpoint

I thought that was pretty good, what do you think?

beepbooptheory · on July 27, 2023

Haha I think it's fine. I think its kinda cheeky answering you so literally, giving it an actual place to fit into :).

I don't doubt it can do, like, Wikipedia type classification ok, but that's not like really getting to the substance of anything! And, either way, its not like there is one decided-upon hierarchy of concepts like this people consciously work within. This is a fine picture to some, but others might contest, perhaps, that Meta-Ethics is the "study of moral thought, language, and properties." What is "moral language" anyway? Why is it meta relative to Moral Philosophy writ-large? Or perhaps one might argue that we need to think of meta-ethics as a sibling rather than child. The whole discipline is a mess of different thoughts and possible rebuttals and grand intellectual overturnings that will not be captured here. Maybe just try pasting that back into the prompt and asking "what's wrong with this picture?".

But like I said, its fine in that its fairly comparable to Wikipedia for utility, (with IMO a worse interface, but I get why people like it more).

bottlepalm · on July 28, 2023

Yea I don’t think it was trained on any hierarchy in particular, ChatGPT literally came up with this on the fly. There’s even more to it I didn’t include to keep the post small. Just really impressive to have a conversation with it that goes all sorts of places. I don’t know any philosophy professors so this is the next best thing.

dgroshev · on July 27, 2023

The reason you feel that way might be that you are familiar enough with philosophy.

After all, LLMs and ChatGPT in particular are are indistinguishable from productised Gell-Mann Amnesia.

Edit: rewrote to be more neutral, sorry

TowerTall · on July 27, 2023

> to get an idea of pretty much what the average person would think

There is no such thing as an average person.

https://www.thestar.com/news/insight/when-u-s-air-force-disc...

Folcon · on July 27, 2023

Funnily enough I think you both might be right here, there isn't such a thing as an average person, but ChatGPT may be the synthesis of the average opinion.

wahnfrieden · on July 27, 2023

what is an average opinion? it is the sum of opinions which disagree with the result

oceanplexian · on July 27, 2023

It gives you the average opinion of someone on the Internet, particularly places like Reddit, which is FAR from the “average opinion” of most people. This has been a problem for a while, people keep assuming the Internet represents some kind of moral or ethical consensus on so many issues when it’s not even close.

mycall · on July 27, 2023

On the other hand, since AIs are taught using user content from Reddit, once other cultures start using these AIs, they will begin to conform more towards those norms (or will actively go against it and The Great Sort will continue forward).

JieJie · on July 27, 2023

Maybe "balanced" rather than "average" is a better way of putting it?

wahnfrieden · on July 27, 2023

What’s the balance between, say, opinions that trans people should be exterminated vs not? What’s the balance between Ukraine sovereignty vs Russian occupation? Etc

ben_w · on July 27, 2023

My general experience (not that topic) is that for anything even slightly approaching the sides of the Silicon Valley Overton Window, ChatGPT creates micro-essay saying "On the one hand, $foo, on the other hand, $bar. It's important to remember that $topic is controversial, and that many people disagree."

hobomatic · on July 27, 2023

It would involve serious consideration of the opinions you don't agree with, rather than just qualifying them in the most hyperbolic and dismissive way possible. Hopefully ai is better capable of this sort of reasoning than people are.

wahnfrieden · on July 28, 2023

kelseyfrog · on July 27, 2023

The other way to phrase this is high-dimensional spheres are "spikey".

sundarurfriend · on July 27, 2023

> what the average person would think

Keep in mind that due to the nature of the data and the RLHF training, it's more like a weighted average, something like

    0.5*(average American view) + 0.4*(average WEIRD view) + 0.1*(average human view).

(where WEIRD = Western, Educated, Industrialized, Rich, and Democratic, standard terminology in psychological research.)

This may or may not matter depending on the questions you're asking, just something to keep in mind.

lolinder · on July 27, 2023

80% agreement is high, but the margins between models at the top are so low that even that remaining 20% could be enough to alter the final rankings, depending on which direction it errs.

bostonsre · on July 27, 2023

Is perfect agreement possible? And what is the definition of agreement? Humans don't agree about much.. are we saying agreement means it matches the truth after intensive investigation by humans?

shanusmagnus · on July 27, 2023

Ugh, when I was doing my PhD work we were studying creativity in an experiment, and we needed an assessment for how creative different solutions were, and trying to get inter-rater reliability on this quite simple thing was just agonizing.

I wound up abandoning the experiment because getting enough reliability would have required screwing down the standards so tightly that it would have ruined the underlying point of the thing.

biomcgary · on July 27, 2023

Kind of hard to consistently evaluate creativity when someone might pull a James T. Kirk (https://en.wikipedia.org/wiki/Kobayashi_Maru), which is only creative the first time and just a cheat thereafter.

shanusmagnus · on July 29, 2023

Just one of many problems :)

drittich · on July 28, 2023

It's surprising to me that you describe creativity as a simple thing. How were you defining and measuring it?

shanusmagnus · on July 29, 2023

I don't (and didn't) think creativity was simple, but the task was super simple (alternate uses task). The fact that people couldn't agree on how creative the answers were to such a simple task was the insight, although maybe it was only an insight because I was naive.

bagels · on July 27, 2023

Not just that, but are the 80/20 randomly distributed? Probably not. These comparisons might have more in common with the 20%.

sdenton4 · on July 27, 2023

If only there were some way to produce some kind of "interval" of scores where you were confident that the actual score sat, and then had some way of comparing these intervals between the different models...

s17n · on July 27, 2023

But 80% agreement is the same as between humans

lolinder · on July 27, 2023

True, but given that you're using one of the models to judge the others, it's likely that the cases of disagreement will tend to favor GPT-4. You would never use one of the competitors as a judge among humans.

koalacola · on July 27, 2023

If it's trained by humans is it safe to assume that we'll get it so something crazy like 99% agreeable?

reaperducer · on July 27, 2023

it has been mind boggling to see so many people do it and treat it as perfect proof of anything.

The world has long been divided into two camps: People who think computers can make mistakes; and people who think computers never make mistakes, and blame the humans that program them.

Well, now the computers are programming themselves. And clearly they're making mistakes.

courseofaction · on July 29, 2023

I agree for data creation, but evaluation seems to have little risk of contaminating the outcomes when used with human validation.

alecco · on July 27, 2023

Better evaluation paints a bit different picture:

https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...

*FreeWilly2 is a Llama2 70B model finetuned on an Orca style Dataset

EDIT: actually, impressive:

                   FreeWilly2  GPT-3.5  GPT-4
    ARC               71.1      85.2     96.3
    HellaSwag         86.4      85.5     95.3
    MMLU              68.8      70.0     86.4
    TruthfulQA        59.4      47.0     59.0

So reasoning (ARC) is lagging behind, but the other evaluations are at GPT-3.5 level and closing the gap with 4.

Source for GPT-3.5 and GPT-4.0 values (but mind it might not be the same # of shots)

https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...

Tostino · on July 27, 2023

That seems more in-line with my experience. I have been using GPT-3.5 and GPT-4 for data cleaning pipelines, and have tried to swap out LLaMA2 70B in a few of the "easier" tasks, and it hasn't performed well enough yet for any of my tasks done by GPT-3.5.

dr_kiszonka · on July 27, 2023

Hi! Could you please share a few words on what type of data you are cleaning using GPT? It is an intriguing idea and I would love to learn more to see if I could use a similar approach.

Tostino · on July 27, 2023

Youtube transcripts. It only works with single-person channels at the moment, as I haven't worked on disambiguating multiple speakers. They are very messy if they are just an auto transcription from Google. Practically unusable in most cases.

So first step in the pipeline is cleaning up the transcripts for incorrectly transcribed words or sentences. Using the context of the rest of the transcript, it is able to fix the vast majority of them. Then we add punctuation and format it with paragraphs. Then I have another check over the whole transcript for any remaining issues.

After all of that, I have a relatively clean transcript that represents the original audio very closely. From there, I am doing things like: 1. creating question/answer pairs from the transcript 2. creating a document of additional context that fills in details about what the speaker is talking about but may not have explicitly said 3. creating a summary of the transcript identifying the main purpose 4. creating a knowledge graph from the transcript with nodes and edges 5. creating an annotated version of the transcript using that knowledge graph

I plan on putting some of this data into a vector database, and some of it will be used for fine tuning LLaMA2 models on specific tasks (like knowledge graph creation, annotation using a knowledge graph, and writing using a knowledge graph to keep track of events)

dr_kiszonka · on July 27, 2023

That's very informative - thanks so much for the explanation!

My only experience with transcripts is in the context of transcribing short interviews. I used Whisper and it was pretty good. I mostly work with quantitative data, though.

In terms of the disambiguation of speakers, I haven't done it, but I remember blind signal separation discussed in a signal processing seminar I attended. There is also this paper, in case you haven't seen it already: https://enk100.github.io/speaker_separation/

Thanks again!

Tostino · on July 27, 2023

I have not read that yet, thank you!

Also, I haven't tried using Whisper for getting a transcription from the audio. I went the route of downloading the automatically generated transcripts from Youtube for a set of videos. An audio processing pipeline is definitely something I could add later though as an additional input channel for the overall pipeline.

agravier · on July 27, 2023

How do you represent knowledge in your knowledge graph? Do you use an existing open source ontology?

Tostino · on July 28, 2023

No, I just used YAML to represent it.

Here is an example: https://gist.github.com/Tostino/f6f19e88e39176452c1a765cb7c2...

Here is the transcript that I created that knowledge graph from, and then annotated with the knowledge graph for training purposes: https://gist.github.com/Tostino/e64524437848fbb3aebe52056df8...

Edit: I am using symbolic IDs intentionally. Reason for that was this paper: https://ai.googleblog.com/2023/07/symbol-tuning-improves-in-...

popinman322 · on July 27, 2023

Was this with or without fine-tuning?

Tostino · on July 27, 2023

That is with fine-tuning: https://stability.ai/blog/freewilly-large-instruction-fine-t...

Tostino · on July 28, 2023

I should be clear though, I didn't fine-tune on my specific tasks myself (as that is the end result of what I am doing with the pipelines outputs). I just tried the LLaMA2-chat model (which is fine-tuned), and the Freewilly 2 finetune.

sytelus · on July 27, 2023

LLaMA2 is far and away from GPT 3.5. Just look at HumanEval and other code generation metrics. All these GPT-4 based "chat evals" are extremely misleading and people should take it with a bag of salt.

alecco · on July 27, 2023

That's what I say in my comment, OP's ranking is quite misleading.

The ranking I linked and quoted in my comment's is much better. See the About tab. It has 4 evaluations and it doesn't use GPT-4 to evaluate.

Also the top one is a tuned Llama 2. Also clarified in my original comment.

swyx · on July 28, 2023

many usecases that arent code, + people can finetune llama2 with code. its a great start for a free model.

lhl · on July 27, 2023

It depends on the eval, but I think it's fair to say that it's close. Here is the AGI Eval results organized into a table w/ averages (also I put in the new Hermes LLama2 13B model as well: https://docs.google.com/spreadsheets/d/1kT4or6b0Fedd-W_jMwYp...

It beats out ChatGPT in every category except SAT-Math. We definitely need harder benchmarks.

So far, there's BIG-Bench Hard https://github.com/suzgunmirac/BIG-Bench-Hard and just published, Advanced Reasoning Benchmark https://arb.duckai.org/

cj · on July 27, 2023

It looks like ChatGPT length is 827 while LLaMA2 length is more than double at 1790.

Disclaimer from the site:

> Caution: GPT-4 may favor models with longer outputs and/or those that were fine-tuned on GPT-4 outputs.

> While AlpacaEval provides a useful comparison of model capabilities in following instructions, it is not a comprehensive or gold-standard evaluation of model abilities. For one, as detailed in the AlpacaFarm paper, the auto annotator winrates are correlated with length.

cs702 · on July 27, 2023

Also, Llama 2 is still a few percentage points below GPT-4.

Which is not close, because performance is logarithmic in training compute. Each additional percentage point of performance requires exponentially greater investment in compute during pretraining. Llama 2 was pretrained on 2 trillion tokens -- a significant investment in compute, for sure, but still not enough to get close to GPT-4.

europeanNyan · on July 27, 2023

There is a cool website where you can blind judge the outputs from LLaMa 2 vs ChatGPT-3.5: https://llmboxing.com/

Surprisingly, LLaMa 2 won 5-0 for me.

drew-y · on July 27, 2023

I got the opposite result. ChatGPT-3.5 won 5-0 for me. For me, LLaMa 2 gave longer answers that sometimes strayed away from the original question.

They both gave great answers overall though.

dsabanin · on July 28, 2023

Same for me. Interesting..

thorum · on July 27, 2023

In a response about the Turing test on this site, LLaMa 2 used the phrase “to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human” which appears to be copied verbatim from the first sentence of the Wikipedia article on the subject (as well as quite a few other pages in Google). Makes me wonder how many of the responses are just repeating and rephrasing memorized content written by humans, which will of course appear better, while ChatGPT makes more effort to avoid this (and might be able to generalize better to things it hasn’t memorized?).

user_7832 · on July 27, 2023

Thanks for the link!

At least in my examples, the llama output was more verbose/comprehensive. Sometimes ChatGPT didn't expand enough, sometimes Llama missed the mark entirely (eg explaining the Eiffel's architecture.)

andrei512 · on July 27, 2023

all the shorter answers were from GPT-3 - if you like long answers you pick llama 2...

SV_BubbleTime · on July 27, 2023

Interesting exercise, and llama won for me with 1 GPT answer… but it would be VERY easy to cherry pick these results and select a winner for most people.

speedgoose · on July 27, 2023

It was much closer to me. But llama 2 did surprisingly good. It’s looks like it’s a great alternative of chatGPT 3.5.

Tommstein · on July 27, 2023

Pretty cool. ChatGPT won the first one for me, then Llama 2 won the next five.

freedomben · on July 27, 2023

Llama2 beat ChatGPT 3.5 with a 92.66% win rate to 89.37%, but lost to GPT-4 which got 95.28%. Still pretty amazing though!

cs702 · on July 27, 2023

Not really close, because performance is logarithmic in training compute.

That is, each additional percentage point of performance requires exponentially greater investment in compute during pretraining.

Llama 2 was pretrained on 2 trillion tokens -- a significant investment in compute, for sure, but still not enough to get close to GPT-4.

And this is only one benchmark.

generalizations · on July 27, 2023

* ChatGPT 3.5. But it's also within spitting distance of GPT4, which is very exciting.

valine · on July 27, 2023

GPT4 is more difficult to measure I think. The value I get from GPT4 is in the details it gets right on very obscure, complex questions. I'm not sure benchmarks are capturing how far GPT4 is ahead of other models. For simple stuff it's not that much better than 3.5.

make3 · on July 27, 2023

performance is logarithmic as a function of money invested in compute, so maybe it's close but it's also far away

heliophobicdude · on July 27, 2023

The benchmark I care about the most for my development workflow is on structured output.

Paul Gauthier made this benchmark [1] to measure correct git diffs. If you ask GPT-4 for help with your code, it can output a change in a git diff more reliably than 3.5.

My hope is that we can do that with Llama 2.

1:https://aider.chat/docs/benchmarks.html

RcouF1uZ4gsC · on July 27, 2023

Those MacBook Pros with 96 GB of unified GPU/CPU memory are looking pretty good right now.

It would be awesome to have all this running on a laptop in a completely offline mode.

TillE · on July 27, 2023

I think it'd be more fun to spend an extra $700 and get an M2 Ultra Mac Studio with way more GPU cores and 128GB of RAM, and set up a private server.

But if you really want a portable offline thing, sure.

shanusmagnus · on July 27, 2023

I get more excited at the prospect of popping into some random cafe, SSHing from my iPad into some vast.ai server and setting loose this giant AI brain on whatever stuff. Feels badass.

aantix · on July 27, 2023

What's the most straight forward way of downloading LLaMA2, and training it with additional documents?

I have a whole host of personal pdf's and documentation that I would love to be able to ask questions about.

ubj · on July 27, 2023

This may be relevant:

https://www.sematic.dev/blog/tuning-and-testing-llama-2-flan...

It's the most straightforward explanation I've found so far. I'd love to hear if anyone's found something better though.

accrual · on July 27, 2023

Does this mean it may be possible to self-host a ChatGPT clone assuming you have a 70B model? I've used a 13B model with LLaMA1 and it's surprisingly good, but still nowhere near ChatGPT for coding questions.

lhl · on July 27, 2023

You will want to look at HumanEval (https://github.com/abacaj/code-eval) and Eval+ (https://github.com/my-other-github-account/llm-humaneval-ben...) results for coding.

While Llama2 is an improvement over LLaMA v1, it's still nowhere near even the best open models (currently, sans test contamination, WizardCoder-15B, a StarCoder fine tune is at top). It's really not a competition atm though, ChatGPT-4 wipes the floor for coding atm.

riku_iki · on July 27, 2023

this all numbers can be missleading, and simply indicate that gpt have these tasks in training data, and another model doesn't.

lhl · on July 27, 2023

While there's some contamination, it's not like the community isn't aware of it. For example, here's this discussion: https://huggingface.co/sahil2801/replit-code-instruct-glaive...

This was the HumanEval contamination one dev measured: ``` replit_glaive: 56.71% replit: 7.32% wizard: 4.88% ```

From the WizardCoder paper https://arxiv.org/pdf/2306.08568.pdf you can see that it hits SOTA (for open models) in not just HumanEval and HumanEval+, but also MBPP and DS-1000 as well, so it's not a one off.

For those interested in reading more about various considerations for coding models, I highly recommend reading the MSR phi-1 paper: https://arxiv.org/pdf/2306.11644.pdf

Looking forward to if they ever publish code/model/dataset since it has extremely strong performance trained on a very small number of tokens very manageable 1.3B and 350M parameter models.

riku_iki · on July 27, 2023

> in not just HumanEval and HumanEval+, but also MBPP and DS-1000 as well, so it's not a one off.

and how do you know all these benchmarks not leaked? I think they all scrapped from web sites, the same as training data for LLM, so risk of contamination is extremely high.

The best way to measure this is through synthetic datasets, which generate new tasks every time and model can't memorize them during training. One example is BigBench has multiple such tasks, but researchers usually(always) not regenerating those datasets.

marcosdumay · on July 27, 2023

I imagine that if you take the time to specialize it, you suddenly have a model that is better than anything from the large players on all the cases that you care about.

But, well, I am currently not hyped enough about it to actually try.

ramesh31 · on July 27, 2023

>Does this mean it may be possible to self-host a ChatGPT clone assuming you have a 70B model?

Not only possible but quite easy. Inference for 70B can be done with llama.cpp using CPU only, on any commodity hardware with >64GB of RAM

beefield · on July 27, 2023

I have 64gb on my 5 year old thinkpad. What kind of performance (tokens per sec) I could expect on that nowadays for a 70B model?

jerrygenser · on July 27, 2023

Llama cpp speed is dramatically improved by avx instructions. If your CPU has those it would be much faster than not.

And if it doesn't you need to do some workarounds with compiling and it gets a bit harder to run.

Zambyte · on July 27, 2023

When you say "coding questions" do you mean questions that should be answered by producing code, or questions about code ("explain this")? Or both?

rvz · on July 27, 2023

Possibly. Might need to be further optimized in size and 4-bit quantisation, perhaps and then you have a scaleable and fast self-hosted AI model.

Lets just hope that there won’t be any embarrassing vulnerabilities coming out of this when someone could prompt the model to reveal its own environment variables or API keys or the internal prompt that it is using.

But it seems the $0 free AI models are eating OpenAI’s lunch and Meta so far is winning the race to zero.

knodi123 · on July 27, 2023

Yeah, my experience has been that every one of these freely downloadable models can be measured as "percent of chatgpt quality". and getting up to 85% is shockingly good.

*edit: oops, my brain inserted "by" in the middle of "outperformed chatgpt". I'll leave my wrong comment up as a testament to shame.

fullshark · on July 27, 2023

The reality is probably some queries ChatGPT outperforms and vice versa. Regardless the premise that ChatGPT's secret sauce could be hidden forever is very dead.

thewataccount · on July 27, 2023

The fact Guanaco 33B is at 65% with Vicuna 13B at 70.43% immediately makes these results non-sensical from my own experience with them. Heck from my experience Guanaco 33B is better then Vicuna 33B!

Not to mention GPT4 at 95% and ChatGPT at 89% - I use chatgpt(3.5turbo)/gpt4 daily for work, and I rarely ever bother with 3.5turbo because of how unreliable it's answers are compared to gpt4.

So whatever this is effectively measuring is useless for comparing these models, especially across work types.

exo-pla-net · on July 27, 2023

I wish it were convention to specify the model, such as "gpt-3.5", rather than "ChatGPT", which is a service that hosts multiple models. Talking about ChatGPT creates pointless ambiguity.

(But maybe it's a good filter: if someone is talking about "ChatGPT's" performance, they probably don't have anything useful to say.)

weare138 · on July 27, 2023

Whats up with WizardLM-13B-V1.2? I don't know anything about it but the description says it's based on Llama-2 with only 13B parameters and it's holding it's own in the top 5 with a fraction of the model size.

treprinum · on July 27, 2023

Isn't Wizard one of the uncensored versions like Luna etc.?

api · on July 27, 2023

LLaMA2 seems to compete with ChatGPT 3.5, which is great. It's nowhere near as large as GPT-4 so I would not expect it to be competitive with that.

GPT-4 level models that regular people can run with a reasonable hardware budget are going to require innovations in optimization and model efficiency beyond just quantizing weights. Rumor has it that GPT-4 is a "committee" of ~220G models, which would require ~128GiB VRAM at 4-bit quantization to run each model.

lhl · on July 27, 2023

This was just posted a few hours ago and when I tried it they were neck and neck (for me, LLama 2 won by a 1 question, but it was close): https://llmboxing.com/

It looks like the eval is open sourced so you could easily build a version w/ your own questions for blind testing...

charcircuit · on July 27, 2023

At least when I tried Llama 2's response was always the longer one so it was hard to remain unbiased.

kuprel · on July 27, 2023

Was LLaMA2 potentially trained on the raw text from this page? https://huggingface.co/datasets/tatsu-lab/alpaca_eval/viewer...

mk_stjames · on July 27, 2023

Has anyone published a similar run of benchmarks with llama2 70B but at different quantization levels? I assume this benchmark is evaluated on the base model run at FP16. How much does it lose quantizing to INT8?

kosolam · on July 27, 2023

Is there some free service that allows chatting with the 70b llama2?

Oranguru · on July 27, 2023

Yes, check out: https://huggingface.co/chat/

You can easily opt out of the data sharing.

whinvik · on July 27, 2023

llama2.ai

paxys · on July 27, 2023

There are more ways of evaluating LLMs than there are LLMs. All of the "X is better than Y" statements are pointless unless there is a very clear consensus.

luckystarr · on July 27, 2023

The value of GPT-4 also lies in its stored knowledge. A 70B model can't store that much.

lolinder · on July 27, 2023

The advantage of LLaMA 2 is that a company can fine tune it on the knowledge that they actually care about and then run it on their own hardware without paying API fees or relying on an unstable dependency that's constantly being tweaked.

meepmorp · on July 27, 2023

> without paying API fees or relying on an unstable dependency that's constantly being tweaked.

and without handing a whole bunch of data to a 3rd party and hope they're securing it properly

lhl · on July 27, 2023

Note, that just because you're hosting itself won't mean you're securing it properly... Here's a just published injection attack that only works on open source models (public model weights): https://twitter.com/random_walker/status/1683833600196714497

huggingmouth · on July 28, 2023

There's a world of difference between visiting links hallucinated by an unreliable ai and having a third party store (and possibly sell) every single thing you say to an ai and tying it to your identity, forever.

Also there is nothing about that attack that makes it iinherently only applicable to self hosted models.

lhl · on July 28, 2023

That specific attack requires adversarial inputs crafted crafted against gradients so only works against open models (requires known model weights). There are dangers that include leaking PII from your current context, but also worse if you are using the model with RAG or with other types of system access so I don't think it's as innocuous as you are assuming.

mensetmanusman · on July 27, 2023

That will be possible with cloud AI in the future. On prem will always be less compute capable unless you have your own GPU cluster to rival the FAANGs, that is why meta is releasing this for free.

lolinder · on July 27, 2023

This is the key part of what I said:

> without paying API fees or relying on an unstable dependency that's constantly being tweaked

I see no evidence that this part will be possible with OpenAI. Usage fees will always be a thing because that's how they make money, and based on what I've heard from people who have actually tried to build on their APIs, I would not trust them to keep the model stable. There are always new safety features they need to add, and those changes break things.

golergka · on July 27, 2023

But unlike ChatGPT, it's still exclusively English, right?

ChatGTP · on July 28, 2023

We cannot let the Chinese or Russians access this tech \s

0xbadc0de5 · on July 27, 2023

And Vicuna-33B is not far behind - which you can actually run on a 24GB 3090/4090 GPU unlike LLaMAv2-70B. Although for a lot of tasks, Guanaco outperforms Vicuna.

iandanforth · on July 27, 2023

Isn't the chat version of llama 2 trained on gpt-4 output, hence it's non-commercial license (as opposed to the base model) or am I just making things up?

coldblues · on July 27, 2023

Awesome. What's great is that it can be unfiltered as well. So we'll be able to have models without all that incessant apologizing.

nmfisher · on July 27, 2023

I haven't had a chance to use the GPT-4 API yet - is it that much better than the GPT-4 available via ChatGPT? Or am I misunderstanding?

Fergusonb · on July 27, 2023

Anecdotal evidence here - I find that the API is less likely to ask questions about what you are doing and get straight to the answer.

For example, if I were to ask how to do something with burp it will just answer instead of going into the "as an AI" monologue.

fzzzy · on July 27, 2023

Do something with burp?

NegativeK · on July 27, 2023

Burpsuite.

freedomben · on July 27, 2023

ChatGPT uses the GPT-4 API, so it's the same. With the API directly though you can change the system prompt, which can enable better results if you know what you're doing.

bkanber · on July 27, 2023

ChatGPT is a wrapper to the GPT Completion API with some sane defaults. With a new beta feature you can edit the system prompt via ChatGPT, but you still can't adjust the other parameters you can reach with the API.

bazmattaz · on July 27, 2023

ChatGPT uses the GPT-4, but there are conspiracy theories circling that ChatGPT is neutered and thus not as good as GPT-4 through the API. The theory being that OpenAI are thottling the free version of GPT-4 (ChatGPT)

bestcoder69 · on July 27, 2023

Free ChatGPT runs 3.5. You have to upgrade to plus to use GPT-4. The APIs seem close to ChatGPT, but it’s a little opaque what they’re actually doing. If you inspect network requests the models are named something like “chat-render-3.5” instead of the API model names.

I’d imagine OpenAI might run experiments on ChatGPT that they wouldn’t on the API, to avoid breaking 3rd party applications unannounced.

cosmojg · on July 27, 2023

That's, uh, not a conspiracy theory. The free version of ChatGPT uses an entirely different model on the backend.

zo1 · on July 27, 2023

I use the GPT4 API and still think it got neutered since I first started using it.

krisknez · on July 27, 2023

I've played a little bit around llama2 and gpt3.5 is still better but llama2 is not far behind.

seydor · on July 27, 2023

this can apparently run on 48GB

treprinum · on July 27, 2023

2xA6000 NVLinked Ampere can run 70B 8-bit which is almost as good as fp16. I bought another A6000 just for that.

lolinder · on July 27, 2023

When quantized to 4 bits, yes. You lose some quality by doing that, though, as compared to the full f16.

evilduck · on July 27, 2023

From what I've gathered when reading up on this topic, if RAM is your constraint the common thought has been that higher parameter models quantized down to smaller sizes will outperform lower parameter models running at higher quantization, i.e. it may still be preferable to use the 70B Llama model quantized to 4-bits than something like an unquantized f16 Falcon 40B or the "coming soon" f16 33B Llama2.

lolinder · on July 27, 2023

Yes, that is true! But you lose enough performance that comparisons to GPT-3.5 stop working.

jacknews · on July 27, 2023

on their own metric?