It looks like ChatGPT's output length is 827, while LLaMA 2's is more than double that, at 1790.
Disclaimer from the site:
> Caution: GPT-4 may favor models with longer outputs and/or those that were fine-tuned on GPT-4 outputs.
> While AlpacaEval provides a useful comparison of model capabilities in following instructions, it is not a comprehensive or gold-standard evaluation of model abilities. For one, as detailed in the AlpacaFarm paper, the auto annotator winrates are correlated with length.
Also, Llama 2 is still a few percentage points below GPT-4, and that gap is not as small as it sounds, because performance is roughly logarithmic in training compute: each additional percentage point requires an exponentially greater investment in pretraining compute. Llama 2 was pretrained on 2 trillion tokens -- a significant investment in compute, for sure, but still not enough to get close to GPT-4.
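To make that arithmetic concrete, here is a minimal sketch assuming a benchmark score that grows as a + b * log10(compute); the slope value is purely illustrative, not a measured constant from any scaling-law paper.

```python
# Hedged sketch: assume a benchmark score that grows roughly logarithmically
# with pretraining compute, score(C) = a + b * log10(C).
# The slope below is a made-up illustrative value, not a measured constant.
SLOPE = 5.0  # hypothetical: +5 score points per 10x increase in compute

def compute_multiplier_for_gap(gap_points: float, slope: float = SLOPE) -> float:
    """Factor by which compute must grow to close `gap_points` under the log model."""
    return 10 ** (gap_points / slope)

# Under this assumption, closing a "few percentage points" (say 4) would
# require roughly 6x more pretraining compute than was already spent.
print(compute_multiplier_for_gap(4.0))  # ~6.3
```

The exact multiplier depends entirely on the assumed slope; the point is only that under a logarithmic relationship, a fixed gap in score translates into a multiplicative, not additive, gap in compute.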