
What do you mean? The paper [1] is full of benchmarks?

[1] GPT-4 Technical Report, https://cdn.openai.com/papers/gpt-4.pdf



The problem is that it's not a research paper, which is what they were publishing previously. That's a bad state of affairs: nothing is detailed in a way that external parties can reproduce or verify through the scientific method.

They can claim the model says 40% fewer "Xbox Live gamer words," and nobody outside the company could validate it.

tl;dr: OpenAI is now a business

Worth watching Yannic talk about the problem and other cool ML topics too: https://www.youtube.com/watch?v=2zW33LfffPc


It's not like this is a closed model, available only to scientists, that you can't benchmark yourself. Benchmarking should be done by third parties anyway; otherwise we have a conflict of interest.
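
To be concrete, here's a minimal sketch of what "benchmark it yourself" could look like, assuming the (pre-1.0) `openai` Python package's ChatCompletion interface; the question/answer pairs are just toy placeholders, not anything from the paper:

    # Minimal sketch: score GPT-4 on your own question set via the API.
    # Assumes the legacy (pre-1.0) `openai` package's ChatCompletion interface.
    import openai

    openai.api_key = "sk-..."  # your API key

    # Toy placeholder benchmark: (question, expected_answer) pairs.
    benchmark = [
        ("What is 17 * 24?", "408"),
        ("Which element has atomic number 26?", "iron"),
    ]

    correct = 0
    for question, expected in benchmark:
        resp = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[{"role": "user", "content": question}],
            temperature=0,  # keep answers as deterministic as possible for scoring
        )
        answer = resp["choices"][0]["message"]["content"]
        if expected.lower() in answer.lower():  # crude substring grading
            correct += 1

    print(f"accuracy: {correct}/{len(benchmark)}")

Substring matching is a crude grader and real benchmarks need far more care, but the point stands: API access, not architecture details, is what you need to check the headline numbers.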


If this were a CPU or a graphics card, sure, let's benchmark it; worst case, you get fewer frames.

Here we'd need to see more about its design and safety; otherwise you may be getting recipes for veggie dishes when what you really wanted was fried chicken.


How would knowing the architecture or safety mechanisms help you decide if it’s going to give incorrect results more than actual testing would?

I’m no LLM expert, but I don’t think you can eyeball the arch and say “that’s going to confuse veggies for fried chicken”.


https://aisnakeoil.substack.com/p/gpt-4-and-professional-ben...

"GPT-4 and professional benchmarks: the wrong answer to the wrong question OpenAI may have tested on the training data. Besides, human benchmarks are meaningless for bots."


Just looking at the pictures and graphs in that paper is enough to be amazed by what they're achieving. The example where they show three pictures of an old monitor plug connected to an iPhone to recharge it, ask GPT-4 what's funny about it, and it answers remarkably accurately, is amazing.


Since we don't have access to this feature, let's be skeptical. It feels like "leading the witness" when you ask what's funny here. Also, if the image comes from a forum or sub dedicated to funny images, could that give it away?

Running multiple prompts would be a stronger test, e.g. "what's going on in this picture?", "what would a person think seeing this image?", etc. (something like the sketch below).
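
Something like this, where `ask_model` is a hypothetical stand-in for whatever multimodal access eventually becomes available:

    # Probe the model with several neutral phrasings instead of one leading
    # question. `ask_model(image_path, prompt)` is a hypothetical callable.
    PROMPTS = [
        "What is going on in this picture?",
        "Describe this image.",
        "What would a person think when seeing this image?",
        "Is there anything unusual about this image?",
        "What's funny about this image?",  # the original, leading phrasing
    ]

    def probe(image_path: str, ask_model) -> dict[str, str]:
        """Return the model's answer for each prompt phrasing."""
        return {prompt: ask_model(image_path, prompt) for prompt in PROMPTS}

    # If the model only surfaces the joke when asked "what's funny", that's
    # weaker evidence of understanding than if it mentions it unprompted.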

GPT-4 is cool as a numbers box, but this isn't reasoning or logic, and without a paper it hasn't been proven either.



