
What do you mean? The paper [1] is full of benchmarks?

[1] GPT-4 Technical Report, https://cdn.openai.com/papers/gpt-4.pdf



The problem is that it's not a research paper, which is what they were publishing previously. That's a bad state of affairs: nothing is detailed in a way that external parties can reproduce or verify through the scientific method.

They can claim the model says 40% fewer "Xbox Live gamer words," and nobody outside the company could validate it.

tl;dr: OpenAI is now a business

Worth watching Yannic talk about the problem and other cool ML topics too: https://www.youtube.com/watch?v=2zW33LfffPc


It's not like this is a closed model, available only to scientists, that you can't benchmark yourself. Benchmarking should be done by third parties anyway; otherwise we have a conflict of interest.
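
To be concrete, here's a minimal sketch of what "benchmark it yourself" could look like, assuming the (pre-1.0) `openai` Python package's ChatCompletion interface; the question/answer pairs are just toy placeholders, not anything from the paper:

    # Minimal sketch: score GPT-4 on your own question set via the API.
    # Assumes the legacy (pre-1.0) `openai` package's ChatCompletion interface.
    import openai

    openai.api_key = "sk-..."  # your API key

    # Toy placeholder benchmark: (question, expected_answer) pairs.
    benchmark = [
        ("What is 17 * 24?", "408"),
        ("Which element has atomic number 26?", "iron"),
    ]

    correct = 0
    for question, expected in benchmark:
        resp = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[{"role": "user", "content": question}],
            temperature=0,  # keep answers as deterministic as possible for scoring
        )
        answer = resp["choices"][0]["message"]["content"]
        if expected.lower() in answer.lower():  # crude substring grading
            correct += 1

    print(f"accuracy: {correct}/{len(benchmark)}")

Substring matching is a crude grader and real benchmarks need far more care, but the point stands: API access, not architecture details, is what you need to check the headline numbers.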


If this were a CPU or a graphics card, sure, let's benchmark it; worst case, you get fewer frames.

Here we'd need to see more about its design and safety; otherwise you may be getting recipes for veggie dishes when what you really wanted was fried chicken.


How would knowing the architecture or safety mechanisms help you decide if it’s going to give incorrect results more than actual testing would?

I’m no LLM expert, but I don’t think you can eyeball the arch and say “that’s going to confuse veggies for fried chicken”.


https://aisnakeoil.substack.com/p/gpt-4-and-professional-ben...

"GPT-4 and professional benchmarks: the wrong answer to the wrong question OpenAI may have tested on the training data. Besides, human benchmarks are meaningless for bots."


Just looking at the pictures and graphs in that paper is enough to be amazed by what they're achieving. The example where they show three pictures of an old monitor plug connected to an iPhone to recharge it, ask GPT-4 what's funny about it, and it answers remarkably accurately, is amazing.


Since we don't have access to this feature, let's be skeptical. It feels like "leading the witness" when you ask what's funny here. Also, if the image comes from a forum or sub dedicated to funny images, could that give it away?

Running multiple prompts would be a stronger test, e.g. "what's going on in this picture?", "what would a person think seeing this image?", etc. (something like the sketch below).
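
Something like this, where `ask_model` is a hypothetical stand-in for whatever multimodal access eventually becomes available:

    # Probe the model with several neutral phrasings instead of one leading
    # question. `ask_model(image_path, prompt)` is a hypothetical callable.
    PROMPTS = [
        "What is going on in this picture?",
        "Describe this image.",
        "What would a person think when seeing this image?",
        "Is there anything unusual about this image?",
        "What's funny about this image?",  # the original, leading phrasing
    ]

    def probe(image_path: str, ask_model) -> dict[str, str]:
        """Return the model's answer for each prompt phrasing."""
        return {prompt: ask_model(image_path, prompt) for prompt in PROMPTS}

    # If the model only surfaces the joke when asked "what's funny", that's
    # weaker evidence of understanding than if it mentions it unprompted.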

GPT-4 is cool as a numbers box, but this isn't reasoning or logic, and without a paper it hasn't been proven either.



