Then they are not the best. Most users aren't prompt engineers; they grew up expecting to enter search terms into Google and get a result. If it's the case that OpenAI or Anthropic are best able to interpret user intent, there's a good argument to be made that they are the best.
If the model trusts the user too much, and the user is wrong, it will weigh the user's input far too heavily and end up with flawed code.
If the model is more independent, it will find the right solution on its own. If you just want a sycophantic model that says yes to everything and follows you even when you're wrong, you'll never end up with a good solution except by luck.
I am using AI to write full projects with complete code generation, and I haven't found any model that comes close to Gemini 2.5 Pro in coding reasoning and generation.
Other models like qwen3 and glm promise big, but in real code writing they fail badly and get stuck in loops.
The only problem I run into with Gemini right now is that I get throttled every now and then with empty responses, especially around this time of day.
Once you set up a good system prompt on these, nothing really compares.
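For what it's worth, here is a minimal sketch of what "setting up a system prompt" looks like, assuming an OpenAI-compatible chat-completions payload (which most of these models expose); the prompt wording and function name are just illustrative placeholders, not a recommendation:

```python
# Sketch: building a chat payload with a system prompt, assuming an
# OpenAI-compatible chat-completions message format. The prompt text
# below is a placeholder example, not a tested "best" prompt.
SYSTEM_PROMPT = (
    "You are a senior engineer. Push back on flawed requirements, "
    "prefer correct solutions over agreeable ones, and explain trade-offs."
)

def build_messages(user_request: str) -> list[dict]:
    """Prepend the system prompt so it frames every user turn."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_request},
    ]

messages = build_messages("Refactor this module to remove the retry loop.")
# This payload would then be sent to the provider's chat endpoint.
```

The point of the system role is that it biases every turn of the conversation, which is why a model that follows it well can outperform one with better raw benchmark numbers.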
Most of the models you see with high benchmark scores aren't even comparable on real tasks.
qwen3 and deepseek r1 aren't even 1/10 as good as Gemini 2.5 Pro.