I agree with your other points, but why would you think ChatGPT was not given all the data on the internet?
If you aren't storing the text, the only thing stopping you from retrieving every page that can be found on the internet is a small amount of money.
I'm pretty certain that OpenAI has a lot more than a small amount of money.
You're severely underestimating how much content is on the internet and how hard it would be to crawl and index it all. OpenAI used the Common Crawl dataset, which is already pretty unwieldy and represents an amalgamation of data gathered over several years by many crawlers.
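For a sense of what "using Common Crawl" looks like in practice, here's a rough sketch of pulling one captured page out of its public CDX index. The crawl ID, example URL, and error handling are illustrative assumptions on my part, not anything specific to what OpenAI did:

```python
# Rough sketch: look up one URL in Common Crawl's public CDX index and pull the
# captured page bytes out of the corresponding WARC file with a Range request.
# The crawl ID and example URL below are illustrative assumptions.
import gzip
import io
import json

import requests

CRAWL_ID = "CC-MAIN-2023-06"  # one monthly crawl out of many -- not "the whole internet"
INDEX_URL = f"https://index.commoncrawl.org/{CRAWL_ID}-index"


def lookup(url: str) -> list[dict]:
    """Return index records for `url` in this one crawl (empty if never captured)."""
    resp = requests.get(INDEX_URL, params={"url": url, "output": "json"}, timeout=30)
    if resp.status_code == 404:  # the CDX server returns 404 when there are no captures
        return []
    resp.raise_for_status()
    return [json.loads(line) for line in resp.text.splitlines()]


def fetch_capture(record: dict) -> bytes:
    """Fetch just this capture's bytes from the large WARC file it lives in."""
    start = int(record["offset"])
    end = start + int(record["length"]) - 1
    resp = requests.get(
        f"https://data.commoncrawl.org/{record['filename']}",
        headers={"Range": f"bytes={start}-{end}"},
        timeout=60,
    )
    resp.raise_for_status()
    # Each WARC record is independently gzipped, so the ranged slice decompresses on its own.
    return gzip.GzipFile(fileobj=io.BytesIO(resp.content)).read()


if __name__ == "__main__":
    records = lookup("example.com/")
    if records:
        print(fetch_capture(records[0])[:500])
    else:
        print("no captures of that URL in this crawl")
```

And that's just one monthly snapshot; multiply by the years of crawls that make up the full dataset and the "just spend a bit of money" framing starts to look a lot less simple.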
There’s lots of paywalled content, and other content hidden behind logins and group memberships (e.g. Facebook posts, university alumni portals, university course portals).
Even the paywall issue alone is a problem: I can’t see how they could automate paywall signups at scale. Each paywall form is different, may require a local phone number in a different country to receive a verification text, and so on.