I agree with your other points, but why would you think ChatGPT was not given all the data on the internet?
If you aren't storing the text, the only thing stopping you from retrieving every page that can be found on the internet is a small amount of money.
I'm pretty certain that OpenAI has a lot more than a small amount of money.
You're severely underestimating how much content is on the internet and how hard it would be to crawl and index it all. OpenAI used the Common Crawl dataset, which is already pretty unwieldy and represents an amalgamation of data gathered over several years by many crawlers.
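For a sense of what "using Common Crawl" looks like in practice, here's a rough sketch of pulling one captured page out of its public CDX index. The crawl ID, example URL, and error handling are illustrative assumptions on my part, not anything specific to what OpenAI did:

```python
# Rough sketch: look up one URL in Common Crawl's public CDX index and pull the
# captured page bytes out of the corresponding WARC file with a Range request.
# The crawl ID and example URL below are illustrative assumptions.
import gzip
import io
import json

import requests

CRAWL_ID = "CC-MAIN-2023-06"  # one monthly crawl out of many -- not "the whole internet"
INDEX_URL = f"https://index.commoncrawl.org/{CRAWL_ID}-index"


def lookup(url: str) -> list[dict]:
    """Return index records for `url` in this one crawl (empty if never captured)."""
    resp = requests.get(INDEX_URL, params={"url": url, "output": "json"}, timeout=30)
    if resp.status_code == 404:  # the CDX server returns 404 when there are no captures
        return []
    resp.raise_for_status()
    return [json.loads(line) for line in resp.text.splitlines()]


def fetch_capture(record: dict) -> bytes:
    """Fetch just this capture's bytes from the large WARC file it lives in."""
    start = int(record["offset"])
    end = start + int(record["length"]) - 1
    resp = requests.get(
        f"https://data.commoncrawl.org/{record['filename']}",
        headers={"Range": f"bytes={start}-{end}"},
        timeout=60,
    )
    resp.raise_for_status()
    # Each WARC record is independently gzipped, so the ranged slice decompresses on its own.
    return gzip.GzipFile(fileobj=io.BytesIO(resp.content)).read()


if __name__ == "__main__":
    records = lookup("example.com/")
    if records:
        print(fetch_capture(records[0])[:500])
    else:
        print("no captures of that URL in this crawl")
```

And that's just one monthly snapshot; multiply by the years of crawls that make up the full dataset and the "just spend a bit of money" framing starts to look a lot less simple.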
There’s lots of paywalled content, and other content hidden behind logins and group memberships (e.g. Facebook posts, university alumni portals, university course portals).
Even the paywall issue alone is a problem: I can’t see how they could automate paywall signups at scale. Each paywall form is different, may require a local phone number in a different country to receive a verification text, and so on.