
Where do they have to pay that?

Where have they paid for each artwork from DeviantArt, paheal, etc. that they trained Stable Diffusion on?

Where have they paid for each independent blog post that they trained ChatGPT on?

Yes, they've made a few deals with specific companies that host a large amount of content. That's a far cry from paying a fair price for each copyrighted work they ingest. Nearly everything on the Internet is copyrighted, because of the way modern copyright works, and they have paid for nearly none of it.



Also, OpenAI only started making deals (and mostly with news publishers) after the NYT lawsuit.

https://www.npr.org/2025/01/14/nx-s1-5258952/new-york-times-...

They didn't even consider doing this before. They still, as far as I know, haven't paid a dime for any book, or for any art beyond stock photography.

The lawsuit is still ongoing; if OpenAI loses, it might spell doom for the legal production and usage of LLMs as a whole. There isn't enough open, free data out there to make state of the art AI.


> There isn't enough open, free data out there to make state of the art AI.

But there are models trained on legal content (like Wikipedia or StackOverflow). Also, no human needs to read millions of pirated books to become intelligent.


> But there are models trained on legal content (like Wikipedia or StackOverflow)

Literally all of them are trained on Wikipedia and SO. But /none/ of them are /only/ trained on Wikipedia and SO. They need much more than that.

> Also, no human needs to read millions of pirated books to become intelligent.

Obviously, LLM architectures inspired by GPT-2/3 don't learn the way humans do.

There has never been anything remotely good in the world of LLMs that could be said to have been trained on a moderate, more human-scoped amount of data. They're all trained on trillions of tokens.

Models trained on less than 1T tokens are experimental jokes with no real use to offer.

You'll notice that even so-called "open data" LLMs like OLMo are, in fact, also trained on copyrighted data: datasets like Common Crawl claim fair use over anything that can be accessed from a web browser.

And then there's the whole notion of data laundering: training on synthetic data generated by another LLM. All the so-called "open" LLMs include a very significant amount of LLM-generated data. If you agree with the notion that training LLMs on copyrighted work is a form of IP infringement and not fair use, then training on their output is just data laundering and doesn't fix the issue.
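
To make the laundering pipeline concrete, here's a toy sketch in Python, with a bigram Markov chain standing in for an LLM (the corpus is a placeholder and none of this is any lab's actual code):

    import random
    from collections import defaultdict

    def train(tokens):
        # "Train" a bigram model: map each token to its observed successors.
        model = defaultdict(list)
        for a, b in zip(tokens, tokens[1:]):
            model[a].append(b)
        return model

    def generate(model, start, length):
        out = [start]
        for _ in range(length):
            successors = model.get(out[-1])
            if not successors:
                break
            out.append(random.choice(successors))
        return out

    # 1. The teacher is trained directly on the copyrighted corpus.
    copyrighted_corpus = "the quick brown fox jumps over the lazy dog".split()
    teacher = train(copyrighted_corpus)

    # 2. Synthetic data is sampled from the teacher.
    synthetic = generate(teacher, copyrighted_corpus[0], length=10_000)

    # 3. The student never touches the original corpus, only the teacher's
    #    output -- yet everything it learns is derived from that corpus.
    student = train(synthetic)

The student's weights are "clean" on paper, but the information flowed straight through the teacher.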


> If you agree with the notion that training LLMs on copyrighted work is a form of IP infringement and not fair use, then training on their output is just data laundering and doesn't fix the issue.

It's fuzzy. I could imagine a situation where a primary LLM trained on copyrighted material is too big a hazard to release, but its carefully monitored and filtered output could be declared copyright-safe and then used to train a copyright-safe secondary LLM.
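
As a toy sketch of what that filtering might look like, here's a crude verbatim-overlap check in Python (the 12-word window, the corpus, and the samples are all placeholder assumptions, not anyone's actual pipeline):

    def ngrams(text, n=12):
        # Every n-word window in the text, as strings.
        words = text.split()
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

    copyrighted_docs = ["..."]    # placeholder: the primary model's training set
    generated_samples = ["..."]   # placeholder: raw output from the primary LLM

    # Index every 12-word window of the source corpus.
    index = set()
    for doc in copyrighted_docs:
        index |= ngrams(doc)

    def looks_safe(sample, index, n=12):
        # Reject samples sharing a verbatim 12-word window with the source
        # corpus -- a crude proxy for regurgitated copyrighted text.
        return index.isdisjoint(ngrams(sample, n))

    clean = [s for s in generated_samples if looks_safe(s, index)]

Whether a court would accept something like that as "copyright-safe" is exactly the fuzzy part.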



