They do have to pay that.

But if it's not fair use, they'd need to negotiate a custom license on top of that, for every single thing they use.



Weird, I haven't gotten a check from OpenAI, Meta, Anthropic, or any other AI company for any of my works yet, nor have any of my writer, musician, developer, or photographer friends who also self-publish without permissive licenses that would allow for such use. Are you sure they have to compensate creators for the material they use for training, or are you misunderstanding how copyright licensing works in the United States? All of us put our contact methods on our works so folks can properly license them for use, yet none of us have had anyone reach out to do so for AI training. It's almost like there's a fundamental mismatch between what AI companies are willing to pay (nothing) and what the humans who created this stuff would like to receive for its indefinite use in training what these companies claim are the trillion-dollar businesses of the future that will revolutionize humanity (i.e., house money).

If it's fair use for OpenAI to steal content wholesale without fair compensation (as decided by the creator, unless they have granted the management of that license to a third party) just to train AI models, then that opens a Pandora's box where anyone can steal content to train their own models, creating an environment where copyright is basically meaningless. On the other hand, making it not fair use opens a different Pandora's box, where these models have to be trained in fundamentally different ways to reach the same outcome, and where countries like China, which notoriously ignore copyright law, can leap ahead of the industry.

Almost like the problem is less AI and more overly broad copyright law. Maybe the compromise is slashing the copyright term back down to something reasonable, like twenty to fifty years, the way we handle patents.


> Weird, I haven't gotten a check from OpenAI, Meta, Anthropic, or any other AI company for any of my works yet, nor have any of my writer, musician, developer, or photographer friends who also self-publish without permissive licenses that would allow for such use.

Can you tell me the specific number of dollars that would be?

I interpreted "pay the price of each copyrighted work" as the sale price, i.e., as a criticism of things like Meta's piracy.

If there were a mandatory licensing regime that AI companies could use, with an exact answer for what the payment would be, I think it might make sense to call that license "the price." But in today's world it's very confusing to use "the price" to talk about a hypothetical negotiation that hasn't happened yet, where many, many works would never have a number available at all.


Where do they have to pay that?

Where have they paid for each artwork from DeviantArt, paheal, etc. that they trained Stable Diffusion on?

Where have they paid for each independent blog post that they trained ChatGPT on?

Yes, they've made a few deals with specific companies that host a large amount of content. That's a far cry from paying a fair price for each copyrighted work they ingest. Nearly everything on the Internet is copyrighted, because of the way modern copyright works, and they have paid for nearly none of it.


Also, OpenAI only started making deals (and mostly with news publishers) after the NYT lawsuit.

https://www.npr.org/2025/01/14/nx-s1-5258952/new-york-times-...

They didn't even consider doing this before. They still, as far as I know, haven't paid a dime for any book, or for any art beyond stock photography.

The lawsuit is still ongoing; if OpenAI loses, it might spell doom for the legal production and use of LLMs as a whole. There isn't enough open, free data out there to make state-of-the-art AI.


> There isn't enough open, free data out there to make state-of-the-art AI.

But there are models trained on legal content (like Wikipedia or StackOverflow). Also, no human needs to read millions of pirated books to become intelligent.


> But there are models trained on legal content (like Wikipedia or StackOverflow)

Literally all of them are trained on Wikipedia and SO. But /none/ of them are trained /only/ on Wikipedia and SO. They need much more than that.

> Also, no human needs to read millions of pirated books to become intelligent.

Obviously, LLM architectures descended from GPT-2/3 are not learning the way humans do.

Nothing remotely good in the world of LLMs has ever been trained on a moderate, more human-scoped amount of data. They're all trained on trillions of tokens.

Models trained on less than 1T tokens are experimental jokes with no real use to offer.

You'll notice that even so-called "open data" LLMs like Olmo are, in fact, also trained on copyrighted data; datasets like Common Crawl claim fair use over anything that can be accessed from a web browser.

And then there's the whole notion of laundering data by training on synthetic data generated by another LLM. All the so-called "open" LLMs include a very significant amount of LLM-generated data. If you agree to the notion that LLMs trained on copyrighted work are a form of IP infringement and not fair use, then training on their output is just data laundering and doesn't fix the issue.


> If you agree to the notion that LLMs trained on copyrighted work are a form of IP infringement and not fair use, then training on their output is just data laundering and doesn't fix the issue.

It's fuzzy. I could imagine a situation where a primary LLM trained on copyrighted material is a big hazard and can't be released, but carefully monitored and filtered output could be declared copyright-safe, and then used to make a copyright-safe secondary LLM.
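
As a toy sketch of what that monitoring/filtering step might look like: screen the primary model's output for near-verbatim overlap with the protected corpus before reusing it as training data. The 8-gram window and 0.1 threshold below are made-up illustration values, not anyone's real pipeline.

  # Toy sketch: keep only synthetic samples with low character n-gram
  # overlap against a reference corpus. Window size and threshold are
  # illustrative assumptions, not a real pipeline's settings.

  def char_ngrams(text: str, n: int = 8) -> set[str]:
      """Lowercase, whitespace-normalized character n-grams of `text`."""
      text = " ".join(text.lower().split())
      return {text[i:i + n] for i in range(max(len(text) - n + 1, 0))}

  def overlap_ratio(sample: str, reference: set[str], n: int = 8) -> float:
      """Fraction of the sample's n-grams that also appear in the reference."""
      grams = char_ngrams(sample, n)
      return len(grams & reference) / len(grams) if grams else 0.0

  def filter_synthetic(samples: list[str], reference_texts: list[str],
                       threshold: float = 0.1) -> list[str]:
      """Drop synthetic samples that look copied from the reference corpus."""
      reference: set[str] = set()
      for text in reference_texts:
          reference |= char_ngrams(text)
      return [s for s in samples if overlap_ratio(s, reference) < threshold]

Real decontamination and dedup filters are far more involved (suffix arrays, fuzzy hashing, semantic similarity), but the shape is the same: measure overlap against the corpus you're worried about and keep only low-overlap output. Whether a court would treat that as curing the infringement is exactly the fuzzy part.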



