
I'm not a lawyer, but it seems like you stood up a straw man there.

>Just because it's represented as a bunch of numbers does not make it non copyrightable.

Can you give an example of where the bunch of numbers is copyrightable when it's not just a numeric encoding of something that was already copyrightable? Taking music and encoding it as a wav file is not a creative work, but it's a representation of a copyrighted work.

Maybe you could create a long list of numbers and call it an artistic impression, but that's clearly not what AI weights are. I'm interested to hear an example of your copyrightable numbers.



"Can you give an example of where the bunch of numbers is copyrightable when it's not just a numeric encoding of something that was already copyrightable?"

Sure, there are "poems" that consist of just groups of numbers, and they are copyrighted. They are not encodings, just strings of numbers, indistinguishable from any other bunch of numbers. This is just one example, there are lots.

They are enforceable to the degree they're creative, and to the degree the infringing use is also creative.

So you would not be able to sue me for using those numbers in a math equation. You would be able to sue me for reproducing your poem in a book of poems :)

As Feist says, the creativity required for copyright is quite minimal. But it's still only as protectable as it is creative.

Look - AI is not the first thing to have this "issue". The answer remains the same as it always was - it's mostly about the process not the output.

The output mostly matters if it is not intended to be creative (or it's de minimis or ...).

Copyright as it currently exists is weird.

Like if you go to the copyright office and try to register your ssh public key and say "this was generated by ssh-keygen i had nothing to do with it" you may get a different result than if you said "this is my new visually stunning masterpiece, my ssh public key, which was generated with computer help but I used 37 precisely timed keyboard smashes to do it. Prints are available from my gallery for $500"


I fully agree with what you say, with one bit of nuance to point out:

> Like if you go to the copyright office and try to register your ssh public key and say "this was generated by ssh-keygen i had nothing to do with it" you may get a different result than if you said "this is my new visually stunning masterpiece, my ssh public key, which was generated with computer help but I used 37 precisely timed keyboard smashes to do it. Prints are available from my gallery for $500"

The important thing, of course, isn't whether the copyright office refuses to register your copyright, but instead what courts will ultimately do when you attempt to enforce your copyright.

We know the current administrative algorithms used by the copyright offices. We have less clarity on what courts will ultimately do.


The key factor in Feist v. Rural is whether there was any original or creative process in the way the facts were arranged.

Here, there's a whole lot of creative decisions in labelling and guiding of training that produces the weights, so it's reasonable to think it might be copyrightable.

That is, the numbers are a whole lot more original than the issuance of phone numbers or part numbers.


The requirement for expertise doesn’t necessarily imply that setting up parameters for training AI is copyrightable. A normal brick wall, for example, takes skill to build but doesn’t qualify, because the goal is not creative. And if the process doesn’t qualify for copyright, its mechanical output is not going to qualify either.

Labeling training data may qualify for copyright, but if the underlying training data doesn’t taint the output as a derivative work then labeling isn’t going to qualify by itself.

Thus without some new and very generous interpretation AI companies are at best not going to benefit from copyright and at worst may be forced to create all training data in house. My suspicion is this generation of AI companies are in a very difficult situation.


> but if the underlying training data doesn’t taint the output as a derivative work then labeling isn’t going to qualify by itself.

It depends. If each individual training item has a small impact on the output coefficients, then perhaps it's not a derivative work of them. But if there's a large creative process in determining the model training procedure, deciding labelling strategies, and applying those, then perhaps the numbers are strongly derived from those things.


That sounds like wishful thinking; individual training items have a significant impact on the result.

Anyway, suppose you’re building an AI to walk, there’s nothing creative about selecting 9.8m/s/s for gravity that’s simply the ideal value to achieve a desired goal. Labeling an elephant as “Elephant” rather than “coat hanger” is similarly a functional choice.

Just because a person is holding a camera and taking a photo doesn’t mean the result is copyrightable.


> Anyway, suppose you’re building an AI to walk, there’s nothing creative about selecting 9.8m/s/s for gravity that’s simply the ideal value to achieve a desired goal.

Suppose you're not building a strawman, but instead building an AI to be an LLM. The exact sequence of what you choose to do for instruction tuning, and the metrics and labels that you choose, the prompt/response pairs you write, and the loss functions you employ are quite creative. They greatly affect the coefficients and are not simple mechanical steps and are the result of a large amount of creative choice.

We are nowhere near a point where they are an uncreative, mechanical recipe to follow.

> Just because a person is holding a camera and taking a photo doesn’t mean the result is copyrightable.

No, but in the overwhelming majority of circumstances it is. What it depends upon is whether the person holding the camera is making a significant, original creative choice.

I am not sure what courts will decide, but I am certain that there is more creativity and originality employed than you are giving OpenAI et al. credit for.


> not simple mechanical steps and are the result of creative choice

Creative choice requires intentional control over the output across a meaningfully different range of viable possibilities. A bricklayer has a huge range of viable options in the specific brick and its alignment in a wall, but none of those choices are artistically meaningful.

The coefficients are also not in any meaningful sense chosen based on instruction tuning. It’s no more under direct control than the specific arrangements of atoms in the brick wall and is instead the output of a purely mechanical process.

> We are nowhere near a point where they are an uncreative, mechanical recipe to follow.

Thus: We are nowhere near the point where the output is under creative control rather than being the result of a poorly understood mechanical recipe.


I don't really agree.

Here's what the Supremes said in Feist v. Rural:

> Factual compilations, on the other hand, may possess the requisite originality. The compilation author typically chooses which facts to include, in what order to place them, and how to arrange the collected data so that they may be used effectively by readers. These choices as to selection and arrangement, so long as they are made independently by the compiler and entail a minimal degree of creativity, are sufficiently original that Congress may protect such compilations through the copyright laws. Nimmer §§ 2.11[D], 3.03; Denicola 523, n. 38. Thus, even a directory that contains absolutely no protectible written expression, only facts, meets the constitutional minimum for copyright protection if it features an original selection or arrangement.

Alphabetical order wasn't quite enough. But people directing the work that produces the coefficients are doing considerably more creative work than that.

> Thus: We are nowhere near the point where the output is under creative control rather than being the result of a poorly understood mechanical recipe.

No one requires complete creative control of the output. I can spatter paint and have relatively poor control of what's happening, but I am certainly generating a copyrightable work when I engage in creative choices as part of this.


> Alphabetical order wasn’t quite enough. But people directing the work that produces the coefficients are doing considerably more creative work than that.

I agree it’s more effort but the metric isn’t effort so I disagree that qualifies the coefficients as copyrightable. The SHA256 hash of a movie isn’t copyrightable even though the movie itself was.

> No one requires complete creative control

That’s a strawman; there are requirements for creative control. You don’t own copyright to your normal dumps, but you can get copyright from looking down and choosing to take a picture. That’s the low bar for a creativity requirement, but it exists.
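
To make the SHA256 comparison concrete, here is a minimal sketch using Python's standard `hashlib`. The byte strings are placeholders I made up to stand in for a movie file, not anything from this thread. It only shows that the digest is a fixed-length value fully determined by the input bytes, with no choices left to whoever runs it:

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 digest of data as a 64-character hex string."""
    return hashlib.sha256(data).hexdigest()


# Placeholder bytes; a real movie would be read from a file instead.
movie = b"placeholder bytes standing in for a movie file"
edited = movie + b"!"  # the same "movie" with one extra byte

print(digest(movie))
print(digest(edited))                    # a different digest from the line above
print(digest(movie) == digest(movie))    # same input always gives the same digest
```

Whether model weights are mechanical in that same sense is exactly what is being argued here; the sketch only illustrates the hash side of the analogy.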


> I agree it’s more effort but the metric isn’t effort so I disagree that qualifies the coefficients as copyrightable.

I know the metric is no longer effort. But there's a lot of creative choices that I've mentioned that greatly affect the coefficients, even if we don't know what those creative choices are going to do to each film grain in the photograph or coefficient in the matrix.

> You don’t own copyright to your normal dumps

Yes, there's an explicit exemption in LOC's guidelines for things that are the direct output of natural processes.

If you have a lot of choices affecting output, then the output is subject to copyright. Indeed, the Supremes above said that factual compilations can qualify if they involve a "minimal degree of creativity".

In the end, we'll see.


IANAL, but I’d wonder whether ‘creativity’ is really present in labelling - and indeed, mightn’t it be the last thing you want? I’d argue labelling should be strictly factual and reproducible, and ideally following a logical structure… maybe akin to how addresses of buildings might appear in a phone directory…

(Agree that the skill in knowing how to code and guide the training of a model is probably very different though. It’s not just access to compute time that separates me from OpenAI :) )


> Here, there's a whole lot of creative decisions in labelling and guiding of training that produces the weights

Often, labelling is part of large public datasets that are chosen for use for that exact reason, and/or is otherwise not the work of the party claiming copyright in the model.


> Can you give an example of where the bunch of numbers is copyrightable when it's not just a numeric encoding of something that was already copyrightable? Taking music and encoding it as a wav file is not a creative work, but it's a representation of a copyrighted work.

There are random number books and I know one of them has a copyright registration [1].

[1]: https://publicrecords.copyright.gov/detailed-record/7060844



