I think people 100% have the right to use this on their images, but:
> simply acquiring only training data you have permission to use
Currently it's generally infeasible to obtain licenses at the required scale.
When attempting to develop a model that could describe photos for visually impaired users, I even tried to reach out to Getty for a license. They repeatedly told me that they don't license images for machine learning[0].
I think it's easy to say "well, too bad, it doesn't deserve to exist" if you're just thinking about DALL-E 3, but there is a huge number of positive and far less controversial applications of machine learning that benefit from web-scale pretraining and foundation models: spam filtering, tumour segmentation, voice transcription, language translation, defect detection, etc.
I don't believe it's a "doesn't deserve to exist" situation, because these things genuinely can be used for the public good.
However - and this is a big however - I don't believe it deserves the legal protection to be used for profit.
I am of the opinion that if you train your model on data that you do not hold the rights to, your usage should be treated much like fair use. It's fine to use it for your personal projects, for research and education, etc. but it is not OK to use it for commercial endeavors.
> It's fine to use it for your personal projects, for research and education, etc. but it is not OK to use it for commercial endeavors.
Say I train a machine vision model that, after pretraining on ImageNet or similar, detects defects in a material for a small company that manufactures that material. Do you not think that would be fair use, despite being commercial?
To me it seems highly transformative (a defect detection model is entirely outside the original images' purposes) and does not at all impact the market for the works.
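To make that concrete, here's a minimal sketch of the kind of model I mean (PyTorch; the backbone, labels, and training setup are illustrative assumptions, not any specific production system). The ImageNet images contribute only generic pretrained features, and the classifier head is trained on the company's own photos of the material:

    # Hypothetical sketch: adapting an ImageNet-pretrained backbone for
    # binary defect detection. Backbone choice and labels are made up.
    import torch
    import torch.nn as nn
    from torchvision import models

    # Reuse generic visual features learned during ImageNet pretraining.
    model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

    # Swap the 1000-class ImageNet head for a 2-class one: "ok" vs "defective".
    model.fc = nn.Linear(model.fc.in_features, 2)

    # Fine-tune (here, just the new head) on the manufacturer's own labeled
    # images; no ImageNet photo is ever reproduced in the model's output.
    optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()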
Moreover, you said it was "Baffling to see anyone argue against this technology" but it seems there are at least some models (like if my above detector were non-commercial) that you're ethically okay with and that could be affected by this poisoning.
No, I don't generally think it's okay to profit off of the work of others without their consent.
> Moreover, you said it was "Baffling to see anyone argue against this technology" but it seems there are at least some models (like if my above detector were non-commercial) that you're ethically okay with and that could be affected by this poisoning.
Just because I think there are situations where it's not ethically wrong to use someone's work without permission does not mean I think it's ethically wrong for someone to protect their work any way they see fit.
To use an extreme example: I do not think it's wrong for a starving man to steal food. I also do not think it's wrong for someone to defend their food from being stolen, regardless of the morality of the thief's motivation.
> No, I don't generally think it's okay to profit off of the work of others without their consent.
I'd argue that essentially all work is based on the work of others. I can only draw a "car" better than a random guess because I've seen many (individually copyrighted) car designs.
That's not to say we inherently have to treat the use of statistical models the same, but rather that there has to be a line somewhere defining when a new work, while dependent on previous works in aggregate, is sufficiently transformative (not carrying substantial similarity to any particular existing work that played a part in its creation) and can therefore be used by the author to make a living.
That line has to be placed in a way that prioritizes the progress of sciences and useful arts, rather than just enriching rightsholder megacorps like Getty Images/Universal Music. It should certainly allow training something like a tumor segmentation network, rather than rendering it infeasible.
Also, while whether it's morally okay is relevant and worth discussing, I think the question still stands of whether you believe my example would count as fair use, given the transformative nature and lack of impact on the market for the original work.
> Just because I think there are situations where it's not ethically wrong to use someone's work without permission does not mean I think it's ethically wrong for someone to protect their work any way they see fit.
> To use an extreme example: I do not think it's wrong for a starving man to steal food. I also do not think it's wrong for someone to defend their food from being stolen, regardless of the morality of the thieves' motivation.
I personally agree that they have a right to do so, but I don't think it'd be "baffling" that the [starving man/person training a tumor detector] would be against it, and it's likely not a "non-issue" for them to obtain sufficient [food/data] through other means.
Particularly since there are already means to opt out that are respected by scrapers, and this is instead an attempt to do active damage. I guess in the analogy it'd be leaving out poisoned bread, although that's more extreme than I intend.
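For reference, the opt-out I have in mind is robots.txt, which well-behaved crawlers check before fetching anything. A minimal sketch of that check (the bot name and URLs are hypothetical):

    # Minimal sketch of a scraper honoring robots.txt before fetching an
    # image. The user agent and URLs are made up for illustration.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # download and parse the site's robots.txt

    bot = "ExampleDatasetBot"
    image_url = "https://example.com/gallery/photo.jpg"

    if rp.can_fetch(bot, image_url):
        print("allowed:", image_url)
    else:
        print("disallowed by robots.txt:", image_url)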
[0]: https://i.imgur.com/iER0BE2.png