A ruling in a U.S. District Court has effectively given permission to train artificial intelligence models using copyrighted works, in a decision that’s extremely problematic for creative industries.
Content creators and artists have been suffering for years, with AI companies scraping their sites and scanning books to train large language models (LLMs) without permission. That data is then used for generative AI and other machine learning tasks, and monetized by the scraping company with no compensation for the original host or author.
Following Tuesday's ruling by the U.S. District Court for the Northern District of California, companies are being given free rein to train on just about any published media they want to harvest.
The ruling is based on a lawsuit from Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson against Anthropic dating back to 2024. The suit accused the company of using pirated material to train its Claude AI models.
This included Anthropic creating digital copies of printed books for AI model training.
The ruling from Judge William Alsup, a judge very familiar to readers of AppleInsider, favors each side in various ways. However, the weight of the decision clearly sides with Anthropic and AI scrapers in this instance.
Under the ruling, Judge Alsup says that the copies used to train specific LLMs were justifiable as fair use.
“The technology at issue was among the most transformative many of us will see in our lifetimes,” Alsup commented.
Converting physical copies from a print library into a digital library was also deemed fair use, as was using that content to train LLMs.
Alsup compared the authors' complaint to making the same argument against an effort to teach schoolchildren to write well. It's not clear how that applies, given that artificial intelligence models are not considered "schoolchildren" in any legal sense.
In that argument, Alsup ruled that the Copyright Act is intended to advance original works of authorship, not to “protect authors against competition.”
The authors did see a small amount of success on the matter of pirated works. Creating a library of pirated digital books, even if they are not used to train a model, does not constitute fair use.
That remains the case even where Anthropic later bought a legitimate copy of a book it had initially pirated.

On the piracy claim, the court will hold a trial to determine damages against Anthropic.
In May, it was reported that Apple was working with Anthropic to integrate the Claude Sonnet model into a new AI-powered version of Xcode, to help reshape developer workflows.
Bad news for content makers
The ruling is terrible for artists, musicians, and writers. Other professions whose livelihoods machine learning models could endanger will have issues too, including, perhaps, judges who believe that having once taken a coding class means they know what they're talking about with tech.
AI models take advantage of the hard work and life experiences of media creators, and pass that work off as their own. At the same time, the ruling leaves content producers with few options to combat the phenomenon.
As it stands, the ruling will clearly serve as precedent in other lawsuits in the AI space, especially those dealing with producers of original works that are pillaged for training purposes.
Over the years, AI companies have been attacked for grabbing any data they could to feed their LLMs, including content scraped from the Internet without permission.
This is a problem that manifests in quite a few ways. The most obvious is in generative AI, as the models could be trained to create images in specific styles, which devalues the work of actual artists.
An example of a fightback is the lawsuit from Disney and Universal against Midjourney, which surfaced in early June. The company behind the AI image generator is accused of mass copyright infringement for training its models on images of the studios' most recognizable characters.
The studios unite in calling Midjourney "a bottomless pit of plagiarism," built on the unauthorized use of protected material.
When you have two major media companies that are usually bitter rivals uniting for a single cause, you know it’s a serious issue.
It's also a growing issue for websites and publishers, like AppleInsider. Instead of using a search tool and visiting websites for information, a user can simply ask an AI model for a customized summary, without ever visiting the sites the information was sourced from in the first place.
And that information is often wrong, or combined with data from other sources in ways that pollute the original meaning of the content. For instance, we've seen our tips on how to do something plagiarized with sections reproduced verbatim, then mashed up out of order with material from other sites, producing a procedure that doesn't work.
The question of how to compensate publishers for lost revenue has still not been answered in any meaningful way. Some companies have been trying to stay on the more ethical side of things, with Apple among them.
Apple has offered news publishers millions to license content for training its generative AI. It has also paid for licenses from Shutterstock, which helped develop the visual engines used for Apple Intelligence features.
Major publishers have also taken to blocking AI services from accessing their archives via robots.txt. However, this only stops ethical scrapers, not everyone. And scraping an entire site takes server power and bandwidth, neither of which is free for the hosting site that's getting scraped.
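As a minimal sketch of how that blocking works, a publisher's robots.txt can single out AI crawlers by their publicly documented user-agent strings (GPTBot for OpenAI, ClaudeBot for Anthropic, CCBot for Common Crawl) while leaving everything else alone. Any such list needs ongoing maintenance as new crawlers appear, and compliance is entirely voluntary on the crawler's part:

    # Deny known AI-training crawlers access to the whole site
    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

    # Everything else, including conventional search engine bots, is unaffected
    User-agent: *
    Allow: /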
The ruling also follows an increase in efforts from major tech companies to lobby for a decade-long block on U.S. states introducing AI regulation.
Meanwhile, in the EU, there have been attempts to sign tech companies up to an AI Pact to develop AI in safe ways. Apple is apparently not involved in either effort.