BIG AI (Litigation) News: Major decision out of the Northern District of California. AI training on copyrighted material is fair use under the Copyright Act, but obtaining the works by piracy (e.g., from a pirate website) is a separate violation. 1️⃣ Pirated copies used by Anthropic violated copyright, but to the extent Anthropic had purchased copies, using those was fair use. 2️⃣ Storing the pirated copies indefinitely was also infringement. 3️⃣ Buying the books after pirating them doesn't make everyone whole, but it might affect the statutory damages. 4️⃣ Fair use includes the transformative use of the works to train AI models. The end result is that this is probably more beneficial to the authors who brought the claim, given the statutory damages + class-action potential, but it also sets a path forward for fair use and transformative use in training. (Link to the opinion in comments!) Mark Lemley James Gatto
Copyright Law for AI Content Training
Summary
The evolving discussion about copyright law for AI content training focuses on how copyrighted materials can be used to teach AI systems while respecting intellectual property rights. Recent rulings indicate that training AI on copyrighted works may qualify as fair use if done lawfully and transformatively, but using pirated content remains a clear infringement.
- Source data lawfully: Always ensure that training datasets are obtained through legitimate channels to avoid copyright infringement and potential legal consequences.
- Emphasize transformative use: Structure AI training in ways that significantly transform the original content, as courts are more likely to consider this fair use.
- Ensure compliance: Develop practices to trace the provenance of your training data, as transparency and adherence to licensing requirements are critical for legal AI model development (see the provenance sketch after this list).
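To make the provenance bullet above concrete, here is a minimal sketch of what recording the origin of a training document could look like before it enters a dataset. This is an illustration under assumed names (ProvenanceRecord, record_provenance, and all field names are hypothetical), not any vendor's actual schema or a legal compliance tool.

```python
# Minimal provenance record for a training document. Illustrative only:
# the schema and field names are assumptions, not an industry standard.
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class ProvenanceRecord:
    doc_id: str
    source: str         # where the text came from (vendor, purchase receipt, URL)
    license_terms: str  # the license or legal basis for using the work
    acquisition: str    # e.g. "purchased", "licensed", "public-domain"
    sha256: str         # hash of the exact bytes, so the copy is verifiable later
    recorded_at: str    # UTC timestamp of when provenance was captured

def record_provenance(doc_id: str, text: str, source: str,
                      license_terms: str, acquisition: str) -> ProvenanceRecord:
    """Hash the document and capture how it was obtained before training on it."""
    return ProvenanceRecord(
        doc_id=doc_id,
        source=source,
        license_terms=license_terms,
        acquisition=acquisition,
        sha256=hashlib.sha256(text.encode("utf-8")).hexdigest(),
        recorded_at=datetime.now(timezone.utc).isoformat(),
    )

if __name__ == "__main__":
    rec = record_provenance(
        doc_id="book-0001",
        text="Full text of a lawfully purchased, digitized book...",
        source="publisher-direct",
        license_terms="purchased print copy, digitized in-house",
        acquisition="purchased",
    )
    print(json.dumps(asdict(rec), indent=2))  # append to an audit log / manifest
```

Hashing the exact bytes is the useful part of a record like this: it lets you later demonstrate which version of a work was, or was not, in a given corpus.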
-
The era of “train now, ask forgiveness later” is over. The U.S. Copyright Office just made it official: the use of copyrighted content in AI training is no longer legally ambiguous. It is becoming a matter of policy, provenance, and compliance. This report won’t end the lawsuits, but it reframes the battlefield.

What it means for LLM developers:
• The fair use defense is narrowing: “Courts are likely to find against fair use where licensing markets exist.”
• The human analogy is rejected: “The Office does not view ingestion of massive datasets by a machine as equivalent to human learning.”
• Memorization matters: “If models reproduce expressive elements of copyrighted works, this may exceed fair use.” (A toy memorization check follows this post.)
• Licensing isn’t optional: “Voluntary licensing is likely to play a critical role in the development of AI training practices.”

What it means for enterprises:
• Risk now lives in the stack: “Users may be liable if they deploy a model trained on infringing content, even if they didn’t train it.”
• Trust will be technical: “Provenance and transparency mechanisms may help reduce legal uncertainty.”
• Safe adoption depends on traceability: “The ability to verify the source of training materials may be essential for downstream use.”

Here’s the bigger shift:
→ Yesterday: Bigger models, faster answers
→ Today: Trusted models, traceable provenance
→ Tomorrow: Compliant models, legally survivable outputs

We are entering the age of AI due diligence. In the future, compliance won’t slow you down. It will be what allows you to stay in the race.
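As a toy illustration of the memorization bullet above: one crude but common signal is whether a model's output reproduces long verbatim word spans from a source text. The function name, the 8-word threshold, and the whitespace tokenization below are assumptions for illustration, not a legal test or the Copyright Office's methodology.

```python
# Toy verbatim-memorization check: report word n-grams that appear in both a
# copyrighted source and a model's output. Purely illustrative; real audits
# use normalization, fuzzy matching, and scale-out indexing.
def verbatim_ngrams(source: str, output: str, n: int = 8) -> set[str]:
    """Return the word n-grams shared by the source text and the output."""
    def ngrams(text: str) -> set[str]:
        words = text.lower().split()
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
    return ngrams(source) & ngrams(output)

source = "it was a bright cold day in april and the clocks were striking thirteen"
output = "The story opens: it was a bright cold day in april and the clocks struck."
overlap = verbatim_ngrams(source, output)
print(overlap)  # non-empty set -> the output copies an 8-word span verbatim
```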
-
On Monday, a United States District Court ruled that training LLMs on copyrighted books constitutes fair use. A number of authors had filed suit against Anthropic for training its models on their books without permission. Just as we allow people to read books and learn from them to become better writers, but not to regurgitate copyrighted text verbatim, the judge concluded that it is fair use for AI models to learn from books in the same way. Indeed, Judge Alsup wrote that the authors’ lawsuit is “no different than it would be if they complained that training schoolchildren to write well would result in an explosion of competing works.” While it remains to be seen whether the decision will be appealed, this ruling is reasonable and will be good for AI progress. (Usual caveat: I am not a lawyer and am not giving legal advice.)

AI has massive momentum, but a few things could put progress at risk:
- Regulatory capture that stifles innovation, especially open source
- Loss of access to cutting-edge semiconductor chips (the most likely cause would be war breaking out in Taiwan)
- Regulations that severely impede access to data for training AI systems

Access to high-quality data is important. Even though the mass media tends to talk about the importance of building large data centers and scaling up models, when I speak with friends at companies that train foundation models, many describe a very large amount of their daily challenges as data preparation. Specifically, a significant fraction of their day-to-day work follows the usual Data-Centric AI practices of identifying high-quality data (books are one important source), cleaning data (the ruling describes Anthropic taking steps like removing book pages' headers, footers, and page numbers; a generic sketch of this kind of cleanup follows this post), carrying out error analyses to figure out what types of data to acquire more of, and inventing new ways to generate synthetic data. I am glad that a major risk to data access just decreased.

Appropriately, the ruling further said that Anthropic’s conversion of books from paper to digital format, a step needed to enable training, was also fair use. However, in a loss for Anthropic, the judge indicated that, while training on data that was acquired legitimately is fine, using pirated materials (such as texts downloaded from pirate websites) is not fair use. Thus, Anthropic may still be liable on this point. Other LLM providers, too, will now likely have to revisit their practices if they use datasets that may contain pirated works.

Overall, the ruling is positive for AI progress. Perhaps the biggest benefit is that it reduces ambiguity with respect to AI training and copyright and (if it stands up to appeals) makes the roadmap for compliance clearer... [Truncated due to length limit. Full text: https://lnkd.in/gAmhYj3k ]
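Since the post mentions the ruling's description of Anthropic removing headers, footers, and page numbers, here is a generic sketch of that kind of page cleanup. It is not Anthropic's actual pipeline; the regex patterns and the "near the page edge" heuristic are assumptions, and real digitized books need much more robust handling.

```python
# Illustrative cleanup of a digitized book page: drop bare page numbers and
# all-caps running headers/footers near the page edges. A rough sketch only.
import re

PAGE_NUMBER = re.compile(r"^\s*\d+\s*$")  # a line that is only a page number
RUNNING_HEADER = re.compile(r"^\s*(CHAPTER\s+\w+|[A-Z][A-Z .'\-]{3,})\s*\d*\s*$")

def clean_page(page_text: str) -> str:
    lines = page_text.splitlines()
    kept = []
    for i, line in enumerate(lines):
        at_edge = i < 2 or i >= len(lines) - 2  # headers/footers hug the edges
        if at_edge and (PAGE_NUMBER.match(line) or RUNNING_HEADER.match(line)):
            continue  # discard layout artifacts, keep the body text
        kept.append(line)
    return "\n".join(kept).strip()

page = "THE GREAT NOVEL        17\n\nIt was a bright cold day in April...\n\n17"
print(clean_page(page))  # -> "It was a bright cold day in April..."
```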
-
U.S. District Judge William Alsup ruled that Anthropic’s use of copyrighted books to train its AI model, Claude, qualifies as fair use. This decision addresses the lawsuit filed by authors Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson, who alleged that Anthropic infringed their copyrights by using pirated versions of their books for AI training.

Judge Alsup concluded that the AI training process was “exceedingly transformative,” comparing it to a human writer learning from existing works to create new content. He emphasized that the AI’s purpose was not to replicate or replace the original works but to generate distinct, innovative outputs.

However, the court found that Anthropic infringed copyrights by storing more than 7 million pirated books in a centralized repository, an activity not directly tied to the AI training process. That conduct was deemed to fall outside fair use, and a trial is scheduled for December to determine potential damages, which could reach up to $150,000 per infringed work.

The ruling provides legal clarity for AI developers, affirming that using copyrighted materials for training purposes can be considered fair use if the process is transformative. However, it also underscores the importance of sourcing training data lawfully, as the unauthorized acquisition and storage of copyrighted works remain subject to infringement claims. While this ruling is noteworthy, the plaintiffs will appeal, so Bartz v. Anthropic PBC isn’t over; it’s really just getting started.