On Monday, a United States District Court ruled that training LLMs on copyrighted books constitutes fair use. A number of authors had filed suit against Anthropic for training its models on their books without permission. Just as we allow people to read books and learn from them to become better writers (though not to regurgitate copyrighted text verbatim), the judge concluded that it is fair use for AI models to learn from books in the same way. Indeed, Judge Alsup wrote that the authors’ lawsuit is “no different than it would be if they complained that training schoolchildren to write well would result in an explosion of competing works.” While it remains to be seen whether the decision will be appealed, this ruling is reasonable and will be good for AI progress. (Usual caveat: I am not a lawyer and am not giving legal advice.)
AI has massive momentum, but a few things could put progress at risk:
- Regulatory capture that stifles innovation, especially open source
- Loss of access to cutting-edge semiconductor chips (the most likely cause would be war breaking out in Taiwan)
- Regulations that severely impede access to data for training AI systems
Access to high-quality data is important. Even though the mass media tend to focus on building large data centers and scaling up models, when I speak with friends at companies that train foundation models, many describe data preparation as a large share of their daily challenges. Specifically, a significant fraction of their day-to-day work follows the usual data-centric AI practices of identifying high-quality data (books are one important source), cleaning data (the ruling describes Anthropic taking steps like removing book pages' headers, footers, and page numbers), carrying out error analyses to figure out what types of data to acquire more of, and inventing new ways to generate synthetic data. I am glad that a major risk to data access just decreased.
Appropriately, the ruling further said that Anthropic’s conversion of books from paper to digital format — a step that’s needed to enable training — was also fair use. However, in a loss for Anthropic, the judge indicated that, while training on data that was acquired legitimately is fine, using pirated materials (such as texts downloaded from pirate websites) is not fair use. Thus, Anthropic may still be liable on this point. Other LLM providers, too, will now likely have to revisit their practices if they use datasets that may contain pirated works. Overall, the ruling is positive for AI progress. Perhaps the biggest benefit is that it reduces ambiguity with respect to AI training and copyright and (if it stands up to appeals) makes the roadmap for compliance clearer. (Full text: https://lnkd.in/gAmhYj3k)
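The cleaning step described in the post above (stripping running headers, footers, and bare page numbers from digitized book pages) can be sketched roughly as follows. This is a minimal illustration, not Anthropic's actual pipeline; the `clean_page` helper and its regex patterns are hypothetical:

```python
import re

def clean_page(page_text: str) -> str:
    """Strip a running header and a trailing page number from one page
    of digitized book text. Patterns are illustrative only."""
    lines = page_text.splitlines()
    # Drop a leading running header, e.g. "CHAPTER THREE   41"
    if lines and re.match(r"^\s*(chapter\s+\w+)?\s*\d*\s*$", lines[0], re.IGNORECASE):
        lines = lines[1:]
    # Drop a trailing line that is nothing but a page number
    if lines and re.fullmatch(r"\s*\d+\s*", lines[-1]):
        lines = lines[:-1]
    return "\n".join(lines).strip()

page = "CHAPTER THREE   41\nIt was the best of times.\n41"
print(clean_page(page))  # prints: It was the best of times.
```

Real pipelines are considerably messier, since header and footer formats vary from book to book; that variability is part of why data preparation consumes so much of practitioners' time.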
AI Copyright Law Guide
Explore top LinkedIn content from expert professionals.
-
🚨 The Authors Guild and fiction authors including George R.R. Martin, John Grisham, and David Baldacci have filed a complaint against OpenAI alleging #copyright infringement 🚨
What are the actions the Authors Guild alleges give rise to the claims?
- Copying of the works
- Training their LLMs on the works that were copied
- The outputs are derivative works
This one is interesting because the Authors Guild comes out swinging against any #FairUse argument, even noting the LLMs could have been trained on public-domain works. But the commercial value was in the plaintiff authors' works. The Authors Guild uses the publicly published information about the datasets OpenAI used to build its LLMs, including the books2 and books3 datasets, and the source materials traced back to Z-Library. They also use Sam Altman's testimony in Congress against him in this complaint, noting "Altman and Defendants have proved unwilling to turn these words into actions."
✍️ Here are the authors currently named: David Baldacci, Mary Bly, Michael Connelly, Sylvia Day, Jonathan Franzen, John Grisham, Elin Hilderbrand, Christina Baker Kline, Maya Shanbhag Lang, Victor LaValle, George R.R. Martin, Jodi Picoult, Douglas Preston, Roxana Robinson, George Saunders, Scott Turow, and Rachel Vail.
Here are the claims brought:
🫣 Direct copyright infringement - copying of the works as part of the datasets
🫣 Vicarious copyright infringement - because the various OpenAI entities had control of and a financial interest in the other OpenAI entities and the alleged infringing activities
🫣 Contributory copyright infringement - same rationale as vicarious, but with the contributory elements of the claim met
#GenerativeAI #ArtificialIntelligence
-
Can Authors Keep Their Work from Being Used to Train AI Without Permission? ✍️📚🤖
If you're a writer, there's a good chance your work has already been absorbed into an AI model—without your knowledge or consent. Books, blogs, fanfiction, forums, articles… All of it has been scraped, indexed, and used to teach machines how to mimic human language. So what can authors actually do to protect their work? Here’s what’s possible (and what isn’t—yet):
🛑 Use “noAI” Clauses in Your Copyright/Terms: Clearly state that your work may not be used for AI training. It won’t stop everyone, but it helps establish legal boundaries—and could matter in future lawsuits.
🔍 Avoid Platforms That Allow AI Scraping: Before publishing, check the terms of service. Some platforms explicitly allow your content to be used for training; others are more protective.
🖋️ Push for Legal Reform: The law hasn’t caught up to generative AI. Supporting copyright advocacy groups and legislation can help tip the scales back toward creators.
🤝 Join Opt-Out Registries: Tools like haveibeentrained.com let creators see if their work was used—and request removal from certain datasets. It's not a perfect fix, but it's a start.
📣 Speak Out: When authors make noise, platforms listen. Just ask the comic book artists, novelists, and journalists who’ve already triggered investigations and lawsuits.
Right now, the balance of power favors the AI companies. But that doesn’t mean authors are powerless. We need visibility. Transparency. Fair compensation. And most of all—respect for the written word. Have you found your writing in an AI training dataset? What did you do?
#AuthorsRights #EthicalAI #AIandWriters #GenerativeAI #Copyright #ResponsibleAI #WritingCommunity #AITrainingData #FairUseOrAbuse
-
The AI art reckoning has arrived. Disney and Universal are suing Midjourney—and this might be the case that transforms generative AI forever.
At the heart of the lawsuit? Copyright. But it’s bigger than that. According to the complaint, Midjourney scraped thousands of copyrighted movie stills to train its model—without permission. Not just any stills, either. We’re talking about iconic visuals from Star Wars, Harry Potter, Jurassic Park, and Frozen. Then it allowed users to generate “knockoff” versions with prompts like Harry Potter as a 1980s anime, or Stormtroopers in a Wes Anderson film.
This isn’t a fringe lawsuit from an indie artist. It’s a direct challenge from the two biggest content empires on Earth. And the implications are massive.
• If Disney wins, it could force all GenAI models to purge copyrighted materials from their training sets
• It may set a precedent that prompts involving IP are themselves a form of infringement
• Every AI model—from OpenAI to Google to Adobe—will be watching closely
Here’s my take: This lawsuit isn’t about stopping AI. It’s about controlling who profits from it. The entertainment industry sat back while AI companies built billion-dollar valuations using their IP. Now they’re coming for their cut. And if they succeed, the entire training ecosystem behind today’s top GenAI tools may need to be rebuilt—legally licensed, tightly controlled, and far more expensive to operate.
That could change everything. Not just for image generation, but for every domain of GenAI—text, video, voice, code, you name it. We’re entering a new chapter in the AI era. Buckle up, folks!
📖 Full article: https://lnkd.in/er-qUThC
-
Investigative journalism remains the domain of humans, and this Atlantic article is a great example of how to do it. With so much talk around the data used to train large language models (LLMs), Alex Reisner explored the mystery of Books3 and its impact on today’s generative AI technology.
“I recently obtained and analyzed a dataset used by Meta to train LLaMA. Its contents more than justify a fundamental aspect of the authors’ allegations: Pirated books are being used as inputs for computer programs that are changing how we read, learn, and communicate. The future promised by AI is written with stolen words.”
“Upwards of 170,000 books, the majority published in the past 20 years, are in LLaMA’s training data. . . . These books are part of a dataset called ‘Books3,’ and its use has not been limited to LLaMA. Books3 was also used to train Bloomberg’s BloombergGPT, EleutherAI’s GPT-J—a popular open-source model—and likely other generative-AI programs now embedded in websites across the internet.”
Reisner interviewed the independent developer of Books3, Shawn Presser, who said he created the dataset to give independent developers “OpenAI-grade training data.” Presser claims he’s sympathetic to authors’ concerns, but he perceives a monopoly on generative AI by the biggest tech companies, “giving them total control of a technology that’s reshaping our culture: He created Books3 in the hope that it would allow any developer to create generative-AI tools.”
The arguments of fair use are addressed from both sides, and Rebecca Tushnet, a law professor at Harvard, states that the law is “unsettled” when it comes to fair-use cases involving unauthorized material, with previous cases giving little indication of how a judge might rule in the future. This story is just beginning. Copyright law is sure to be at the center of generative AI conversations for years to come.
#ai #copyright #technology https://lnkd.in/g7sWmpnm
-
Big news from Hollywood: The Walt Disney Company and NBCUniversal have just filed the first major lawsuit by top studios against a generative AI company, Midjourney, alleging copyright infringement.
For years, the conversation around AI in entertainment has focused on actors and writers fighting to protect their likeness and work. Now, the studios themselves are drawing a line in the sand, teaming up to defend their iconic characters and IP from being used to train AI models and generate lookalike images without permission.
The lawsuit, filed in California, alleges both direct and secondary copyright infringement against Midjourney. The studios claim Midjourney’s tool can churn out images virtually identical to beloved characters from “The Lion King,” “Aladdin,” and NBCU’s “Minions”—and that repeated attempts to resolve the issue privately were ignored. Instead, Midjourney allegedly doubled down, releasing new versions of its service that the studios say produce even higher-quality infringing images.
This case could set a precedent for how copyright law applies to AI-generated content, with the studios seeking not just damages but also an injunction to stop further use and distribution of their copyrighted material. Hollywood isn’t alone here; other industries, like news media, are also launching lawsuits against AI companies over similar concerns.
Last week, Universal Music Group, Warner Music Group, and Sony Music Entertainment began talks to license their work to AI startups Udio and Suno for equity. This deal could resolve lawsuits between the companies, which sued Udio and Suno last year for copyright infringement. Can we expect the same talks to begin with Disney, NBCUniversal, and Midjourney? Stay tuned, this story is just getting started.
#AI #Disney #Universal #MidJourney #IP
-
From Practitioner to Author: 3 Things I Loved… and 3 That Left Me Speechless about Humanity!
Writing “Your AI Survival Guide” was never about becoming an author. It was about turning 20+ years of executive scars, late-night whiteboard wars, and “we-need-it-by-yesterday” pivots into something others could learn from.
What I LOVED 💗 in the process:
1. Democratizing knowledge — I’ve led AI and data initiatives across 9 industries, 5 continents, and countless war rooms. Being able to share how we solved real problems? Invaluable.
2. Codifying what actually worked — From frameworks to models to repeatable patterns, documenting it all helped me see what drove results and what caused failure. It was part therapy, part blueprint.
3. Reflecting on the full arc — Not just the wins, but the detours, disasters, and hard-won lessons. Writing forced me to zoom out, connect dots, and find clarity I didn’t know I needed.
But here’s where things took a turn… What shocked 😳 and disappointed me 💔:
1. People stealing my frameworks (yes, the ones I copyrighted and published) and claiming them as their own. No license. No source. Just copy-paste and rebrand. 🤯
2. Discovering someone converted my book into a PDF, plugged it into a custom GPT, and started selling AI services using my IP. Without a single line of credit or acknowledgment.
3. I’m not a marketer or salesperson. But if you don’t promote your work, no one will. Having to “talk about the book” felt like pulling teeth—but I did it anyway. Because awareness matters, and it was the only way to get it out there.
🎯 So what did I ultimately learn:
✅ Protect your IP. Copyright it. Trademark it. Watermark it if you must. BUT EVEN WITH ALL THAT, there are no guarantees.
✅ Visibility ≠ vanity. It’s how you defend your work and ideas. I’m still learning this, to be honest, but I'm getting over it (slowly).
✅ If you’re not comfortable promoting yourself, promote the value your work delivers.
*** Did you know: 1 in 5 business authors now report IP misuse through AI tool integrations. That’s not just scary—it’s sad. Wonder if in a year or two it’s going to be worth publishing anymore 🤷🏻♀️.
***
World's 1st Chief AI Officer for Enterprise, 10 patents, former Amazon & C-Suite Exec (5x), best-selling author, FORBES “AI Maverick & Visionary of the 21st Century”, Top ‘100 AI Thought Leaders’, helped IBM launch Watson in 2011. My job is not just to develop but to create: I’m outfitting our new digital identity and ensuring the security of our workforce and data in the age of AI! (And yes, that’s a band-aid on the cover. After all, I wrote about how to deploy AI and what mistakes to avoid.)
-
The U.S. Copyright Office has provided essential guidance regarding the registration of works containing material generated by Artificial Intelligence (AI). With more artists thinking about using AI as part of their creative process, this is a critical document not only for music lawyers but also for music managers who are helping their clients navigate the use of AI in music. Here are the key takeaways from the Copyright Office's policy statement (the full paper is attached below for those who are interested):
🎵 Human Authorship Requirement: Works exclusively generated by AI without human involvement do not qualify for copyright protection, as "original works of authorship" must be human-created.
🎵 Significant Human Contribution: AI-generated content that is significantly modified, arranged, or selected by a human artist may be eligible for copyright protection, but only for the human-authored parts of the work.
🎵 AI as a Tool: While AI is acknowledged as a valuable tool in the creative process, using AI does not confer authorship. The extent of creative control a human exercises over the work's output is the key factor in determining copyright eligibility.
🎵 Registration of Works with AI-generated Material: Applicants must disclose the use of AI-generated content in their copyright applications, distinguishing between human-created aspects and AI-generated content.
🎵 Correcting Prior Submissions: If a work containing AI-generated content has already been submitted without appropriate disclosure, it should be corrected to ensure the registration remains valid.
🎵 Consequences of Non-disclosure: Applicants who fail to disclose AI-generated content could face cancellation of their registration, or the registration could be disregarded in court during an infringement action.
🎵 Ongoing Monitoring: The Copyright Office continues to monitor developments in AI and copyright law, indicating the possibility of future guidance and adjustments to the policy.
#musicindustry #musicbusiness #musicpublishing #copyrightlaw
-
BIG AI (Litigation) News: Major decision out of the Northern District of California. AI training on copyrighted material is fair use under the Copyright Act, but obtaining the works by piracy (or from a pirate website) is a separate violation.
1️⃣ Pirated copies used by Anthropic violated copyright, but to the extent that it had purchased copies, that was fair use.
2️⃣ Storing the pirated copies indefinitely was also infringement.
3️⃣ Buying the books after pirating them doesn't make everyone whole, but it might affect the statutory damages.
4️⃣ Fair use includes the transformative use of the works to train AI models.
The end result is that this is probably more beneficial to the authors who brought the claim, based on the statutory damages + class-action potential, but it also sets a path forward for fair use and transformative use for training. (Link to the opinion in comments!) Mark Lemley James Gatto
-
If you're an author or researcher, odds are your work has been used to train LLMs... My work on the microbiome is in the database Meta used to train their AI models. And I'm far from alone.
Meta leveraged the contents of millions of books, research papers, and academic articles to train their Llama models. No permission. No compensation. No acknowledgment. As the recent release of the LibGen database reveals, Meta chose expediency over ethics, piracy over permission. Because going through proper channels was deemed "too slow." "Move fast and break things." Unsurprisingly, Mark Zuckerberg was the first to popularize Silicon Valley's oft-repeated mantra. But now is the time for slow, thoughtful, methodical sustainability...
Our words, research, and ideas aren't just raw materials to be mined like some sort of intellectual coal deposit. They're the product of human thought, reflection, and creativity. They represent years of study, experimentation, and iteration. Why are intellectual property violations such a big deal, you ask?
1. Attribution and context: Research exists within a framework of citations, methodologies, and evolving understanding. AI strips this away, presenting findings as disembodied facts.
2. Academic integrity: Research constitutes specialized knowledge that belongs to its stakeholders, who include investors, builders, and community beneficiaries. Forcibly taking this work for other purposes constitutes theft.
3. Future sustainability: If tech giants can just take whatever they want without giving back, who will fund tomorrow's breakthroughs?
Meta's actions reveal their contempt for creators. They're not "democratizing knowledge"—they're monopolizing it. Building models on stolen work isn't innovation—it's appropriation. This is about who owns the future. Multi-trillion dollar companies are asking, "Why buy the rights to something if you can just steal it?"
What we're asking for isn't radical: transparency about what content is being used, mechanisms to opt out, and fair compensation models. These requirements should be the bare minimum. Have you found your work in The Atlantic's LibGen database?