AI Initiative Speaker Series: Legal Risks with GenAI with Professor Mark Lemley

Mark Lemley on AI and Copyright: Training, Outputs, Ownership, and the Coming Collision with Copyright Law

In this talk, Professor Mark Lemley examines the implications of generative AI for copyright law, with a focus on authorship, fair use, and creative ownership. The conversation explores how legal doctrine must evolve in response to machine-generated content and what the rise of AI means for the future of creative work.

Mark Lemley’s talk offers a clear, practical map of where copyright law is colliding with generative AI. The central point is that AI copyright disputes are not one issue, but three: whether training AI systems on copyrighted works is infringement, whether AI outputs can infringe, and who, if anyone, owns AI-generated works.

Lemley begins with the basic mechanics of generative AI. These systems generate new text, images, code, and other content by training on enormous datasets. Because virtually everything on the modern internet is copyrighted, any serious AI model trained on contemporary material inevitably runs into copyright law. A model trained only on public-domain works from before 1929 might avoid many copyright problems, but it would also know far less about the modern world and speak in a very different register.

That reality has produced a wave of litigation. Lemley notes that there are now more than 100 AI copyright lawsuits in the United States, with additional cases abroad. Most early cases have focused on training: whether copying copyrighted works into a dataset or training process is itself copyright infringement.

Training AI Models: The Fair Use Fight

The first major legal question is whether AI training is copyright infringement. Training involves copying massive amounts of data, including books, images, websites, and other works, into intermediate datasets so the model can learn statistical relationships. The plaintiffs in many cases argue that this copying violates copyright.

Lemley explains that the first court decisions on this issue, both from the Northern District of California, have largely favored AI companies. The courts concluded that AI training can be fair use, although one decision did so reluctantly.

The core reason is transformation. The training process does not generally take a copyrighted work and resell it as that same work. Instead, the copyrighted works are used to create a new system: a model that can generate new outputs. In copyright terms, that makes the use more likely to be considered transformative.

The courts also emphasized that training copies are intermediate and internal. The copied works are not normally displayed to the public or sold as outputs. The model itself, at least in the courts’ current framing, is not a database full of exact copies. It is a set of learned weights and relationships.

The third reason is market harm. Courts have been skeptical that training harms the market for the original copyrighted works, especially where there has not historically been an established licensing market for training data.

But Lemley flags a tension that may become increasingly important. The same AI companies arguing that no licensing market exists are also signing licensing deals with publishers, content owners, and other rights holders. That may help them commercially, but it could weaken the fair use argument for later entrants. If major AI companies create a licensing market, future defendants may find it harder to argue that no such market exists.

He also points to a more aggressive theory some courts seem interested in: that AI outputs compete with human authors, and that this market substitution should count against fair use. Lemley is skeptical of that theory. Under current Ninth Circuit precedent, mere competition with authors is not enough. Copyright does not give authors a right to stop all new works that compete with them. But he expects plaintiffs to continue pushing this argument.

Another issue is whether the training data was lawfully acquired. In one case, the court held that training itself was fair use but suggested that acquiring pirated books could create separate liability. Lemley sees this as a potentially important and dangerous distinction, because many training datasets include material scraped from the open web, where it may be impossible to know whether every upstream website had lawful rights to every work it posted.

Is the Model Itself a Copy?

Lemley then turns to a more technically complex issue: even if training is fair use, is the model itself an infringing copy or derivative work?

Early plaintiffs argued that because a model was trained on copyrighted works, the model must contain those works. Courts rejected that argument in its broad form. A derivative work must actually contain protected expression or something substantially similar to it. A model does not simply store every work it was trained on.

Lemley illustrates this with scale. There are billions of images online, representing an enormous volume of data. A model trained on those images may be only a few gigabytes. It cannot possibly contain lossless copies of all of them. In that sense, models may resemble compression systems, but they are not perfect compression systems.
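
The scale argument can be put as back-of-the-envelope arithmetic. The sketch below is only illustrative: the talk gives "billions of images" and "a few gigabytes," and every specific figure here (image count, model size, typical image size) is a hypothetical placeholder chosen to match those orders of magnitude.

```python
def bytes_per_work(model_size_bytes: float, num_training_works: float) -> float:
    """Average storage the model could devote to each training work."""
    return model_size_bytes / num_training_works


# All three figures are hypothetical assumptions, not numbers from the talk.
MODEL_SIZE_BYTES = 4 * 10**9     # "a few gigabytes" of model weights
NUM_IMAGES = 2 * 10**9           # "billions of images" in the training set
TYPICAL_IMAGE_BYTES = 100_000    # roughly 100 KB for one compressed image

budget = bytes_per_work(MODEL_SIZE_BYTES, NUM_IMAGES)
shortfall = TYPICAL_IMAGE_BYTES / budget

print(f"Average budget per image: {budget} bytes")
print(f"A typical image is {shortfall:,.0f}x larger than that budget")
```

Under these assumptions the model has about two bytes of capacity per image on average, which is why it cannot be a lossless archive of its training set. The average says nothing about individual works, though: a model can still spend far more capacity on a small number of heavily repeated items.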

Text is more complicated. Some research suggests that a small percentage of training text may be memorized by models. That does not mean the model contains all of its training data, but it does mean some works, or portions of works, may be encoded in ways that can later be extracted.

Lemley’s own research with Cooper shows that memorization varies dramatically by model and by book. Some books produce little or no verbatim output. Others produce scattered passages. A few, notably Harry Potter in certain models, appear to be heavily memorized. With the right techniques, researchers can prompt certain base models to reconstruct large portions of those works.

This is not random chance. Lemley emphasizes that reproducing Harry Potter by accident would be astronomically unlikely. If a model can reproduce large portions of a book verbatim, something in the model is encoding that work.

But the legal significance is difficult. Models encode works probabilistically. They do not necessarily contain a conventional fixed copy like a PDF or a photocopy. Instead, they contain probability pathways. Given the beginning of a passage, the model may assign very high probability to the next correct token, then the next, and so on. If the probability pathway is strong enough, the model may reproduce the original work.
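
A rough numerical sketch shows why a strong probability pathway differs so sharply from chance. It multiplies per-token probabilities (summing logarithms to avoid underflow) for two cases: a model guessing uniformly at random, and a model that assigns very high probability to each correct next token. The vocabulary size, passage length, and the 0.99 per-token probability are hypothetical values chosen for illustration, not measurements from any model or from Lemley's research.

```python
import math


def sequence_log10_prob(per_token_probs):
    """log10 of the probability of emitting every token in order.

    The probability of a whole passage is the product of the per-token
    probabilities, so its logarithm is the sum of their logarithms.
    """
    return sum(math.log10(p) for p in per_token_probs)


# Hypothetical assumptions for illustration only.
VOCAB_SIZE = 50_000      # number of distinct tokens the model can emit
PASSAGE_TOKENS = 50      # length of the passage being reproduced
MEMORIZED_PROB = 0.99    # probability given to each correct next token

chance = sequence_log10_prob([1 / VOCAB_SIZE] * PASSAGE_TOKENS)
memorized = sequence_log10_prob([MEMORIZED_PROB] * PASSAGE_TOKENS)

print(f"Uniform guessing: about 10^{chance:.0f}")
print(f"Strong pathway:   probability {10 ** memorized:.2f}")
```

With these placeholder numbers, uniform guessing reproduces the 50-token passage with probability on the order of 10^-235, while the strong pathway reproduces it about 60 percent of the time. The gap widens further as the passage gets longer, which is the intuition behind treating long verbatim reproduction as evidence that the work is encoded in the model.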

Copyright law has not had to deal with this concept of a “probabilistic copy.” Lemley expects future lawsuits to argue that the model itself is a copy, and that distributing the model is therefore infringement. This could be especially important for open-weight models. A closed model like OpenAI’s may exist on a limited number of internal servers. But an open model like Llama may be distributed widely. If the model is legally treated as a copy of a copyrighted book, every distribution of the model could become a separate act of infringement.

Lemley is uneasy with that result. It could make open models more legally vulnerable than closed models, even though open models may be socially and competitively valuable.

AI Outputs: Usually New, Sometimes Infringing

The second major issue is whether AI-generated outputs infringe copyright.

Lemley’s starting point is that most AI outputs are not substantially similar to any particular training work. If a user asks for penguins at a cocktail party, the model does not usually retrieve and display an existing copyrighted image of penguins. It generates a new image.

He uses the Andersen v. Stability AI litigation as an example. When prompted to produce a Sarah Andersen-style cartoon, the model generated something crude and visually unlike the actual cartoon shown. It may have been influenced by the prompt, but it was not a copy.

Still, there are important exceptions.

The first is memorization. If a work appears many times in the training data, the model may reproduce something very close to it. This is especially likely where the same image, article, or passage appears repeatedly online. The model may not be “copying” a single source so much as reconstructing an output from many near-identical sources.

The second exception is adversarial prompting. Plaintiffs’ lawyers can work very hard to generate outputs that look infringing. Lemley discusses Getty Images v. Stability AI, where prompts were engineered to generate an image resembling a Getty photograph, including a blurry Getty-style watermark. Even there, he suggests the output may not be substantially similar enough for copyright infringement, though the watermark raises trademark issues.

The New York Times litigation against OpenAI presents a sharper example. The Times showed outputs that were nearly identical to Times articles. But Lemley notes that the “minimal prompting” described in the complaint involved giving the model the article date, title, and the first several paragraphs. That is strong evidence of memorization, but it may not reflect ordinary user behavior.

He raises a practical concern: courts may be shown outputs generated by lawyers trying very hard to create infringement, not outputs typical users would encounter. That distinction matters when deciding whether the model is a general-purpose tool with some edge-case risks or an infringement machine.

Style, Characters, RAG, and Code

Lemley then identifies several categories where output liability may be more serious.

First, “in the style of” prompts. Users can ask for art in the style of a living artist, or use filters that mimic recognizable aesthetics, such as Studio Ghibli-style images. Current copyright law generally does not protect style. A painter’s general aesthetic, technique, or feel is not copyrightable expression. But Lemley suggests this may become a pressure point. If AI systems can cheaply generate substitutes for a living artist’s commissions by copying that artist’s recognizable style, courts or lawmakers may be tempted to rethink the doctrine.

Second, copyrighted characters are a major vulnerability for AI companies. Models are very good at generating Darth Vader, Baby Yoda, Mario, and other protected characters. Even if a model refuses a direct prompt for Mario, a user may get similar results by asking for a cartoon Italian plumber. The model understands these characters as concepts in the world, much like it understands kittens or coffee cups. The difference is that Darth Vader is copyrighted and kittens are not.

Third, retrieval-augmented generation is more likely to raise copyright issues. RAG systems intentionally retrieve and use text from specific sources. That may be extremely useful, but it also looks more like copying from a particular work and may require licensing or careful permissions.

Fourth, Lemley flags AI-generated code as an emerging problem. Tools like Claude Code may generate functions or chunks of code that resemble existing proprietary or open-source code. If the source is proprietary, that creates a copyright risk. If the source is open source, it may create license-compliance problems. Lemley does not offer final conclusions here, but he sees it as an important area for future litigation and research.

Who Owns AI-Generated Work?

The third major issue is ownership. Under current U.S. copyright law, only humans can be authors. Corporations can own copyrights through doctrines like work made for hire, but copyright still depends on human authorship somewhere in the chain.

Lemley traces this principle through the famous “monkey selfie” case involving Naruto, a macaque who took a photograph with a human’s camera. The Ninth Circuit held that monkeys cannot own copyrights. That rule has now been applied to AI: works created by AI alone are not copyrightable.

The Copyright Office and the D.C. Circuit have taken the position that AI-generated works are not protected, even if they would clearly be original enough for copyright if created by a human.

Human contributions can still be protected. If a person writes text and uses AI to generate accompanying images, the person may own the text but not the images. If a person modifies an AI image in Photoshop, those specific human modifications may be protected. But the AI-generated portions remain outside copyright.

Lemley emphasizes that prompt engineering has not, so far, changed this result. Even if a user tries hundreds of prompts to get exactly the desired image, the Copyright Office has said the resulting image is not copyrightable if the expressive output came from AI.

This creates a major conceptual shift. Historically, copyright protected expression but not ideas. That worked because expression was hard to create. A person might have the idea for a painting, but the protectable value came from the human labor and choices involved in making the painting. AI changes that. It makes expression cheap and easy, but under current law, much of that expression may be unprotectable.

That has practical consequences. If a video game company uses AI to generate all of its background art, can a competitor copy those backgrounds because no human authored them? Under current doctrine, maybe. Lemley expects this to create pressure for legal change as AI-generated content becomes more commercially valuable.

He compares the issue to photography. In the nineteenth century, courts had to decide whether photographs were copyrightable, given that a machine made the image. The Supreme Court ultimately said yes, emphasizing the human choices involved in lighting, posing, framing, and exposure. Over time, the law stopped focusing on those distinctions and simply treated photographs as copyrightable. Lemley suggests AI may follow a similar path: the law may eventually decide to protect AI-assisted works, even if the human role is thinner than traditional authorship doctrine would require.

The Big Takeaways

Lemley’s overall forecast is that training lawsuits pose a major threat to AI companies, but that the plaintiffs are likely to lose under current U.S. fair use law. Some licensing markets will emerge, especially in concentrated industries like music, but broad licensing for the entire internet is much harder to imagine.

Output cases will exist, but probably fewer than many expect. Most AI outputs are not copies of specific works. The harder cases will involve memorized works, copyrighted characters, paywall circumvention, RAG systems, and code generation.

Ownership may be the most disruptive issue in the long run. If AI-generated expression is not copyrightable, copyright protection could shrink dramatically as more creative and commercial work is produced with AI. Some people may welcome that. But industries that depend on exclusive rights will likely push for legislative or doctrinal changes.

The Q&A: What Needs to Be Resolved

In the discussion after the talk, Lemley identifies several unresolved questions.

First, we need better technical understanding of memorization. Even model designers often do not fully understand why models memorize certain works and not others. Legal doctrine will depend heavily on technical facts about when models can reproduce protected expression.

Second, courts must decide whether a model itself can be a copy. Lemley’s tentative view is that the answer should usually be no, unless the model can deterministically and repeatedly generate a specific work, as in the strongest Harry Potter examples.

Third, courts must reject or clarify “market dilution” theories. Lemley is concerned about arguments that AI is infringing merely because its outputs compete with human creators. That theory could give copyright owners control over a wide range of new AI-generated works that do not actually copy their expression.

Fourth, policymakers may be tempted by compulsory licensing. Lemley thinks this may work in industries like music, where ownership is concentrated and licensing infrastructure already exists. But he doubts it can work for the open internet. If a model trains on billions of images, even a large royalty pool may translate into tiny payments per creator.
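
The royalty-pool concern is simple division. The sketch below uses hypothetical figures (a one-billion-dollar annual pool and five billion training images); neither number comes from the talk, which says only that a model may train on billions of images.

```python
def payment_per_work(royalty_pool_dollars: float, num_works: float) -> float:
    """Even split of a royalty pool across every work in the training set."""
    return royalty_pool_dollars / num_works


# Hypothetical assumptions for illustration only.
ROYALTY_POOL_DOLLARS = 1_000_000_000   # a $1 billion compulsory-license pool
NUM_WORKS = 5_000_000_000              # five billion training images

per_work = payment_per_work(ROYALTY_POOL_DOLLARS, NUM_WORKS)
print(f"Payment per image: ${per_work:.2f}")
```

On these assumptions each image earns about twenty cents, before any administrative costs. A creator with many works in the dataset would receive more, but the per-work figure shows why a pool that sounds large in aggregate may not amount to meaningful compensation.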

The Q&A also explores the relationship between memorization and output. Lemley compares models to human memory: people can memorize copyrighted songs, books, or movie scenes, but the law does not treat memory itself as infringement. The legal problem arises when protected expression is reproduced. That suggests the law should focus less on what is “inside” a model and more on what the model generates.

Another major theme is the risk of locking down the open internet. If high-quality publishers block crawling or put everything behind paywalls, AI systems may be trained disproportionately on propaganda, low-quality sources, or freely available but less reliable material. Lemley discusses possible responses, including more nuanced robots.txt rules, licensing for high-value sources like news, and legal protection for scraping publicly available material.
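
As a rough illustration of what more nuanced robots.txt rules could look like, the sketch below uses Python's standard `urllib.robotparser` module to evaluate a policy that admits ordinary crawlers everywhere but limits a training crawler to one section of a site. The crawler names `ExampleTrainingBot` and `ExampleSearchBot`, the domain `news.example`, and the paths are all hypothetical. Note that robots.txt is a voluntary convention: it states the publisher's preference, and compliance depends on the crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: the training crawler may read /public/ only,
# while every other crawler may read the whole site.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Allow: /public/
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_crawl(agent: str, path: str) -> bool:
    """Return True if the policy lets `agent` fetch `path`."""
    return parser.can_fetch(agent, f"https://news.example{path}")


for agent, path in [
    ("ExampleTrainingBot", "/public/press-release"),
    ("ExampleTrainingBot", "/archive/2024/story"),
    ("ExampleSearchBot", "/archive/2024/story"),
]:
    print(f"{agent} -> {path}: {may_crawl(agent, path)}")
```

The design choice here is to distinguish crawlers by purpose rather than blocking all automated access, which is one way a publisher could stay visible to search while withholding its archive from training.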

On international law, Lemley notes that the U.S. fair use doctrine is unusual. Many countries do not have a broad equivalent. Some jurisdictions, including Japan, Singapore, Israel, and the EU, have text-and-data-mining exceptions, but many were drafted before generative AI and may not clearly apply to today’s systems. Europe’s opt-out approach under its AI regulation may create a very different environment from that of the U.S. The global question is whose law applies when a model is trained in one country and used worldwide.

Finally, Lemley addresses licenses drafted before AI. Broad licenses covering all technologies now known or later developed may well cover AI uses, including feeding works into models or modifying them with AI. But the answer will depend on the specific license language. After courts held that older licenses did not necessarily cover internet uses, many contracts were rewritten to be extremely broad. Those broad grants may now become important in AI disputes.

Lemley closes with a broader point: copyright law is being forced to answer questions it was not designed to answer. Is training a form of copying or learning? Is a model a copy, a tool, or something in between? Is AI output authored, ownerless, or owned by the person who prompted it? The current legal system has partial answers, but many of the most important questions remain unresolved.

For legal professionals, the practical lesson is that AI copyright risk is not one-dimensional. Training, outputs, source data, licensing, characters, style, RAG, code, and ownership all raise distinct issues. Some risks can be reduced through safeguards, licensing, provenance controls, and prompt/output monitoring. But the deeper question is doctrinal: whether copyright law will adapt to AI by narrowing, expanding, or fundamentally rethinking the boundaries of authorship and copying.