Copyright Infringement Claims Against Generative AI: The New York Times, Getty, and What Comes Next

In brief. The first merits rulings on AI training and copyright have arrived, and they do not point in one direction. Training a generative model on lawfully acquired books has twice been called transformative fair use—once "spectacularly" so. A non-generative research tool lost on fair use outright because it competed with the source. Acquiring works by piracy produced the largest copyright settlement in American history. And a German court has held that a model's memorization of song lyrics is itself an act of reproduction requiring a license. Three things now decide everything: how a developer acquired its training data, what kind of product the model powers, and whether the model regurgitates what it ingested. This article maps the entire landscape as of mid-2026, walks through the doctrines as courts are really applying them, and tells content owners and developers what to do about it. It is not legal advice; this field changes monthly.

A doctrine built for photocopiers meets a machine that read the internet

In 1984, the Supreme Court decided whether the Betamax video recorder was legal. The technology that frightened Hollywood was a box that could copy The Tonight Show onto a tape. Forty years later, the machines at issue have copied essentially every book, photograph, news article, song lyric, and message-board post that anyone ever put online, distilled them into billions of numerical parameters, and learned to produce passable imitations of all of it on demand. The doctrine we are using to sort out the legality of that—fair use, codified at 17 U.S.C. § 107—was written for librarians, parodists, and people with photocopiers. Watching it stretch to cover a model trained on the collected output of human civilization is one of the more interesting spectacles in modern law.

The emergence of generative artificial intelligence has triggered what may become the most consequential copyright litigation in decades. Newspaper publishers, stock-photography giants, music rightsholders, book authors, and now Hollywood studios have all sued the companies whose large language models and image generators were trained on their works without permission or payment. The dollar figures are enormous, the technology is genuinely novel, and the doctrine is being made in real time. When the dust settles, the outcome will shape both the trajectory of AI development and the economics of every industry that produces copyrightable expression—which is to say, most of them.

At the center of the storm sit a handful of marquee cases—The New York Times v. OpenAI, Getty Images v. Stability AI, and the author class actions against Anthropic and Meta—each with distinct facts but raising overlapping questions. Does copying copyrighted works into a training dataset infringe? Can fair use protect that copying as "transformative"? And what happens when the trained system generates an output that reproduces, or closely resembles, a protected work? When this topic first started moving through the courts, those questions were almost entirely open. They no longer are—at least not entirely.

Between February 2025 and mid-2026, courts on two continents handed down the first substantive rulings, a record settlement moved from preliminary toward final approval, and the U.S. Supreme Court declined to disturb the rule that AI-generated works without human authorship cannot be copyrighted at all. The emerging picture is nuanced rather than binary. This article is meant to be the canonical map of that picture, current as of mid-2026. It explains the theories, walks through the four fair-use factors as courts are actually applying them, distinguishes the input problem from the output problem, surveys the licensing and international dimensions, and tells the people on both sides of the table what to do.

To keep the stakes concrete, meet two players we will follow throughout. Lumen Press is a mid-sized publisher of journalism and trade nonfiction whose archive—decades of reporting and several thousand books—has almost certainly been ingested by every major model. Verdantia AI is a fast-growing model developer that trained its flagship system on a web-scale corpus assembled from purchased e-books, licensed archives, and, in its scrappy early days, a few datasets of, let us say, uncertain provenance. Lumen must decide whether to sue, license, or opt out. Verdantia must decide which parts of its training pipeline are worth defending and which it must quietly rebuild. The cases below are the map both of them are reading—and, as we will see, that map now has very different terrain depending on which of those two questions you are asking. This piece sits within our broader series on AI and the law; for the full landscape beyond copyright, start with our overview of artificial intelligence's key legal issues.

The litigation landscape: the cases that are making the law

The New York Times v. OpenAI and Microsoft

Filed in December 2023 in the Southern District of New York, The New York Times Co. v. Microsoft Corp. remains the highest-profile challenge to AI training practices, and for good reason: the Times is a sophisticated, well-resourced plaintiff with a registered archive, a metered paywall whose economics it can document, and a willingness to litigate to judgment. The complaint alleges that OpenAI and Microsoft used millions of Times articles to train the GPT models that power ChatGPT and Copilot—without a license and without payment.

The pleading advances three distinct theories, and the distinction matters because they fail or succeed independently. First, direct infringement at the input stage: the unauthorized reproduction of articles as they were scraped, stored, and processed into the training corpus. Second, output-based infringement through memorization: ChatGPT, the Times showed, would sometimes reproduce its articles near-verbatim when prompted with the opening lines—effectively walking a user around the paywall. Third, market substitution by synthetic outputs: AI-generated summaries and "answers" that displace demand for the underlying journalism, the way an excellent abstract can make you feel you have already read the article.

In March 2025, Judge Sidney Stein largely denied OpenAI's motion to dismiss, allowing the core infringement claims to proceed while trimming some ancillary theories, and finding that the Times had plausibly alleged both training-based copying and output-based harm. That ruling did not decide who wins; a motion to dismiss only tests whether the complaint states a claim. But it cleared the case for discovery and merits litigation, and discovery in this field is where the bodies are buried. The case has since been folded into a multidistrict proceeding, In re OpenAI, Inc. Copyright Infringement Litigation, in the Southern District of New York, consolidating related actions by the New York Daily News, the Center for Investigative Reporting, author classes, and additional publishers such as Ziff Davis, with summary-judgment practice now underway. The MDL has produced its own headline-grabbing discovery fights, including a preservation order requiring OpenAI to retain ChatGPT conversation logs that the company would ordinarily delete—raising genuinely novel questions about whether litigating an AI copyright case requires turning hundreds of millions of users' private chats into evidence. This MDL is shaping up as the bellwether for news-industry claims and, in particular, for the output-side theory that the California book cases mostly left for another day.

Getty Images v. Stability AI

Getty Images—one of the world's largest licensors of stock photography—has pursued parallel litigation against Stability AI in both the United Kingdom and the United States, alleging that Stability scraped more than 12 million Getty photographs, together with their captions, metadata, and distinctive watermarks, to train the Stable Diffusion image generator.

The UK case reached judgment in the High Court of Justice (Chancery Division) on 4 November 2025—the first major English ruling on AI and copyright—but it landed narrower than either party hoped. Mid-trial, Getty withdrew its primary training and output copyright claims after conceding that Stable Diffusion's training had occurred outside the United Kingdom, which meant the central territorial question simply was not before the court. What remained was a secondary-infringement theory: that the trained model, when made available in the UK, was itself an "infringing copy" being imported. The court rejected it, reasoning that although the model had been exposed to copyrighted works during training, the model's weights do not store those works—they encode statistical relationships, not the pixels of any particular photograph. Getty prevailed only in limited part, on trademark grounds, where Stable Diffusion had reproduced Getty's watermark in generated images.

The decision's most consequential move is doctrinal and quiet: its treatment of model weights as statistical representations rather than stored copies. Hold that thought. As we will see, a German court looking at almost the same technical question reached almost the opposite conclusion within a week—and the gap between them is now the single most important unresolved question in global AI copyright. The U.S. Getty case continues in the District of Delaware, where fair use—a defense the UK does not recognize—will frame everything.

The music cases: Concord v. Anthropic and the GEMA earthquake

Major music publishers led by Concord sued Anthropic in 2023, alleging that its Claude model was trained on copyrighted song lyrics scraped from the web and would reproduce them on request—an output-focused theory distinct from the training-centric newspaper claims. In early 2024 the parties reached a partial agreement under which Anthropic maintains "guardrails" designed to suppress lyric reproduction, while the underlying infringement claims proceed. The case has since generated a revealing defense argument: that outputs elicited by the plaintiffs' own investigators—who prompted Claude specifically to get it to spit out lyrics—cannot ground secondary liability, because the only "direct infringer" in those transactions is the plaintiff's agent. Whether plaintiff-engineered prompts can manufacture the predicate infringement is a question that now recurs across the entire docket.

The lyrics theory then produced the most striking plaintiff victory anywhere in the world—in Germany. On 11 November 2025, the Regional Court of Munich I (Landgericht München I) ruled for GEMA, the German music collecting society, against OpenAI (Case No. 42 O 14139/24), in the first European merits decision on these questions. The court held that the memorization of song lyrics within a model's parameters is itself a reproduction under Section 16 of the German Copyright Act (UrhG)—rejecting OpenAI's argument that the model contains only statistical correlations and stores nothing. The proof was almost embarrassingly simple: trivial prompts caused GPT to reproduce lyrics such as the German hit "Atemlos" nearly verbatim, which the court took as direct evidence that the works were reproducibly contained in the model. Critically, the court held that the EU's text-and-data-mining exception covers the initial analytical phase of assembling and processing training data, but does not cover the persistence of protected works inside the trained model. It ordered injunctive relief, disclosure, and damages. An appeal is expected.

Now return to the thought you were holding. The Munich court and the London court looked at the same artifact—a trained neural network—and reached opposite conclusions about whether it "contains" copies. London said weights are statistics, not copies. Munich said that if a simple prompt reliably extracts the original work, the work is, as a matter of fact, in there. This is not a transatlantic split; it is an intra-European one, and it is the fault line along which global compliance strategy now has to be built.

Thomson Reuters v. ROSS Intelligence

The first federal ruling on fair use in AI training did not involve a chatbot at all, and it went against the AI developer. Thomson Reuters Enterprise Centre GmbH v. ROSS Intelligence Inc. concerned a non-generative legal-research tool. ROSS had used Westlaw headnotes—Thomson Reuters's editorial summaries of court holdings—to build training data for a competing legal-research engine. In February 2025, Judge Stephanos Bibas (sitting by designation in the District of Delaware) granted partial summary judgment for Thomson Reuters, holding that the use was not fair.

Two findings did the work. First, ROSS used the headnotes for "essentially the same purpose" Thomson Reuters created them—facilitating legal research—so the use was not transformative. Second, and decisively, ROSS's product competed directly with Westlaw in the same market, implicating the fourth factor's market-harm concern at its core. Judge Bibas expressly cabined his analysis to "non-generative AI," reserving whether systems that synthesize genuinely new content present stronger transformative-use arguments. The case is on interlocutory appeal to the Third Circuit—the first federal appellate court to take up fair use in AI training—where it remains pending as of this writing, presenting both the copyrightability of headnotes and the fair-use question. Thomson Reuters is the cautionary pole of the doctrine: copy your competitor's material to build a substitute for your competitor, and "AI" in the product name will not save you.

Bartz v. Anthropic and Kadrey v. Meta: the twin June 2025 rulings—and the record settlement

In June 2025, two Northern District of California decisions issued within days of one another and, between them, drew the line the whole field had been waiting for.

In Bartz v. Anthropic, Judge William Alsup held that Anthropic's use of copyrighted books to train Claude was "exceedingly transformative"—"spectacularly so"—analogizing the model's ingestion of text to a human reading widely and learning to write. So far, a clean developer win. But the same order drew a sharp line on acquisition. Anthropic had not merely trained on books; it had downloaded millions of them from pirate "shadow libraries" and retained them in a permanent, general-purpose internal library. That, Judge Alsup held, was not fair use. Training could be transformative; building a pirated library to do it was just piracy with extra steps. The opinion's elegance is in that bifurcation: it decouples the legality of the use from the legality of the source.

In Kadrey v. Meta, decided two days later, Judge Vince Chhabria likewise granted summary judgment for Meta on the training question, finding the use transformative—but on pointedly grudging grounds. He warned that transformativeness does not, by itself, decide fair use, and that a better-developed evidentiary record on market dilution—the prospect of AI-generated works flooding and devaluing the markets for the originals—could defeat the defense in a future case. He all but invited the next set of plaintiffs to come back with the economics built out. Kadrey is a win the way a 6–5 game is a win: the scoreboard says one thing, the film says the defense should be nervous.

The Bartz piracy holding then detonated. Statutory damages under 17 U.S.C. § 504(c) run from $750 to $30,000 per work, and up to $150,000 per work where infringement is willful. Across a class that ultimately encompassed roughly half a million works, even the $750 minimum comes to several hundred million dollars, and awards near the statutory maximums would push the exposure into the tens of billions. Facing that arithmetic, Anthropic agreed in late 2025 to a proposed class settlement of at least $1.5 billion—the largest copyright settlement in U.S. history. The settlement received preliminary approval; class notices went out in November 2025; the opt-out deadline passed in January 2026; claims closed on 30 March 2026 with an extraordinary claims rate exceeding ninety percent of eligible works; and the final fairness hearing was held on 14 May 2026, with distributions to follow approval.
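The statutory-damages arithmetic above can be sketched in a few lines. This is an illustration only: the per-work figures are the § 504(c) numbers quoted in the text, the class size of 500,000 is an assumed rounding of "roughly half a million works," the $1.5 billion figure is the stated settlement floor, and the function names are hypothetical. It also assumes every work draws the same award, which no real judgment would.

```python
# Per-work statutory figures from 17 U.S.C. § 504(c), as quoted in the text.
MIN_PER_WORK = 750
MAX_PER_WORK = 30_000
WILLFUL_MAX_PER_WORK = 150_000

# Assumption: "roughly half a million works" rounded to 500,000.
CLASS_WORKS = 500_000
# "At least $1.5 billion" treated as the settlement floor.
SETTLEMENT_FLOOR = 1_500_000_000


def exposure(works: int, per_work: int) -> int:
    """Total exposure if every work drew the same per-work award."""
    return works * per_work


def exposure_table(works: int) -> dict[str, int]:
    """Exposure at each statutory tier for a class of the given size."""
    return {
        "minimum": exposure(works, MIN_PER_WORK),
        "ordinary maximum": exposure(works, MAX_PER_WORK),
        "willful maximum": exposure(works, WILLFUL_MAX_PER_WORK),
    }


def settlement_per_work(settlement: int, works: int) -> float:
    """Average settlement value per work, before fees and costs."""
    return settlement / works


if __name__ == "__main__":
    for tier, total in exposure_table(CLASS_WORKS).items():
        print(f"{tier:>17}: ${total:,}")
    per_work = settlement_per_work(SETTLEMENT_FLOOR, CLASS_WORKS)
    print(f"settlement per work: ${per_work:,.0f}")
```

Under these assumptions the range runs from $375 million at the statutory minimum to $75 billion at the willful maximum, and the settlement floor works out to about $3,000 per work, which sits well above the minimum and far below either maximum.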

Two features deserve emphasis. First, because the case settled, the fair-use-for-training holding was never tested on appeal—the settlement creates no binding precedent, and a different judge on a different record could see the training question differently. Second, the practical lesson is unmistakable and quantifiable: provenance is everything. For Verdantia AI, the message is that its purchased and licensed corpora may well be defensible, while those early "datasets of uncertain provenance" are a nine-figure liability that no amount of transformative-use brilliance will cure. For a fuller treatment of why piracy is a different and worse animal than ordinary infringement, see what are the consequences of pirating intellectual property.

The studio cases: Disney, Universal, and Warner Bros. v. Midjourney

In June 2025, Disney and Universal filed the first major Hollywood lawsuit against an AI image generator, suing Midjourney over outputs depicting their copyrighted characters; Warner Bros. followed with a parallel action that has since been consolidated. These cases live almost entirely on the output side. The studios do not need to win the abstract training debate; they allege that Midjourney functions as a "virtual vending machine" for infringing images of instantly recognizable characters—your prompt, their Darth Vader. Because character-based claims involve visibly protected expression rather than diffuse statistical learning, they may prove far harder to defend than training claims, and they squarely present the question of what filtering and guardrail obligations a generative service must build. The doctrine here borrows from a deep well of character-copyright law; our piece on intellectual property disputes concerning superheroes traces how courts decide when a character is itself protectable expression. The Midjourney litigation remains early, but it is the case most likely to produce a clean, judge-friendly finding of output infringement.

The authorship bookend: Thaler v. Perlmutter

One adjacent question closed definitively in 2026. In Thaler v. Perlmutter, the D.C. Circuit affirmed the Copyright Office's refusal to register a work generated autonomously by an AI system, holding that copyright requires human authorship—and on 2 March 2026, the Supreme Court denied certiorari, leaving the human-authorship requirement settled. Thaler concerns who may own AI outputs, not whether training infringes, but it frames the whole field: a machine can neither hold a copyright in what it makes nor, as we explain in our companion piece on AI-generated inventions and our survey of machine inventorship across jurisdictions, be named an inventor on a patent. The point has a sly consequence for our two players: even if Verdantia's outputs escape infringement, those outputs may carry thin or no copyright protection of their own—a fact Lumen Press should remember before it worries too much about being "out-competed" by uncopyrightable machine prose.

The building blocks: infringement elements and theories

Before parsing fair use, it pays to be precise about what a plaintiff actually has to prove, because the AI cases turn on which element is contested.

Direct infringement

To establish direct infringement, a plaintiff must show ownership of a valid copyright and copying of protectable expression—conventionally framed as ownership, copying, and substantial similarity between the works (17 U.S.C. §§ 106, 501). Direct infringement is a strict-liability tort: a defendant's good faith, ignorance, or innocent intent does not excuse it (though intent governs the damages tier). Registration matters too, and procedurally so: U.S. works generally must be registered before suit, and timely registration (before the infringement begins, or within three months after first publication) unlocks statutory damages and attorneys' fees under 17 U.S.C. § 412. Those mechanics, dry as they sound, did real work in Bartz, where the class definition tracked registrability; we walk through them in how to register a copyright with the U.S. Copyright Office.

On the copying element at the input stage, the AI cases are almost not a contest. Training inherently reproduces works—scraping them, writing them to disk, loading them into memory, and processing them to adjust billions of parameters. Developers generally do not deny that copying occurs; they stake everything on fair use. The genuinely hard copying question is at the output stage: when a model generates content, has it copied any particular work? That turns on substantial similarity to the protected expression of a specific source. Where the output closely tracks a known work—ChatGPT reciting a Times paragraph, Stable Diffusion painting in a Getty watermark, Claude reciting a chorus—substantial similarity is readily shown. Where the output merely reflects patterns learned from millions of works without reproducing any one, the link may be too attenuated to count as copying at all. The Munich GEMA court added a third locus of copying, one that U.S. courts have so far resisted: the trained model itself, treated as a vessel in which memorized works persist.
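The difference between an output that closely tracks a known work and one that merely reflects learned patterns can be screened for mechanically. The sketch below is an engineering heuristic, not the legal test for substantial similarity: it counts the longest word-for-word run shared between a source and an output, and the share of the source's words that reappear in order. The function names and sample strings are hypothetical, the text is invented rather than drawn from any real work, and any threshold for flagging an output would be a policy choice that the sources discussed here do not supply.

```python
from difflib import SequenceMatcher


def _tokens(text: str) -> list[str]:
    """Lowercase, whitespace-delimited words (a deliberately crude tokenizer)."""
    return text.lower().split()


def longest_shared_run(source: str, output: str) -> int:
    """Length, in words, of the longest verbatim run common to both texts."""
    a, b = _tokens(source), _tokens(output)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def share_of_source_reproduced(source: str, output: str) -> float:
    """Fraction of the source's words that reappear, in order, in the output."""
    a, b = _tokens(source), _tokens(output)
    if not a:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(a)


if __name__ == "__main__":
    # Invented sample text, standing in for a protected passage and a model output.
    source = "the quick brown fox jumps over the lazy dog near the river bank"
    output = "it said the quick brown fox jumps over the lazy dog and stopped"
    print(longest_shared_run(source, output))
    print(round(share_of_source_reproduced(source, output), 3))
```

A long shared run is evidence of regurgitation worth a closer look. A low score proves little, since paraphrase and character depictions can be substantially similar without sharing a single exact phrase.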

Substantial similarity: the output gatekeeper

Output claims rise and fall on substantial similarity—the same doctrine that governs whether one novel infringes another. Courts ask whether the accused work copies protected expression, filtering out unprotectable ideas, facts, and scènes à faire, and then assessing whether what remains is substantially similar to the plaintiff's expression (often via some version of the "ordinary observer" or "extrinsic/intrinsic" inquiry, depending on the circuit). For a generated paragraph that mirrors a registered article, this is straightforward. For a generated image "in the style of" an artist, it is treacherous, because style is precisely what copyright does not protect—a point we develop in the fine line: copyright protection for style, typefaces, and fonts. The studios' Midjourney claims thread this needle by anchoring on characters, which are protectable expression, rather than on style, which is not.

Derivative-works claims

Some plaintiffs argue that the model, or its outputs, are unauthorized derivative works—works "based upon" preexisting works, like translations or film adaptations—because the model "incorporates" learned representations of the corpus. U.S. courts have generally been skeptical: the derivative-work right (17 U.S.C. § 106(2)) typically requires that the new work incorporate the original's protected expression in some recognizable form, and a model stores statistical patterns, not literal passages. Whether pattern-learning amounts to "incorporation" remains contested—and GEMA's finding that whole lyrics were "reproducibly contained" in GPT's parameters shows that, at least where memorization is provable work-by-work, the factual premise of the skeptics' position can simply fail. The derivative-works theory is strongest precisely where memorization is demonstrable and weakest where the model has truly generalized. (On the flip side, when a human uses AI to build a new work atop an existing one, ordinary derivative-work rules still apply; see copyright registration for derivative works.)

Secondary liability: contributory, inducement, and vicarious

Where end users do the prompting, plaintiffs reach for secondary liability. Contributory infringement requires (1) knowledge of, reason to know of, or willful blindness to a third party's direct infringement, plus (2) a material contribution to it (see Erickson Prods., Inc. v. Kast, 921 F.3d 822, 832 (9th Cir. 2019)). The Times alleges OpenAI knows ChatGPT can be prompted to reproduce Times content and materially contributes by supplying the tool. Inducement ratchets this up where a defendant actively encourages infringing use. Vicarious liability requires the right and ability to supervise the infringing conduct plus a direct financial benefit—doctrine inherited from the Napster and Grokster peer-to-peer wars. AI companies that profit from systems capable of infringing output face exposure under all of these, and their guardrails cut both ways: a filter demonstrates the ability to supervise (helping plaintiffs on vicarious liability) even as it mitigates harm (helping defendants on damages and equities). The recurring wrinkle, prominent in the Anthropic music litigation, is whether outputs elicited by the plaintiffs' own investigative prompts can ground secondary liability at all, since the "direct infringer" in those transactions is the plaintiff's investigator. There is a broader platform-liability dimension here too, including the limits of safe harbors for user-generated infringement, which we treat in Section 230 reform and platform liability for user-generated IP infringement.

The DMCA § 1202 wrinkle: copyright management information

There is a quieter statutory claim that the AI cases keep surfacing, and it can matter a great deal to damages. Section 1202 of the DMCA (17 U.S.C. § 1202) prohibits knowingly removing or altering copyright management information (CMI)—the identifying data attached to a work, such as the title, author, terms of use, and copyright notice—or distributing works knowing the CMI has been stripped, where done to induce, enable, facilitate, or conceal infringement. The Times and Getty both press § 1202 theories: when a scraper ingests an article or a photograph but discards the byline, the rights metadata, and the watermark, and the model later regurgitates the content shorn of that information, plaintiffs argue the CMI was knowingly removed to facilitate downstream infringement. The appeal of § 1202 is partly arithmetic: it carries its own statutory damages of $2,500 to $25,000 per violation, independent of the underlying infringement, so a plaintiff who can prove systematic CMI stripping across a corpus stacks a second damages theory on top of the first. Whether mass training-data ingestion satisfies § 1202's knowledge and "removal" requirements is unsettled and circuit-dependent, but it is a live and underrated front. (For how CMI fits the broader anti-circumvention scheme, including the Chamberlain/MDY split over whether a § 1201 claim needs an infringement nexus, the analysis in our practice materials on DMCA claims is a useful companion.)
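The stacking arithmetic can be sketched as follows. The per-violation range is the § 1202 figure quoted above and the per-work range is the non-willful § 504(c) figure quoted earlier in this article. Everything else is an assumption for illustration: the 10,000-work corpus is hypothetical, one CMI violation is counted per work even though how violations are counted is unsettled, and the function names are invented. Whether both awards are in fact recoverable for the same work is a question this sketch does not answer.

```python
# Per-violation range for § 1202 claims, as quoted in the text.
CMI_MIN, CMI_MAX = 2_500, 25_000
# Per-work, non-willful range under § 504(c), as quoted earlier in the article.
INFRINGEMENT_MIN, INFRINGEMENT_MAX = 750, 30_000


def damages_range(count: int, low: int, high: int) -> tuple[int, int]:
    """Low and high totals for a number of works or violations."""
    return count * low, count * high


def stacked_range(works: int, cmi_violations: int) -> tuple[int, int]:
    """Combined range if a § 1202 award sits on top of infringement damages."""
    inf_low, inf_high = damages_range(works, INFRINGEMENT_MIN, INFRINGEMENT_MAX)
    cmi_low, cmi_high = damages_range(cmi_violations, CMI_MIN, CMI_MAX)
    return inf_low + cmi_low, inf_high + cmi_high


if __name__ == "__main__":
    # Hypothetical corpus: 10,000 works, one CMI violation assumed per work.
    works = violations = 10_000
    low, high = damages_range(violations, CMI_MIN, CMI_MAX)
    print(f"§ 1202 alone: ${low:,} to ${high:,}")
    low, high = stacked_range(works, violations)
    print(f"stacked:      ${low:,} to ${high:,}")
```

On these assumptions the § 1202 floor alone ($25 million) exceeds the infringement floor ($7.5 million) more than threefold, which is part of why the claim matters to plaintiffs even where the underlying infringement theory is contested.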

The fair use defense: applying the four factors as courts actually do

Fair use, codified at 17 U.S.C. § 107, is the load-bearing defense to training-data claims. It is, at bottom, an "equitable rule of reason" (Sony Corp. of Am. v. Universal City Studios, Inc., 464 U.S. 417, 448 (1984)) that asks courts to balance four non-exclusive factors, with no single factor dispositive (Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569, 577 (1994)). In the AI cases, the first and fourth factors do nearly all the real work; the second and third tend to follow along. A crucial threshold reminder: fair use is a creature of U.S. copyright law. It does not travel. A use that is impeccably fair under § 107 enjoys no comparable protection in the UK, whose fair-dealing exceptions are confined to enumerated purposes, or in Germany, which relies on specific statutory exceptions rather than an open-ended defense. That is exactly why Getty's UK case and OpenAI's German case played out on entirely different doctrinal terrain.

Factor one: purpose and character of the use

The first factor asks whether the use is commercial and whether it is transformative—whether it adds new expression, meaning, or message, or serves a different function, purpose, or character than the original (Campbell, 510 U.S. at 579). After Andy Warhol Foundation for the Visual Arts, Inc. v. Goldsmith, 598 U.S. 508 (2023), the inquiry has tightened: the question is whether the secondary use has a genuinely "distinct purpose or different character," assessed alongside commerciality, and a use that merely "adds some new expression" while serving substantially the same purpose as the original—and competing in the same market—is not transformative enough to carry the factor. Warhol is the hinge on which the modern AI cases swing, because it forces courts to ask not "did the defendant change the work?" but "does the defendant's use compete with the same purpose the original serves?"

Developers argue training is transformative because it serves a fundamentally different purpose: a novel is written to be read; training a model extracts statistical patterns from the novel to enable a system that generates. That argument has a strong pedigree. The intermediate-copying line—Sega Enterprises Ltd. v. Accolade, Inc., 977 F.2d 1510 (9th Cir. 1992), and Sony Computer Entertainment, Inc. v. Connectix Corp., 203 F.3d 596 (9th Cir. 2000)—holds that copying an entire work as an intermediate step toward a new, non-infringing product can be fair use precisely because the intermediate copies serve a different function and do not end up in the final product. The book-search line is even closer: in Authors Guild, Inc. v. Google, Inc., 804 F.3d 202 (2d Cir. 2015), and Authors Guild, Inc. v. HathiTrust, 755 F.3d 87 (2d Cir. 2014), the Second Circuit blessed the wholesale digitization of millions of books to build a full-text search index, reasoning that searching a book to find where a phrase appears serves a different purpose than reading the book. The Bartz court leaned hard on exactly this lineage; the Kadrey court accepted it more warily.

Thomson Reuters is the counterexample that proves the rule. ROSS lost factor one because it used Westlaw's headnotes for the same purpose Thomson Reuters created them—legal research—and built a directly substituting product. The emerging synthesis: a genuinely generative system that produces novel content serves a different function from the works it ingested; a tool that repackages the source's function for the same market does not. Commerciality weighs against the developers, but—consistent with Campbell and a long line of secondary-use cases—it rarely decides anything where the use is genuinely transformative. The uncomfortable corollary for Lumen Press is structural: the more general-purpose and generative a defendant's model, the stronger its first-factor position. A narrow tool that competes head-to-head with the plaintiff's product is far more exposed than a sprawling foundation model that does a thousand things, only one of which brushes against any given plaintiff.

Factor two: nature of the copyrighted work

The second factor considers whether the work is creative or factual, published or unpublished. Creative works sit closer to the core of copyright and get stronger protection, and training corpora are stuffed with them—novels, photographs, journalism, music. The Bartz and Kadrey courts both acknowledged that the books at issue were highly expressive works, weighing the factor for the plaintiffs. But factor two is the runt of the litter: courts treat it as rarely decisive standing alone, and its force fades further as the use becomes more transformative. In the AI cases it has been a tiebreaker at most.

Factor three: amount and substantiality of the portion used

The third factor asks how much was taken relative to the purpose—and here AI does something that looks fatal on its face: it copies works whole. Entire books, complete images, full articles, every line of a song. In isolation, taking the entire work cuts hard against fair use. Yet Bartz and Kadrey found the factor favored the developers, because wholesale copying was "reasonable in relation to the transformative purpose"—you cannot teach a model the statistics of language while feeding it only the first chapters. This, again, tracks the precedent: Google Books and HathiTrust both involved copying every word of every book, and both found the third factor satisfied because complete copying was necessary to the transformative search function. Even Thomson Reuters, which ROSS lost overall, scored this factor for ROSS because the copying was intermediate: the headnotes never reached end users. The through-line: quantity is judged against purpose, never in the abstract. "You copied the whole thing" is not the trump card it appears to be when copying the whole thing is what the new purpose requires.

Factor four: effect on the market for the original

The fourth factor—long called the single most important—asks whether the use harms the market for the original or its derivatives. It is where the AI cases are most genuinely contested, because three distinct markets are in play, and the cases have not converged on how to count them.

The first is the market for the original works themselves. Do AI summaries of the news substitute for reading the journalism? Does a chatbot that can discuss a novel suppress demand to buy it? The Times argues ChatGPT's synthesis of its reporting bleeds subscription and advertising revenue; the developers respond that their products serve research and generation, not consumption-substitution. This is the classic, Harper & Row-style market-displacement inquiry, and it is strongest for plaintiffs where memorization lets users extract the actual protected content for free.

The second is the market for licensing training data itself—and here is where the doctrine gets genuinely circular and genuinely interesting. Developers argue that recognizing a "market for training licenses" begs the question: it would guarantee market harm in every case, because any unlicensed copying necessarily forecloses a license the plaintiff could have charged for. Plaintiffs respond that a training-licensing market is demonstrably forming—OpenAI is signing deals; the Bartz settlement put a public number on books—and that unauthorized copying destroys their ability to participate in a market that plainly exists. The courts have split. Bartz held that "the market for [training] data is not one that the [authors] are legally entitled to monopolize," refusing to let the mere theoretical availability of a license decide the factor. Thomson Reuters faulted ROSS for failing to negate a licensing market for Westlaw content, treating the foreclosed license as cognizable harm. Reconciling them is the work the Third Circuit will likely have to do.

The third, and newest, is the market for AI-generated competing products—Judge Chhabria's market-dilution theory in Kadrey. The idea: even if no single AI output substitutes for any single original, a model that can generate unlimited works in a genre may depress the value of every work in that genre by flooding the market. A romance novelist is not undone by any one machine-written romance; she is undone by ten thousand of them, sold for nothing, drowning the shelf. That theory was barely developed on the Kadrey record—Judge Chhabria practically circled it in red and handed it to the next plaintiff—and it is the rightsholders' most promising frontier. It is also the one Lumen's counsel should be building an evidentiary record for now: market surveys, pricing data, and expert economics showing that generative output measurably devalues the archive. Whoever first proves market dilution on a real record may bend the entire doctrine.

A worked hypothetical: running the factors on Verdantia

Suppose (this is hypothetical) Lumen Press sues Verdantia AI in California. Verdantia trained on three buckets: (a) e-books it purchased and licensed; (b) a public-web crawl; and (c) its early "uncertain provenance" datasets, which turn out to include a shadow-library dump of Lumen's catalog. Run the factors. Factor one favors Verdantia for the generative training use itself, post-Warhol, provided the model is general-purpose and not a Lumen-substitute. Factor two leans modestly to Lumen (creative works) but matters little. Factor three favors Verdantia for buckets (a) and (b), where whole-work copying is reasonable in relation to the transformative purpose. But bucket (c) is the Bartz problem: the acquisition by piracy is not redeemed by the transformative use, and it exposes Verdantia to willful-infringement statutory damages on every timely registered Lumen title in the dump (registration being the § 412 precondition to statutory damages). Factor four is the battleground: if Lumen comes armed only with "you could have licensed to us," it may lose under Bartz; if it comes with hard market-dilution economics, it may win the war Judge Chhabria flagged. The upshot mirrors the real cases: Verdantia probably survives on the use and gets crushed on the source, which is to say it should have spent less on lawyers and more on a clean data pipeline.
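To see why bucket (c) dominates the risk picture, it helps to run the statutory-damages arithmetic. The sketch below is illustrative only: it uses the per-work ranges of 17 U.S.C. § 504(c) listed in the authorities at the end of this article, and it assumes a made-up count of 1,200 eligible Lumen titles (the hypothetical gives no number). The function and class names are invented for the example, and the result is a set of outer bounds, not a prediction of what any court would award.

```python
from dataclasses import dataclass

# Per-work statutory ranges under 17 U.S.C. § 504(c), in U.S. dollars.
ORDINARY_MIN = 750
ORDINARY_MAX = 30_000
WILLFUL_MAX = 150_000


@dataclass(frozen=True)
class ExposureRange:
    works: int
    floor: int             # every work at the ordinary minimum
    ordinary_ceiling: int  # every work at the ordinary maximum
    willful_ceiling: int   # every work at the willful maximum


def statutory_exposure(eligible_works: int) -> ExposureRange:
    """Bound statutory damages for a count of eligible works.

    Statutory damages are awarded per work infringed, so the bounds
    scale linearly with the number of works.
    """
    if eligible_works < 0:
        raise ValueError("eligible_works must be non-negative")
    return ExposureRange(
        works=eligible_works,
        floor=eligible_works * ORDINARY_MIN,
        ordinary_ceiling=eligible_works * ORDINARY_MAX,
        willful_ceiling=eligible_works * WILLFUL_MAX,
    )


if __name__ == "__main__":
    # Hypothetical: 1,200 Lumen titles found in the shadow-library dump.
    exposure = statutory_exposure(1_200)
    print(f"floor:            ${exposure.floor:,}")
    print(f"ordinary ceiling: ${exposure.ordinary_ceiling:,}")
    print(f"willful ceiling:  ${exposure.willful_ceiling:,}")
```

On those assumed numbers the range runs from $900,000 to $180 million, which is why a willfulness finding matters far more than the fair-use score on buckets (a) and (b).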

The output problem: when the machine reproduces what it read

Training is the input question. There is an entirely separate output question, and the 2025–2026 rulings make a point the headlines often miss: the input and output questions can come out differently in the same case. A model can be lawful to train and still produce unlawful outputs—and the business that publishes one of those outputs has its own exposure, independent of the developer.

The clearest output scenario is memorization and regurgitation. Models sometimes memorize portions of their training data and reproduce them on demand. The Times documented near-verbatim article passages; Stable Diffusion has reproduced Getty's watermark; and the Munich court found whole German lyrics extractable with trivial prompts. Memorization is the plaintiffs' best output evidence precisely because it collapses the abstraction defense: when the original work comes back out intact, arguments about "mere statistics" lose their force.

The second scenario is style imitation—systems generating content "in the style of" a named creator. Copyright does not protect style as such, so "paint me a sunset like Monet" is not, by itself, infringement. But the line between unprotectable style and protectable expression blurs fast, especially for characters with fixed visual identities, which is exactly why the Midjourney studio cases are dangerous: "in the style of Star Wars" quickly becomes "here is Darth Vader," and Darth Vader is protected expression. (For where this collides with personal identity—voice, face, likeness—see the right of publicity meets digital doubles: deepfakes, AI avatars, and celebrity likeness.)

The third is derivative generation: prompts like "continue this story" or "reimagine this image" can yield outputs that are derivative works of the referenced source, infringing § 106(2) regardless of how the training was characterized.

For all of these, ordinary substantial-similarity analysis governs: does the generated content copy protectable expression from a specific source? If so, infringement can be made out regardless of the training-side fair-use result—a point both California rulings were careful to preserve. The corollary for businesses is sharp and often overlooked: a company that deploys AI-generated content bears its own infringement risk if the output copies someone's protected expression, and "the model made it" is not a defense to publishing it. Developers have responded with technical mitigations—content filters, prompt restrictions, output classifiers, training-data deduplication—whose legal significance is contested but whose presence or absence will plainly weigh in any secondary-liability or willfulness analysis. Verdantia's product team should treat output filtering not as a courtesy feature but as litigation infrastructure: it is the difference between "we built guardrails and they were imperfect" and "we built a vending machine."
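What an output filter does can be sketched in a few lines. The example below is a deliberately crude illustration, not a description of any developer's actual guardrails: it flags a generated passage when a large share of its word five-grams also appear in a protected reference text. The function names, the n-gram length, and the 0.5 threshold are all assumptions chosen for the example, and n-gram overlap is an engineering proxy for verbatim regurgitation, not the legal test for substantial similarity.

```python
import re


def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in the text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(output: str, protected: str, n: int = 5) -> float:
    """Fraction of the output's n-grams that also occur in the protected text."""
    output_grams = word_ngrams(output, n)
    if not output_grams:
        return 0.0  # output shorter than n words: nothing to compare
    shared = output_grams & word_ngrams(protected, n)
    return len(shared) / len(output_grams)


def should_block(output: str, protected: str,
                 n: int = 5, threshold: float = 0.5) -> bool:
    """Flag outputs whose verbatim overlap meets the (assumed) threshold."""
    return overlap_ratio(output, protected, n) >= threshold


if __name__ == "__main__":
    protected = ("the quick brown fox jumps over the lazy dog "
                 "near the quiet river bank")
    copied = "the quick brown fox jumps over the lazy dog"
    fresh = "a fox was seen by the river yesterday afternoon by two hikers"
    print(should_block(copied, protected))  # verbatim excerpt is flagged
    print(should_block(fresh, protected))   # independent sentence passes
```

A production system would compare against an index of many works and would log what it blocked; that log is the documentary record that separates the imperfect-guardrails story from the vending-machine story.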

Industry implications: licensing, opt-outs, and how creators get paid

The emerging licensing market—and its strange feedback loop

Whatever the courts ultimately decide, a parallel development is reshaping the field faster than any opinion: content owners are cutting deals. OpenAI has announced licensing agreements with the Associated Press, Axel Springer, News Corp, Condé Nast, the Financial Times, and many others; rival developers have followed; and the Bartz settlement, though formally a litigation resolution rather than a license, put a public price on books at scale—a per-work recovery in the low thousands of dollars that every future negotiation will anchor to, on both sides of the table.

These deals create a curious feedback loop with the litigation. As more publishers license, the defense argument that "no training-data market exists" erodes—a forming market is a market, and that strengthens plaintiffs on factor four. Yet the same deals let defendants argue that negotiated compensation already works, so litigation is unnecessary. The market also breeds a holdout problem: if major publishers can license while individual authors and photographers cannot realistically negotiate one-by-one with foundation-model labs, a two-tiered system emerges in which only the largest rightsholders are paid. Collective licensing—pooling rights and clearing them through an intermediary, as music does—is the obvious structural answer, and our analysis of music licensing in the streaming era shows both the promise and the bureaucratic complexity of that model. The strategic question for Lumen Press is sequencing: an early license monetizes the archive now but may quietly undercut its damages narrative later (it is awkward to argue a market was destroyed once you have sold into it), while holding out preserves the claims at the cost of present cash. There is no free move; counsel should war-game both before signing anything.

Opt-out mechanisms and technical standards: the European inversion

The European Union offers a structural alternative to American fair-use litigation: a text-and-data-mining (TDM) exception paired with opt-out rights. Under Article 4 of the EU Copyright Directive (Directive (EU) 2019/790), TDM on lawfully accessible content is permitted unless rightsholders have expressly reserved their rights through "machine-readable means." Article 3 provides a separate, narrower exception for scientific research by qualifying research organizations. The EU AI Act (Regulation (EU) 2024/1689) then layers transparency on top: providers of general-purpose AI models must put in place a policy to identify and respect Article 4 opt-outs and must publish a "sufficiently detailed summary" of training content using the Commission's mandatory template, published in July 2025, with the GPAI obligations applicable since August 2025.

This framework inverts the burden. In the U.S., the developer must affirmatively prove fair use. In the EU, the rightsholder must affirmatively opt out, or the mining is presumptively permitted. Critics object that many creators lack the technical sophistication to deploy machine-readable reservations and that compliance is honored unevenly; proponents counter that opt-out at least supplies clear ex ante rules instead of after-the-fact litigation lotteries. The GEMA ruling then added a crucial limit that should worry every developer relying on opt-out as a shield: the Munich court held the TDM exception covers the analytical mining phase but not the persistence of memorized works inside the trained model—meaning opt-out compliance does not immunize regurgitation. You can mine lawfully and still infringe if the model coughs the work back up.

The United Kingdom, meanwhile, consulted in December 2024 on adopting an EU-style TDM exception with opt-out. The proposal drew fierce, organized opposition from the creative industries and remains unresolved—leaving the UK, after the Getty judgment, in an awkward limbo with neither a fair-use doctrine nor a settled TDM regime. Mechanically, opt-outs travel through the same plumbing as scraping governance generally—robots.txt directives, metadata reservations, and emerging provenance protocols—the very terrain we map in data scraping after hiQ v. LinkedIn, where the interplay of copyright, contract, and computer-fraud theories determines what a scraper may lawfully take. And for policing individual infringing outputs already in the wild, rightsholders retain the conventional notice-and-takedown toolkit described in how to file a DMCA takedown notice and respond to one.
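For readers who have never looked at the plumbing, here is roughly what honoring a robots.txt reservation looks like from the crawler's side. This is a minimal sketch using Python's standard-library robots.txt parser; the crawler name "VerdantiaBot" and the lumenpress.example URL are hypothetical, and whether a robots.txt directive counts as a valid Article 4 reservation in any given jurisdiction is a legal question the code does not answer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve: one named AI crawler
# is excluded from the whole site, every other agent is allowed.
ROBOTS_TXT = """\
User-agent: VerdantiaBot
Disallow: /

User-agent: *
Allow: /
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules permit this agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://lumenpress.example/catalog/title-1"
    print(may_crawl(ROBOTS_TXT, "VerdantiaBot", url))  # excluded agent
    print(may_crawl(ROBOTS_TXT, "SomeOtherBot", url))  # falls under the * rule
```

The check costs one function call before each fetch. A developer that skips it cannot plausibly say the signal was hard to read.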

Compensation models for creators

Several models have been floated to bridge AI training practice and creator compensation, each with a characteristic flaw:

Compulsory (statutory) licensing would let developers train under a government-set royalty distributed through collecting societies—elegant in theory, brutal in the rate-setting and distribution mechanics that have bedeviled every compulsory-license regime that already exists.
Collective bargaining has already produced results in adjacent fields: SAG-AFTRA's 2023 agreement addressed AI use of performer likenesses, and author and artist guilds now seek analogous training-rights frameworks. The Authors Guild's central role in the Bartz claims process—driving a claims rate above ninety percent—is a real-world proof of concept for the organizing capacity collective approaches require.
Attribution and transparency systems, built on content provenance and watermarking, would credit creators whose works inform an output even where no payment changes hands.
Revenue sharing runs straight into the deepest problem in the field, the attribution problem: when millions of works contribute diffusely to a model's capabilities, assigning value to any single work is genuinely, perhaps intractably, hard. How much of a given sentence does any one of the ten thousand books that taught the model English deserve credit for?

The $1.5 billion Bartz fund offers one rough benchmark for what "getting it wrong" costs, and the licensing deals offer another for what "getting it right" costs. Both anchors will be cited in every negotiation for years. For a glimpse of how novel rights-and-royalty mechanisms can be engineered—and where they break—see our analysis of NFT marketplaces and secondary-sale royalties.
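The per-work figure behind the first anchor is simple division. The sketch below takes this article's round numbers as given (a $1.5 billion fund and a class of roughly half a million works) and ignores any deductions from the fund, so it shows a gross average rather than what any individual claimant receives. The constant and function names are invented for the illustration.

```python
# Round figures as used in this article; the precise class size may differ.
FUND_USD = 1_500_000_000
CLASS_WORKS = 500_000


def gross_per_work(fund_usd: float, works: int) -> float:
    """Average gross recovery per work, before any deductions."""
    if works <= 0:
        raise ValueError("works must be positive")
    return fund_usd / works


if __name__ == "__main__":
    print(f"${gross_per_work(FUND_USD, CLASS_WORKS):,.0f} per work")
```

On those inputs the average is $3,000 per work, the "low thousands" figure that both sides will bring to the table.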

International perspectives: a genuine three-way split

The EU's approach differs fundamentally from the U.S. framework, and the first European rulings have sharpened, not softened, the divergence. The Copyright Directive's two TDM exceptions—Article 3 for research organizations and Article 4 for general (including commercial) mining subject to opt-out—were widely assumed to cover AI training. That assumption is now contested from both ends. In Kneschke v. LAION, the Hamburg courts held that a nonprofit's assembly of a training dataset benefited from the research exception, with the appellate decision in late 2025 emphasizing that rightsholders must deploy effective machine-readable opt-outs to defeat it. But the Munich court in GEMA held that Article 4's exception does not reach the trained model's memorization of protected works, and several member states have suggested AI training exceeds the TDM exception's intended scope altogether—raising the prospect of fresh EU legislation. Layered over all of this, the AI Act's copyright obligations apply to any provider offering general-purpose models in the EU, regardless of where training physically occurred—an extraterritorial reach that makes EU compliance unavoidable for U.S. developers like Verdantia.

The result, as of mid-2026, is a genuine three-way split:

A U.S. regime where transformative-use doctrine has so far protected training on lawfully acquired works, punished piracy ferociously, and left outputs exposed to ordinary substantial-similarity analysis.
A German (and potentially EU-wide) regime where memorization in the model is itself infringement requiring a license, and where opt-out compliance shields the mining but not the regurgitation.
A UK regime where the central questions remain unresolved because no training occurred onshore and legislative reform is stalled, leaving a vacuum that the Getty trademark sliver did little to fill.

For global developers, the operational rule writes itself: build to the strictest applicable standard, because you cannot geofence a foundation model away from the jurisdiction with the most aggressive rule. For global rightsholders, the corollary is equally clear: forum-shop. A music publisher with a memorization theory now knows exactly which courthouse in which country to walk into first.

Looking ahead: three end-states, and the one the rulings are tracing

Three broad equilibria remain logically possible. The 2025–2026 rulings strongly suggest the third.

Scenario one: fair use prevails broadly for training. The immediate practical change would be modest—developers already train on copyrighted content—but the long-run consequences for creative industries would be profound. Creators would lose the legal leverage to demand compensation; the value of content as training data would collapse toward zero; the nascent licensing market would wither; and investment in original content could decline as its economics erode. Developers might still license for quality, exclusivity, or reputation, but obligation would no longer drive it. This is the rightsholders' nightmare and, after Bartz and Kadrey on training, not an idle one.

Scenario two: training requires authorization across the board. Developers would face massive retrospective liability and a prospective rebuild. Existing models trained on unauthorized content could face damages or injunctions; new training would lean on licenses, public-domain corpora, and synthetic data; licensing would become a major line-item input cost; smaller labs might be priced out, entrenching the best-funded incumbents (an ironic outcome for a regime meant to protect the little guy); and the pace of progress could slow as data acquisition turns slow and expensive. Retroactive enforcement would be genuinely hard given the scale of copying already done and the near-impossibility of computing damages across billions of works—difficulties the Bartz settlement machinery, with its half-million-work class and ninety-percent claims rate, only begins to illustrate.

Scenario three: the middle path—and it is the one the early rulings are actually tracing. Training on lawfully acquired works is treated as transformative; acquisition by piracy is punished severely; outputs that reproduce protected expression remain infringing regardless of the training analysis; opt-out mechanisms gain legal recognition in Europe and contractual recognition everywhere; and compensation flows through a thickening web of licenses, settlements, and collective arrangements rather than through a single doctrinal switch flipped on or off. Different rules for different content types (factual versus creative) and different product types (generative versus substitutive) are already visible in the Thomson Reuters/Bartz divide. The Third Circuit's pending decision in ROSS, the SDNY MDL, the Midjourney character cases, and the German appeal in GEMA will each refine the map. Congressional intervention remains possible but, given the political economy of copyright reform and the speed of the courts, unlikely to outpace the judges.

Practical considerations for stakeholders

For AI developers, the priorities are no longer speculative—they are dictated by the case law. Document training-data provenance rigorously, and segregate or remediate any corpus of doubtful origin; Bartz prices that risk at ten figures, and the fix is cheaper than the settlement by three orders of magnitude. Respect opt-out signals and robots.txt directives, both for EU compliance and for the equities in any future fair-use fight—courts notice who played fair. Implement and document output filters and guardrails; they mitigate the output-side exposure that survives even a training-side win, and their absence is the studios' core allegation against Midjourney. Seek indemnification from third-party data vendors. Weigh licensing deals with major content owners as both legal insurance and competitive moat. And track the litigation actively—the operative law has changed materially every few months for two years running.
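"Document provenance and segregate doubtful corpora" can sound abstract, so here is one way the bookkeeping might look. The sketch is an assumption-laden illustration, not an industry standard: the record fields, category names, and corpus identifiers are invented, and they simply mirror the three buckets in the Verdantia hypothetical (purchased or licensed, public-web crawl, uncertain provenance).

```python
from dataclasses import dataclass
from enum import Enum


class Acquisition(Enum):
    PURCHASED_OR_LICENSED = "purchased_or_licensed"
    PUBLIC_WEB_CRAWL = "public_web_crawl"
    UNCERTAIN = "uncertain_provenance"


@dataclass(frozen=True)
class CorpusRecord:
    corpus_id: str
    acquisition: Acquisition
    source_note: str  # where the data came from and who signed off


def partition_by_provenance(
    records: list[CorpusRecord],
) -> tuple[list[CorpusRecord], list[CorpusRecord]]:
    """Split a manifest into cleared corpora and corpora to quarantine."""
    cleared: list[CorpusRecord] = []
    quarantined: list[CorpusRecord] = []
    for record in records:
        if record.acquisition is Acquisition.UNCERTAIN:
            quarantined.append(record)  # hold out of training pending review
        else:
            cleared.append(record)
    return cleared, quarantined


if __name__ == "__main__":
    manifest = [
        CorpusRecord("ebooks-2024", Acquisition.PURCHASED_OR_LICENSED,
                     "retail purchase plus license agreement on file"),
        CorpusRecord("web-crawl-07", Acquisition.PUBLIC_WEB_CRAWL,
                     "in-house crawler, robots.txt honored"),
        CorpusRecord("legacy-dump-01", Acquisition.UNCERTAIN,
                     "inherited dataset, origin undocumented"),
    ]
    cleared, quarantined = partition_by_provenance(manifest)
    print([r.corpus_id for r in cleared])
    print([r.corpus_id for r in quarantined])
```

The design choice worth noting is that uncertainty is a category of its own. A manifest that forces every corpus into "licensed" or "crawled" hides exactly the datasets that produced the Bartz number.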

For content owners like Lumen Press, the menu runs from licensing to litigation to opt-out, and the options interact rather than sit independently. Evaluate any license against the damages narrative a deal might compromise. Implement machine-readable opt-outs to preserve EU rights now, before you need them. Monitor the major models for reproduction of your content—the plaintiffs' prompting playbook is well developed, as are the defendants' objections to plaintiff-engineered prompts, so build your evidence carefully. Consider collective action through industry groups, whose leverage the Bartz claims process vividly demonstrated. And start documenting the value of your archive as training data—pricing, demand, and dilution evidence—to support both negotiations and the market-dilution factor-four case that the next round of litigation will reward.

For counsel on either side, the job now is tracking a body of law that spans three continents and a dozen procedural postures; advising on the sharply different risk profiles of training-based versus output-based claims; thinking hard about how today's licensing posture will read in tomorrow's litigation; preparing for genuinely novel discovery—model internals, prompt logs, retained user conversations, training-data manifests; and anticipating EU AI Act obligations that bind clients regardless of where their models were trained. It is, in short, a practice area where yesterday's confident memo can be wrong by next quarter.

Frequently asked questions

Is it illegal to train an AI model on copyrighted works? In the United States, it depends, and the dependency is now reasonably well defined. Two federal courts (Bartz and Kadrey) held in 2025 that training a generative model on lawfully acquired copyrighted books was transformative fair use. But Bartz equally held that acquiring the works by piracy is not fair use no matter how transformative the training. And Thomson Reuters held that copying to build a competing, substitutive product is not fair use even before you reach the piracy question. So lawful acquisition plus genuinely generative, non-substitutive use is the safest posture; piracy or head-to-head competition is the dangerous one. None of these rulings is an appellate holding yet (Bartz settled, and ROSS is on appeal), so the law could still shift.

What is the difference between the "input" and "output" problems? The input problem is whether copying works into the training set infringes (largely a fair-use question). The output problem is whether the model's generated content reproduces a specific protected work (a substantial-similarity question). They are independent: a model can be lawful to train and still produce infringing outputs, and the 2025–2026 rulings were careful to preserve output liability even where they blessed the training.

Can the output of a generative AI infringe copyright? Yes. If a generated text, image, or audio copies the protectable expression of a specific source—near-verbatim article passages, a recognizable copyrighted character, whole song lyrics—ordinary substantial-similarity analysis applies and infringement can be found regardless of how the training is characterized. Importantly, the user or business that publishes the output can be liable too; "the AI made it" is not a defense.

Can I copyright something I made with AI? Only the parts a human authored. Per Thaler v. Perlmutter (cert. denied March 2026) and current Copyright Office guidance, a work generated autonomously by AI, with no human creative control, is not copyrightable. Where a human meaningfully selects, arranges, and edits AI output, the human's original contributions may be protected—but the registration must disclaim the purely machine-generated material.

Does the EU treat AI training differently from the U.S.? Fundamentally. The EU has no fair-use doctrine; instead it permits text-and-data mining under Articles 3 and 4 of the Copyright Directive, with the general Article 4 exception available unless the rightsholder opts out by machine-readable means, and the AI Act adds transparency and opt-out-compliance duties for general-purpose model providers offering services in the EU, regardless of where training occurred. The GEMA ruling further held that the TDM exception does not cover a model's memorization of works, so opt-out compliance does not immunize regurgitation.

What was the $1.5 billion Anthropic settlement, and is it binding precedent? It is the proposed class settlement in Bartz v. Anthropic—the largest copyright settlement in U.S. history—resolving claims arising from Anthropic's downloading of pirated books. Because the case settled, the favorable training-is-fair-use holding was never appealed and the settlement creates no binding precedent. Its lasting lesson is practical: data provenance, not doctrine, drove the number.

What is the "market dilution" theory, and why does it matter? It is Judge Chhabria's fourth-factor concern in Kadrey: even if no single AI output substitutes for a single original, a model that can generate unlimited works in a genre may depress the value of every work in that genre. It was undeveloped on the Kadrey record, but it is the rightsholders' most promising frontier—the plaintiff who first proves it on real economic evidence could reshape the fair-use calculus.

Conclusion

The copyright battles over generative AI are a defining moment for technology and the creative economy alike. For a while, the questions—whether training is transformative, whether outputs infringe, how to value training data as a market—genuinely lacked precedent, because the technology was unprecedented. That is no longer quite true. Courts have now held that training a generative model on lawfully acquired books is transformative fair use; that building a pirate library to do it is not; that a non-generative tool competing with its source loses the fair-use fight; that a model's memorization of lyrics is, in Germany, itself an act of reproduction; and that AI-generated works without human authors get no copyright at all. The largest copyright settlement in American history has put a hard number—$1.5 billion—on the cost of getting data provenance wrong.

Yet the deepest questions remain open. The market-dilution theory awaits a fully built record. The studio cases over the world's most recognizable characters are just beginning. The Third Circuit has not yet spoken, the SDNY MDL is mid-stream, and the German judgment is headed for appeal against the backdrop of possible EU legislation. The stakes extend far beyond the named parties: the outcomes will determine whether AI development continues on scraped abundance or pivots to licensed and synthetic data, whether creators share in the value their works contribute, and whether the United States, Europe, or neither writes the global rules.

What is clear is that the era of pure uncertainty is ending, and an equilibrium (judicial, legislative, and contractual all at once) is beginning to form. The Betamax dispute took years to settle into law; the doctrine that emerged (capable of substantial non-infringing uses) governed an entire generation of technology. The doctrine the AI cases are now forging is likely to do the same. For Lumen Press and Verdantia AI, and for the real companies they stand in for, the task for the next several years is to navigate the transition: stay current with case law that changes quarterly, understand the sharply divergent frameworks across jurisdictions, and take seriously both the risks and the genuine opportunities of a technology that is rewriting the economics of expression.

This article is for informational purposes only and does not constitute legal advice. The law in this area is evolving rapidly—several of the rulings discussed here issued within months of publication—and specific situations require individualized analysis by qualified counsel.

Selected authorities

Statutes and directives: 17 U.S.C. §§ 102, 106, 107 (subject matter; exclusive rights; fair use); 17 U.S.C. § 412 (registration as precondition to statutory damages and fees); 17 U.S.C. § 501 (infringement); 17 U.S.C. § 504(c) (statutory damages: $750–$30,000 ordinary; up to $150,000 willful; as low as $200 innocent); 17 U.S.C. §§ 1202–1203 (copyright management information; statutory damages of $2,500–$25,000 per violation under § 1203(c)(3)(B)); EU Directive 2019/790 (CDSM Directive), arts. 3–4 (text and data mining); Regulation (EU) 2024/1689 (EU AI Act) (GPAI copyright-policy and training-data-summary obligations); German Copyright Act (UrhG) §§ 15, 16, 19a, 44b, 60d.

Cases and decisions: The New York Times Co. v. Microsoft Corp. (S.D.N.Y., filed Dec. 2023; motion to dismiss largely denied Mar. 2025); In re OpenAI, Inc. Copyright Infringement Litigation (S.D.N.Y. MDL); Getty Images v. Stability AI [2025] EWHC (Ch) (Eng., judgment 4 Nov. 2025) and parallel U.S. action (D. Del.); Thomson Reuters Enterprise Centre GmbH v. ROSS Intelligence Inc. (D. Del. Feb. 2025), interlocutory appeal pending (3d Cir.); Bartz v. Anthropic PBC, No. 24-cv-05417 (N.D. Cal. June 23, 2025) (training fair use; pirated-library acquisition not fair use), proposed $1.5 billion class settlement (preliminary approval 2025; fairness hearing May 14, 2026); Kadrey v. Meta Platforms, Inc., No. 23-cv-03417 (N.D. Cal. June 25, 2025); Concord Music Group v. Anthropic (M.D. Tenn./N.D. Cal.; 2024 guardrails agreement); GEMA v. OpenAI, LG München I, Case No. 42 O 14139/24 (Nov. 11, 2025); Kneschke v. LAION (LG/OLG Hamburg, 2024–2025); Disney Enterprises v. Midjourney and consolidated Warner Bros. action (C.D. Cal., filed 2025); Thaler v. Perlmutter, No. 23-5233 (D.C. Cir. 2025), cert. denied (U.S. Mar. 2, 2026). Foundational fair-use and copying authority: Andy Warhol Foundation for the Visual Arts, Inc. v. Goldsmith, 598 U.S. 508 (2023); Google LLC v. Oracle America, Inc., 593 U.S. 1 (2021); Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569 (1994); Sony Corp. of Am. v. Universal City Studios, Inc., 464 U.S. 417 (1984); Authors Guild, Inc. v. Google, Inc., 804 F.3d 202 (2d Cir. 2015); Authors Guild, Inc. v. HathiTrust, 755 F.3d 87 (2d Cir. 2014); Sega Enters. Ltd. v. Accolade, Inc., 977 F.2d 1510 (9th Cir. 1992); Sony Computer Entm't, Inc. v. Connectix Corp., 203 F.3d 596 (9th Cir. 2000); Perfect 10, Inc. v. Amazon.com, Inc., 508 F.3d 1146 (9th Cir. 2007); Erickson Prods., Inc. v. Kast, 921 F.3d 822 (9th Cir. 2019).

Agency and secondary materials: U.S. Copyright Office, Copyright and Artificial Intelligence report series (2024–2025); European Commission GPAI training-data summary template (July 2025); UK IPO consultation on copyright and AI (Dec. 2024); Authors Guild and Authors Alliance materials on the Bartz settlement claims process (2025–2026). Case postures change rapidly; confirm current status before relying on any matter described here.