The Same Click, Three Different Laws
Imagine a company we will call Atlas Analytics. Atlas writes software that visits public web pages and collects the information it finds there. It does not pick a lock, guess a password, or break through any wall; it reads pages that any person with a browser could read, only faster and at scale. And yet, depending on what it collects, whose site it visits, how it gets in, and what it does with the data afterward, Atlas may be doing something perfectly lawful—or it may be breaching a contract, committing a federal computer crime, or infringing copyright on a massive scale. The maddening feature of web-scraping law is that the same physical act—an automated program downloading a public page—can fall under three different bodies of law that do not agree with one another, were written in different decades for different problems, and were never reconciled.
This is no longer a niche problem for a handful of analytics shops. The hunger of artificial-intelligence systems for training data has transformed scraping from an obscure technical practice into a frontline legal battleground, because generative models need staggering quantities of text, images, code, and other content, and much of it is harvested from the open internet. Two audiences with opposing interests now need to understand the rules with precision. Data collectors—AI developers, analytics firms, competitive-intelligence shops, academic researchers—need to know how to gather data without courting ruinous liability. Website operators and content creators need to know which legal tools actually protect their content from unwanted harvesting and which are paper tigers. This article maps the terrain for both, tracing the landmark cases that have begun to fix the boundaries and translating them into practical guidance. We will keep returning to Atlas to see how each doctrine applies to a single, concrete collection decision. For the closely related question of how copyright fair use is shaking out in AI training specifically, our analysis of copyright infringement claims against generative AI is the natural companion, and our overview of artificial intelligence key legal issues sets the broader context.
A word on what makes this area genuinely hard rather than merely complicated. Most legal questions resolve to a single doctrine: was the contract breached, was the patent infringed, was the statute violated. Scraping resolves to three at once, and the doctrines are independent. A scraper can comply with two of them and still be liable under the third. Worse, the three doctrines reward opposite behaviors. The computer-fraud analysis pushes a scraper toward public, logged-out collection; the copyright analysis sometimes punishes the very wholesale copying that public collection invites; and the contract analysis turns on whether the scraper ever clicked "I agree," which can flip the safest computer-fraud posture into a clear breach. Understanding the field means holding all three in your head simultaneously and seeing which fact lights up which doctrine.
The CFAA: From Hacking Statute to Scraping Battleground
The Computer Fraud and Abuse Act, 18 U.S.C. § 1030, enacted in 1986 and amended many times since, was written to criminalize hacking—intrusions into protected computer systems. Congress had in mind the WarGames image of a teenager dialing into a military mainframe, not a Python script downloading public résumés. The statute imposes criminal and civil liability on anyone who "intentionally accesses a computer without authorization or exceeds authorized access" and thereby obtains information from a protected computer (18 U.S.C. § 1030(a)(2)(C)). The reach is breathtakingly broad on its face: a "protected computer" includes any computer "used in or affecting interstate or foreign commerce," which courts have long read to mean essentially any computer connected to the internet (see United States v. Nosal, 676 F.3d 854, 859 (9th Cir. 2012)). Courts have applied the term "computer" to websites, cell phones, restricted databases, and even videogame consoles. If a server is on the internet, the CFAA potentially reaches it.
For decades, courts struggled to apply the statute's language to conduct far removed from a hacker cracking a secured server, and the trouble centered on a single phrase. The statute defines "exceeds authorized access" as accessing a computer "with authorization" but then using that access to obtain information the accesser "is not entitled so to obtain" (18 U.S.C. § 1030(e)(6)). Does that reach a user who has general permission to use a system but violates its terms of service—or only a user who reaches into areas technically off-limits? As one influential opinion put it, the line between "without authorization" and "exceeds authorized access" is "paper thin" (Int'l Airport Ctrs., L.L.C. v. Citrin, 440 F.3d 418, 420 (7th Cir. 2006)).
The circuits split, and the stakes were enormous. Under the broad reading—adopted in various forms by the First, Fifth, and Eleventh Circuits—violating any computer-use policy could be a federal crime, so that using a work computer for personal email against an employer's policy might theoretically be criminal, and a website could convert any terms-of-service ban into a federal-law violation simply by writing it down. Under the narrow reading—favored by the Second, Fourth, Sixth, and Ninth Circuits—the CFAA reached the circumvention of technological barriers like passwords, not mere violations of policies or terms (see WEC Carolina Energy Sols. LLC v. Miller, 687 F.3d 199, 205–06 (4th Cir. 2012); Royal Truck & Trailer Sales & Serv., Inc. v. Kraft, 974 F.3d 758 (6th Cir. 2020)). For scraping, the difference was existential. If the broad reading prevailed, any website could turn off competitive intelligence, comparison shopping, academic research, and journalism with a single line of boilerplate. If the narrow reading prevailed, the open internet stayed open.
One feature makes the CFAA particularly fearsome as a scraping weapon, and the secondary literature emphasizes it: the price of admission to a civil suit is low. Civil plaintiffs must show at least $5,000 in loss within a one-year period to bring a claim (18 U.S.C. § 1030(c)(4)(A)(i)(I); § 1030(g)). That sounds like a meaningful hurdle, but in practice it is easily cleared. Courts have allowed plaintiffs to count the costs of a forensic investigation, a damage assessment, security enhancements, and even the value of employee time spent investigating an intrusion, and have allowed those costs to count even when the investigation ultimately reveals no physical damage at all (see EF Cultural Travel BV v. Explorica, Inc., 274 F.3d 577, 584 (1st Cir. 2001)). A target with good lawyers and a forensics vendor can almost always assemble $5,000. The threshold rarely saves a scraper; the authorization question is where cases are won and lost.
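The threshold arithmetic can be sketched in a few lines. This is a minimal illustration under stated assumptions: the cost entries, dates, and dollar figures are hypothetical and not drawn from any case, and the 365-day window is a simplification of the statute's "1-year period." Whether a given cost legally counts as "loss" is a question the code does not answer.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Statutory floor for a civil claim, 18 U.S.C. § 1030(c)(4)(A)(i)(I).
LOSS_THRESHOLD_USD = 5_000


@dataclass(frozen=True)
class LossItem:
    """One claimed cost (hypothetical structure, for illustration only)."""
    description: str
    incurred_on: date
    amount_usd: int


def loss_in_one_year(items, window_start):
    """Sum the items incurred in the 365 days beginning at window_start."""
    window_end = window_start + timedelta(days=365)
    return sum(
        item.amount_usd
        for item in items
        if window_start <= item.incurred_on < window_end
    )


def clears_threshold(items, window_start):
    """True if the aggregated loss in the window reaches the $5,000 floor."""
    return loss_in_one_year(items, window_start) >= LOSS_THRESHOLD_USD


# Hypothetical figures: the kinds of cost courts have allowed plaintiffs to count.
claimed = [
    LossItem("forensic investigation", date(2024, 1, 15), 3_200),
    LossItem("employee time spent investigating", date(2024, 2, 2), 1_400),
    LossItem("security enhancements", date(2024, 3, 10), 900),
    LossItem("later consulting invoice", date(2025, 3, 1), 2_000),  # outside the window
]

total = loss_in_one_year(claimed, date(2024, 1, 1))
print(total, clears_threshold(claimed, date(2024, 1, 1)))  # 5500 True
```

Three modest line items are enough to cross the floor in this sketch, which is the practical point: the threshold filters out very little.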
Van Buren: the Supreme Court Picks the Narrow Reading
The Supreme Court resolved the split in Van Buren v. United States, 593 U.S. 374 (2021). A Georgia police officer had accepted money to run a license-plate search in a law-enforcement database he was authorized to use for official purposes, and he was prosecuted on the theory that searching for an improper purpose "exceeded authorized access." In a 6–3 opinion by Justice Barrett, the Court rejected that theory. A person "exceeds authorized access," the Court held, only by obtaining information located in areas of a computer that are off-limits to them—files, folders, or databases their access does not extend to. An improper motive for obtaining information one is otherwise allowed to obtain is irrelevant.
The Court captured this in a now-famous "gates-up-or-down" metaphor: liability turns on whether there was a technological gate blocking access that the defendant went around, not on the defendant's reasons for walking through an open door. The majority worried, pointedly, about the alternative: under the government's reading, "millions of otherwise law-abiding citizens" would be criminals for checking sports scores or sending personal email at work in violation of a policy. The Court refused to let the CFAA "criminalize everything from embellishing an online-dating profile to using a pseudonym on Facebook."
For scraping, the implication was immediate. If CFAA liability depends on a technological barrier rather than a policy violation, then scraping a publicly accessible website—one with no gate at all—would seem to fall outside the statute. Van Buren addressed only the "exceeds authorized access" clause, technically leaving the separate "without authorization" clause and its application to public sites for another day. But the Court's reasoning—gates, not motives—mapped directly onto the question that had already been litigated for years in the case that defines this whole area.
hiQ v. LinkedIn: Six Years, and a Paradox
hiQ Labs scraped data from public LinkedIn profiles—information members had chosen to make visible to anyone, logged in or not—and used it to build workforce-analytics products predicting which employees might quit. In May 2017, LinkedIn sent a cease-and-desist letter asserting violations of the CFAA, state law, and its User Agreement, and deployed technical measures to block hiQ's bots. Rather than comply, hiQ went on the offensive, suing for a declaration that its scraping was lawful and an injunction against the blocking. The district court granted the injunction in August 2017, and the Ninth Circuit affirmed in September 2019 in a comprehensive opinion holding that scraping publicly accessible data generally does not violate the CFAA's "without authorization" clause (hiQ Labs, Inc. v. LinkedIn Corp., 938 F.3d 985 (9th Cir. 2019)). The logic was clean: the CFAA was designed to prevent digital break-ins, and where data is open to the public with no authorization required to view it, there is no authorization requirement to violate. The court drew the now-central line between "closed" systems requiring authentication and "open" systems available to anyone. As the court memorably put it, the CFAA is about breaking into a house, not walking through a front door someone left wide open.
LinkedIn sought Supreme Court review, and in June 2021—just days after Van Buren came down—the Court granted certiorari, vacated, and remanded for reconsideration in light of the new decision. On remand in April 2022, the Ninth Circuit reaffirmed, finding that Van Buren reinforced rather than undermined its reasoning (hiQ Labs, Inc. v. LinkedIn Corp., 31 F.4th 1180 (9th Cir. 2022)). The "gates-up-or-down" framework presupposes gates; where none exist, the CFAA does not apply. For Atlas, this is the foundational rule of the field: scraping genuinely public pages, without circumventing any authentication, generally does not trigger the CFAA.
But the story did not end in victory for scrapers, and the ending is the most important lesson of all—the one that practitioners who only skim the headlines miss. The Ninth Circuit had addressed only the CFAA claim, and only at the preliminary-injunction stage; LinkedIn pressed on with breach of contract and common-law theories that the appellate ruling never touched. In August 2022 the district court dissolved the injunction after hiQ had effectively ceased operations, and in November 2022 it granted summary judgment that proved devastating. hiQ, the court found, had agreed to LinkedIn's User Agreement by creating accounts, and the User Agreement's bans on automated collection and on fake profiles were enforceable contract terms. Even though the CFAA did not reach hiQ's conduct, hiQ could still be liable for breach of contract for violating terms it had accepted—and its use of fake accounts ("turkers" hired to create profiles that fed the scraping) to access data only strengthened LinkedIn's hand. On December 6, 2022, the parties settled with a stipulated permanent injunction requiring hiQ to stop scraping LinkedIn and delete everything derived from scraped data, plus $500,000 in damages and a stipulation to liability—though, as a private settlement, that stipulation binds only hiQ and is not precedent for anyone else.
The result is a genuine paradox that frames everything else in this article. The Ninth Circuit's CFAA rulings remain good law, so scraping public data is not a federal crime—yet hiQ ultimately lost, because contract law gave LinkedIn an enforcement tool the CFAA could not. The case set both a floor and a ceiling. Scraping public data is not federally criminal, but it may still breach an enforceable contract, trespass on a server, or infringe a copyright. The lesson the technology press took from hiQ—"scraping public data is legal!"—is half right and dangerously incomplete. The lesson the lawyers took is more useful: the CFAA is often the weakest weapon a website has, which is exactly why the fight has moved to contract and copyright.
It is worth noting two related CFAA cases the hiQ court had to reckon with, because they mark the outer edge of the "open door" rule and show how a cease-and-desist letter can sometimes matter. In Facebook, Inc. v. Power Ventures, Inc., 844 F.3d 1058 (9th Cir. 2016), the Ninth Circuit held that Facebook had revoked the defendant's authorization by sending a cease-and-desist letter and deploying IP blocks, so continued access became "without authorization." Similarly, in Craigslist Inc. v. 3Taps Inc., 964 F. Supp. 2d 1178 (N.D. Cal. 2013), a court found that a scraper who kept accessing a public site after receiving a cease-and-desist and IP block acted "without authorization." The hiQ court distinguished these by emphasizing that they involved sites or systems with some authentication overlay or individualized revocation, not purely public pages—but the tension is real, and a careful scraper treats a cease-and-desist plus IP block as a meaningfully higher-risk signal than terms of service alone. The district court in Ticketmaster L.L.C. v. Prestige Entertainment, Inc., 2018 WL 654410 (C.D. Cal. Jan. 31, 2018), pulled the other way, dismissing a CFAA claim where the cease-and-desist letter merely scolded the defendant for violating the terms of use without expressly revoking access. The drafting of the letter, in other words, can carry legal weight.
The Contract Cases: Who Is Actually Bound?
If contract is the live theory—and after hiQ it plainly is—the decisive question becomes whether the scraper ever agreed to anything. A contract requires offer, acceptance, and consideration; it requires assent. And a pair of cases involving the same defendant sharply narrowed website operators' ability to bind anonymous scrapers who never assented to anything.
In Meta Platforms, Inc. v. Bright Data Ltd., No. 23-cv-00077 (N.D. Cal. Jan. 2024), Meta sued an Israeli scraping company that collected and sold data from public portions of Facebook and Instagram, relying on breach of its terms of service. Judge Edward Chen—who had also presided over hiQ, and so knew the terrain intimately—granted summary judgment for Bright Data, holding that Meta's terms did not prohibit "logged-off" scraping of public data. The terms governed "your use" and restricted what "users" could do, and Bright Data argued that when it scraped without logging into any account, it was not a "user" bound by the agreement. The court agreed: the terms contemplated restrictions on account holders actively using their accounts, not on anonymous visitors or automated programs reading publicly visible content. The court also rejected the argument that merely visiting a site and accepting cookies formed a contract, and noted—tellingly—that Meta had once included "by accessing Facebook you agree" language and had removed it, which suggested an intent to limit the terms to registered users. Meta had, in effect, drafted itself out of the protection it now wanted.
The same defendant prevailed again in X Corp. v. Bright Data Ltd. in California state court, where the court dismissed X's terms-of-service claims in 2024 on parallel reasoning: Bright Data's logged-off scraping of public data did not make it a "user" bound by the terms, and terms of service cannot form a contract with parties who never manifested assent through registration or other affirmative conduct. (A related federal action in the Northern District of California likewise narrowed X's theories, though the cases turned in part on a tension between barring scraping and X's own competitive interests.) Together these decisions point to an emerging consensus: platforms cannot unilaterally bind every visitor to their public pages. Enforceable contract claims require actual contract formation—and that usually means the defendant created an account or otherwise affirmatively agreed.
The deeper doctrinal point is how assent is formed, and it explains both the Bright Data outcomes and how a website operator might draft around them. Courts distinguish between clickwrap and browsewrap agreements. A clickwrap agreement requires the user to take an affirmative step, such as checking a box or clicking "I Accept," before proceeding, and courts routinely enforce clickwrap precisely because the user manifestly assented. A browsewrap agreement purports to bind users merely by their use of the site, often through a terms link buried in a footer, and courts enforce it only where the user had actual or constructive notice of the terms. That is a much harder showing, and one that frequently fails when the link is inconspicuous and nothing the user did signaled agreement. The leading case is Nguyen v. Barnes & Noble Inc., 763 F.3d 1171, 1176–77 (9th Cir. 2014), which refused to enforce browsewrap terms against a consumer who had no actual knowledge of them, holding that the lack of conspicuous notice of the terms and the absence of any affirmative manifestation of assent doomed the agreement. Courts will enforce browsewrap where the operator can prove actual notice (through, say, a cease-and-desist letter that put the defendant on notice, or a prominent on-screen warning), which is one more reason cease-and-desist letters matter even when the CFAA does not.
The Bright Data cases sit at the extreme end of this spectrum. A logged-off automated program not only never clicked "I agree" but is not even the kind of "user" the terms addressed; there was no assent of any kind to enforce. This is why the standard advice to operators is to convert browsewrap into clickwrap for anything they truly want to bind—requiring registration and an explicit click before access—and why the advice to scrapers is the mirror image: the act of registering is itself the act of assenting. A scraper that never creates an account has, in most cases, never clicked the box; a scraper that registers has, and the box it clicked very likely forbids what it is about to do. (For the mechanics of drafting enforceable online agreements, our guides on drafting software license agreements and software licensing agreements walk through the assent-and-notice problem in detail, and social media law basics covers platform terms generally.)
Atlas at the Crossroads: One Decision, Three Doctrines
To see how these bodies of law stack on a single choice, follow Atlas as it plans to collect product listings and customer reviews from a retail marketplace to build a pricing-analytics tool. (This is a hypothetical, offered to illustrate the doctrine, not a description of any real company or dispute.) The same target site presents Atlas with three distinct legal questions, and the answer to each turns on a different fact.
The CFAA question asks only whether there is a technological gate. If the listings and reviews are visible to any visitor without logging in, there is no gate, and under Van Buren and hiQ the CFAA is not in play—Atlas is reading open pages. But if some reviews are visible only to logged-in members, and Atlas creates accounts (or, worse, fake accounts) to reach them, it has crossed into gated territory and put itself squarely within the statute's reach, because now there is a gate and Atlas has gone around it.
The contract question asks whether Atlas ever agreed to anything. If Atlas scrapes logged-out, never registering, the Bright Data cases suggest the marketplace's user-only terms do not bind it. But if Atlas registers an account to see more—accepting the terms of service in the process—it has very likely agreed to a clause prohibiting automated collection, and it can be liable for breach of contract even though the underlying data was public and the CFAA was never triggered. This is precisely how hiQ lost a case the technology press thought it had won.
The copyright question asks what Atlas copies and what it does with it. Factual data—prices, product names, SKUs—is not protected by copyright at all, as we explain in the next section, so collecting and analyzing it raises little copyright risk. But the reviews are original expression written by users, and if Atlas reproduces them wholesale, or trains a generative model that can regurgitate them, it has implicated the reproduction right and must rely on a fair-use defense whose outcome, as the AI cases show, is genuinely uncertain.
The lesson is that there is no single answer to "is scraping this site legal?" Atlas's safest path threads all three needles: collect only the publicly visible data, stay logged out so no contract forms, take the factual data rather than wholesale copies of expressive content, and document a legitimate analytical purpose. Change any one of those facts—log in, accept terms, copy the reviews verbatim, train a model that reproduces them—and a different doctrine lights up. Because the doctrines are independent, a scraper can comply with two and still be liable under the third.
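The independence of the three doctrines can be made concrete with a small sketch. It is an illustration of the reasoning in this section, not a legal test: the class, field names, and function are hypothetical, each flag stands in for a fact-intensive question, and a flagged doctrine means only that the question is in play, not that liability follows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CollectionPlan:
    """Hypothetical fact pattern for one collection decision (illustrative only)."""
    reaches_gated_pages: bool   # collects pages visible only to logged-in members
    accepted_terms: bool        # registered an account or clicked "I agree"
    terms_ban_automation: bool  # the accepted terms prohibit automated collection
    copies_expression: bool     # reproduces reviews or other original expression


def doctrines_in_play(plan: CollectionPlan) -> list[str]:
    """Return the doctrines a fact pattern puts at issue, each checked independently."""
    flagged = []
    if plan.reaches_gated_pages:
        flagged.append("CFAA")
    if plan.accepted_terms and plan.terms_ban_automation:
        flagged.append("contract")
    if plan.copies_expression:
        flagged.append("copyright")
    return flagged


# Logged out, public pages only, factual data only.
safest = CollectionPlan(False, False, False, False)
# Same public data, but Atlas registered and accepted terms that ban automation.
registered = CollectionPlan(False, True, True, False)

print(doctrines_in_play(safest))      # []
print(doctrines_in_play(registered))  # ['contract']
```

The second plan mirrors the hiQ outcome described above: no gate was crossed and no expression was copied, yet the contract question is live because of a single changed fact.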
The Copyright Status of the Data Itself: Why Feist Matters
Before reaching the headline-grabbing AI fair-use fights, a more basic copyright question deserves attention, because it governs the bulk of ordinary scraping and is widely misunderstood: is the scraped data even protected by copyright in the first place? For most factual scraping, the answer is no, and the reason is Feist Publications, Inc. v. Rural Telephone Service Co., 499 U.S. 340 (1991).
Feist is the bedrock. A telephone company sued a directory publisher for copying more than a thousand white-pages listings. The Supreme Court held, unanimously, that facts are not copyrightable, not because the copier deserves a break, but because facts do not "owe their origin to an act of authorship." Originality, the Court explained, requires "independent creation plus a modicum of creativity," and a fact is discovered, not created. The Court drove a stake through the old "sweat of the brow" theory, which had rewarded the labor of compiling data: effort alone earns no copyright. As the Court put it, copyright "rewards originality, not effort." Arranging names alphabetically in a phone book is "garden-variety" and "devoid of even the slightest trace of creativity," so the directory's white pages were not protectable, and copying 1,309 listings was not infringement.
For scrapers, Feist is liberating and limiting at once. A compilation of facts can earn a thin copyright—but only in the original selection, coordination, and arrangement of the facts, never in the facts themselves (17 U.S.C. § 101 (definition of "compilation"); § 103). So Atlas can scrape prices, addresses, stock tickers, sports scores, ingredient lists, court docket numbers, and product specifications all day long without committing copyright infringement, because those are facts. What Atlas cannot freely copy is a creatively curated arrangement—and even then, the protection is "thin," meaning a competitor can take the underlying facts and arrange them differently without liability. A database's structure might be protectable; the data flowing through it usually is not. This is why competitive-intelligence and analytics scraping of factual data sits, copyright-wise, on comfortable ground, and why the real copyright fights involve expressive content—articles, photographs, reviews, code, and books—rather than raw numbers.
The contrast inside Atlas's own project illustrates the line perfectly. The product listings are facts; the customer reviews are original literary expression. Atlas can compile the prices into a pricing index without copyright worry. Reproduce the reviews verbatim, or train a model that spits them back out, and Atlas has crossed from the Feist-protected zone of free facts into the contested zone of expressive copying—where it needs a fair-use defense. For readers who want the underlying doctrine, our copyright FAQs and the guide to copyright registration of websites and website content explain what website content is and is not protectable, and legal protection of software covers the special case of scraped code.
Copyright and AI Training: The Frontier Fights
CFAA and contract dominated the analytics-era scraping fights, but generative AI has thrust copyright to the front, because training models requires copying enormous quantities of copyrighted work—text, images, code, music—gathered from across the web. Scraping copyrighted content and training on it potentially implicates the reproduction right (copies are made when content is scraped, stored, preprocessed, and used in training) and possibly the derivative-work right. AI developers defend chiefly through fair use, the doctrine codified at 17 U.S.C. § 107 that permits unauthorized use for purposes like criticism, comment, news reporting, teaching, scholarship, and research, analyzed through four non-exclusive factors:
- the purpose and character of the use, including whether it is commercial and whether it is "transformative";
- the nature of the copyrighted work (creative works are more protected than factual ones);
- the amount and substantiality of the portion used; and
- the effect on the potential market for or value of the original.
These factors are old—the doctrine descends from Justice Story's opinion in Folsom v. Marsh, 9 F. Cas. 342 (C.C.D. Mass. 1841)—and the Supreme Court has repeatedly cautioned that there are no bright-line rules; fair use is "an equitable rule of reason" balanced case by case (Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569, 577 (1994); Sony Corp. of Am. v. Universal City Studios, Inc., 464 U.S. 417, 448 (1984)). The single most important development in modern fair-use law is the rise of "transformativeness": a use that adds new expression, meaning, or purpose, rather than merely substituting for the original, weighs heavily toward fair use. The most cited application in the scraping context predates the AI boom: in Authors Guild, Inc. v. Google, Inc., 804 F.3d 202 (2d Cir. 2015), the Second Circuit held that Google's wholesale digitization of twenty-million-plus books to enable full-text search—copying entire works, commercially, without permission—was nonetheless fair use, because the search function served a transformative purpose and did not substitute for reading the books. Earlier still, Perfect 10, Inc. v. Amazon.com, Inc., 508 F.3d 1146 (9th Cir. 2007), found that copying images to create search thumbnails was transformative. AI developers lean hard on this lineage: training, they argue, is the new search—copying for a transformative analytical purpose, not to republish.
In May 2025 the U.S. Copyright Office released the third part of its report on copyright and artificial intelligence, focused on generative-AI training, which set the tone for the litigation. The Office concluded that the question demands case-by-case analysis—that some training uses will be fair and some will not. It identified factors weighing against fair use, including commercial use of vast quantities of works to produce content that competes with the originals, acquiring works through piracy, and deploying models without guardrails against reproducing copyrighted content in outputs; and factors weighing for fair use, including noncommercial research, uses that do not enable reproduction of the works in outputs, and genuinely transformative purposes. The report is not binding on courts, but its nuanced, fact-intensive framing has proven prophetic, because the cases have split along almost exactly these lines.
It is worth walking the four factors concretely, because they are where these cases are actually decided, and applying them to Atlas's hypothetical use of scraped reviews shows how finely balanced the analysis is.
The first factor, purpose and character, asks whether the use is transformative and whether it is commercial. Training a model to extract statistical patterns, rather than to republish the reviews, leans transformative, as Kadrey and Bartz (both discussed below) found. But Atlas's purpose is commercial, which pulls the other way, and if the model can reproduce the reviews in its outputs, the "transformative" claim weakens considerably: a model that regurgitates its inputs starts to look like a substitute, not a transformation.
The second factor, the nature of the work, asks how creative the copied material is. The reviews are expressive original writing, more protected than the bare factual prices Atlas also collects, so this factor favors the review authors.
The third factor, the amount used, is double-edged in AI cases. Training typically copies works in their entirety, which ordinarily disfavors fair use, yet courts have accepted (following Authors Guild and Perfect 10) that copying the whole work can be reasonable where the technical purpose requires it.
The fourth factor, market effect, is often decisive and is where the AI cases are most unsettled. If Atlas, or the businesses its tool serves, would otherwise have licensed the reviews, and a licensing market for training data exists or is emerging, that market harm weighs heavily against fair use. This is precisely the point on which Kadrey warned that a fuller record might flip the result, and the reason the explosive growth of AI-training licensing deals is steadily tilting the fourth factor against developers.
No single factor controls, and the same use can come out differently as the record—especially on market harm—develops, which is exactly why the Copyright Office refused to draw a bright line and why Atlas cannot treat any training use of expressive content as categorically safe.
Where the Key Cases Stand
Several decisions now illuminate the emerging framework—and because this area is moving fast, their current procedural posture matters as much as their holdings. Today's "win" is tomorrow's appeal.
Thomson Reuters Enterprise Centre GmbH v. ROSS Intelligence Inc. addressed whether training an AI legal-research tool on Westlaw's copyrighted headnotes was fair use. ROSS had obtained Westlaw content through a third party (after Thomson Reuters refused to license to a competitor) and used it to build a rival legal-research AI. In February 2025, Judge Stephanos Bibas (a Third Circuit judge sitting by designation in the District of Delaware) granted partial summary judgment to Thomson Reuters, rejecting ROSS's fair-use defense. He found the use non-transformative because ROSS used the headnotes for the same purpose Thomson Reuters intended—facilitating legal research—and aimed to compete directly; and he found potential harm to both the primary market and the market for licensing data to train AI tools. This was the first significant merits ruling against an AI defendant on fair use, and it landed before the more developer-friendly decisions discussed below. Crucially, it is not the last word: the district court certified the questions for interlocutory appeal, and the Third Circuit—the first federal appeals court to review fair use in AI training—heard oral argument on June 11, 2026, with a decision pending. The appellate ruling, when it comes, will carry far more weight than the district-court opinion and could reshape the entire analysis, particularly since the trial court decided the case before the two influential mid-2025 decisions came down.
Kadrey v. Meta Platforms Inc. cut the other way. Authors sued Meta for training its Llama models on books obtained partly through legitimate channels and partly from "shadow libraries"—pirate repositories. In June 2025, Judge Vince Chhabria granted Meta summary judgment on fair use, finding the training highly transformative: the books exist to be read, while Meta's purpose was to extract statistical patterns to power a text generator, a fundamentally different end. But the decision was pointedly fact-specific and almost grudging. The court found insufficient evidence that Llama's outputs displaced demand for the plaintiffs' books or that a developed AI-training licensing market existed that Meta had usurped, and Judge Chhabria warned in unusually direct terms that a richer record on market harm—particularly evidence of "market dilution," the flooding of the market with AI-generated competing works—might tip future cases the other way. In other words, Meta won this case but may have written the roadmap for plaintiffs to win the next one.
Bartz v. Anthropic PBC drew the sharpest and most consequential distinction of all. Judge William Alsup held in June 2025 that training Claude on copyrighted books was fair use because generative AI is "quintessentially transformative"—but he carved off Anthropic's acquisition of millions of pirated books from shadow libraries as a separate act that the eventual transformative use could not justify. Downloading and permanently keeping pirated copies, he held, was not fair use merely because some of those books would later feed transformative training; the piracy was a distinct wrong. That ruling carried staggering stakes. Anthropic agreed to a landmark $1.5 billion settlement, covering roughly 500,000 works at about $3,000 each, and agreed to destroy the pirated datasets. As of mid-2026 the settlement has received preliminary approval (granted in 2025) and the claims process has proceeded, with the deal not licensing future use or releasing claims about infringing outputs. (Anthropic is the maker of Claude; this article is published with no involvement from that company, and its inclusion here is purely as a matter of public legal record.) The throughline across Bartz, Kadrey, and the Copyright Office report is unmistakable: training may well be transformative, but how the data was acquired matters independently of how it is used. Obtaining content through piracy poisons the fair-use defense regardless of the eventual purpose. For Atlas, that principle is a bright warning light: even a defensible end-use cannot cleanse data obtained through illegitimate channels.
Two more suits round out the picture and show the same fights playing out across media. In Andersen v. Stability AI Ltd. (N.D. Cal.), visual artists sued the makers of image generators trained on billions of images scraped into the LAION datasets. In a 2024 ruling, the court allowed the artists' direct copyright infringement claim to proceed into discovery insofar as it rested on the unauthorized copying of "Training Images," while dismissing the more aggressive theory that the model itself is a "derivative work" containing compressed copies of every training image, and dismissing certain DMCA copyright-management-information claims. The case confirms that the scraping-and-training copying is a cognizable infringement to be tested against fair use, even where the "model-is-a-copy" theory fails. And in The New York Times Co. v. Microsoft Corp. & OpenAI, the Times alleges both that training on its articles infringed and—more pointedly—that the models can be prompted to regurgitate near-verbatim passages from its paywalled journalism, which goes directly to fair use's first and fourth factors: a model that reproduces its inputs is harder to call transformative and easier to call a market substitute. The Times litigation, with its evidence of memorization and output overlap, may prove more dangerous to AI defendants than the book cases, precisely because it attacks the "we only extract patterns, we never reproduce" defense at its root. We track these copyright fights in depth in copyright infringement claims against generative AI, and the platform-liability dimension—who answers when user-generated or model-generated content infringes—in Section 230 reform and platform liability for user-generated IP infringement.
Technical Barriers, robots.txt, and the DMCA
Because the law now turns so heavily on whether there was a "gate," the legal significance of technical barriers has become central—and the barriers come in a hierarchy.
At the bottom sits robots.txt, the decades-old protocol that lets a site signal which areas automated programs should avoid. Reputable crawlers honor it by convention, but it is advisory, not an access-control mechanism—the digital equivalent of a "please keep off the grass" sign, not a fence. A robots.txt line saying "don't scrape this" is not a password gate, and courts have generally declined to treat robots.txt violations alone as CFAA breaches. That said, robots.txt is not legally weightless. Scholarship in Computer Law & Security Review and elsewhere has argued it can matter under contract and tort theories—potentially forming a unilateral contract (continued access in exchange for compliance) or, more plausibly, supplying the notice element for a trespass or breach theory once a scraper has read it and proceeded anyway. A scraper that reads "no bots" and keeps scraping has, at minimum, handed its adversary an exhibit on the question of bad faith.
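To make the advisory nature of robots.txt concrete, here is a minimal Python sketch of the check a collector such as Atlas might run before fetching a page. It uses the standard library's urllib.robotparser. The robots.txt content, the bot names AtlasBot and ExampleAIBot, and the example.com URLs are all hypothetical and chosen only for illustration. The point the code makes is the same one the courts have made: the check is something the crawler chooses to run, and nothing in the protocol stops a crawler that skips it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named AI crawler is excluded entirely,
# and every other bot is asked to stay out of /reviews/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /reviews/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The site asks all bots to avoid /reviews/ but leaves /prices open.
print(is_allowed(ROBOTS_TXT, "AtlasBot", "https://example.com/prices"))
print(is_allowed(ROBOTS_TXT, "AtlasBot", "https://example.com/reviews/123"))
# The named AI crawler is asked to stay away from everything.
print(is_allowed(ROBOTS_TXT, "ExampleAIBot", "https://example.com/prices"))
```

A collector that logs the result of this check for each request also builds a record of good faith, which is the mirror image of the bad-faith exhibit described above.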
In the middle and at the top sit the real gates. CAPTCHAs, login requirements, rate limiting, IP blocking, and device fingerprinting are stronger signals that access is not freely permitted, and the hiQ, Power Ventures, and Bright Data cases consistently treat content behind authentication as far better protected than open content. A scraper that bypasses CAPTCHAs, creates fake accounts, rotates residential proxies to defeat IP blocks, or circumvents authentication faces dramatically greater exposure than one collecting only freely accessible material—not just under the CFAA (now there is a gate to go around) but under the next statute in the toolkit.
Circumvention can trigger a distinct cause of action: the DMCA's anti-circumvention provision, Section 1201 (17 U.S.C. § 1201), which prohibits circumventing technological measures that "effectively control access" to copyrighted works, and trafficking in tools designed to do so. Password systems and encryption clearly qualify; whether anti-bot measures like CAPTCHAs and bot-detection systems are "technological protection measures" within § 1201 remains unsettled and is one of the live frontier questions in scraping law. The question is being litigated: in late 2025, Google sued SerpApi, a company that scrapes Google search results, alleging that it circumvented Google's anti-scraping defenses in violation of § 1201, a novel application of the DMCA to scraping that, if it proceeds to judgment, could establish important precedent on whether defeating bot-detection is "circumvention" at all. Because statutory damages for § 1201 violations, available under 17 U.S.C. § 1203(c)(3)(A), run from $200 to $2,500 per act of circumvention, large-scale scraping that defeats access controls could, in theory, generate enormous liability, with every defeated CAPTCHA counted as a separate violation.
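The scale of that theoretical exposure is easier to see with the arithmetic written out. The short Python sketch below multiplies the $200 and $2,500 per-act figures quoted above by a count of circumvention acts. The figure of 10,000 defeated CAPTCHAs is a hypothetical, and the calculation assumes that a court would treat each defeated CAPTCHA as a separate act, which is exactly the question that remains unsettled.

```python
# Statutory range per act of circumvention, as quoted in the text above.
MIN_PER_ACT_USD = 200
MAX_PER_ACT_USD = 2_500


def statutory_exposure_range(acts: int) -> tuple[int, int]:
    """Return (low, high) theoretical statutory damages in dollars.

    Assumes every act is counted separately, which is an open legal question.
    """
    if acts < 0:
        raise ValueError("acts must be non-negative")
    return acts * MIN_PER_ACT_USD, acts * MAX_PER_ACT_USD


# Hypothetical: a scraping run that defeats 10,000 CAPTCHAs.
low, high = statutory_exposure_range(10_000)
print(f"${low:,} to ${high:,}")  # prints "$2,000,000 to $25,000,000"
```

Even at the statutory minimum, a modest scraping campaign reaches seven figures under this assumption, which is why the per-act counting question matters so much.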
Section 1201 also intersects with the broader copyright-enforcement toolkit operators already use. A site whose content is copied and republished can pursue takedowns through the notice-and-takedown regime, which we cover in how to file a DMCA takedown notice and respond to one; the anti-circumvention provisions of § 1201 sit alongside that regime as a separate cause of action aimed not at the copying itself but at the defeat of the technical measures guarding the content. For an operator deciding how to respond to a persistent scraper, the practical question is which tool fits the facts: a takedown notice addresses copies that have been republished, a § 1201 claim addresses the circumvention of access controls, a contract claim addresses a scraper who accepted terms, and a trespass claim addresses harm to the servers. The theories are cumulative, and a well-advised operator selects among them based on what the scraper actually did—rather than reaching reflexively for the CFAA, which, after Van Buren and hiQ, is often the weakest available theory against a scraper of public data.
Common-Law Backstops: Trespass to Chattels and Misappropriation
When the statutory claims fail, website operators turn to common-law theories, and the hiQ settlement's stipulations to trespass to chattels and misappropriation under California law show these remain viable backstops rather than museum pieces.
Trespass to chattels treats interference with a site's servers as analogous to interference with physical personal property—a tort with medieval roots repurposed for the cloud. The doctrine had a brief, dramatic run in the early internet era: in eBay, Inc. v. Bidder's Edge, Inc., 100 F. Supp. 2d 1058 (N.D. Cal. 2000), a court enjoined an aggregator's scraping of eBay auctions on a trespass theory, reasoning that the unauthorized queries consumed server capacity. But the California Supreme Court reined the doctrine in sharply in Intel Corp. v. Hamidi, 71 P.3d 296 (Cal. 2003), holding that trespass to chattels requires actual harm to the system or its functioning—mere unauthorized contact, or consumption of resources without measurable impairment, does not suffice. The practical upshot: aggressive scraping that slows a site, exhausts bandwidth, drives up cloud bills, or causes outages presents a real trespass claim, while light-touch, well-rate-limited scraping of public pages usually does not. This is why rate limiting is simultaneously the scraper's best defense (it avoids creating the harm that makes trespass actionable) and the operator's best evidence (logs showing degradation supply the missing element). It is also a sleeper reason to scrape gently rather than greedily: a polite bot is a lawful bot far more often than an impatient one.
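For readers who want to see what "scraping gently" means in practice, the following is a minimal Python sketch of a limiter that enforces a minimum gap between consecutive requests to a single site. The class name MinIntervalLimiter, the one-second interval, and the example paths are assumptions made for illustration; no particular interval is a legal safe harbor, and whether a given load caused actual harm remains a question of fact.

```python
import time
from typing import Callable, Optional


class MinIntervalLimiter:
    """Enforces a minimum gap between consecutive requests to one site."""

    def __init__(
        self,
        min_interval_s: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        if min_interval_s <= 0:
            raise ValueError("min_interval_s must be positive")
        self._min_interval_s = min_interval_s
        self._clock = clock
        self._sleep = sleep
        self._next_allowed: Optional[float] = None  # None until the first request

    def wait(self) -> float:
        """Block until the next request is permitted; return seconds waited."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._clock()
        self._next_allowed = now + self._min_interval_s
        return delay


if __name__ == "__main__":
    # Hypothetical usage: at most one request per second to a single host.
    limiter = MinIntervalLimiter(min_interval_s=1.0)
    for path in ["/prices/1", "/prices/2", "/prices/3"]:
        waited = limiter.wait()
        print(f"waited {waited:.1f}s, then would fetch {path}")
```

The clock and sleep functions are injectable so the limiter can be tested without real delays. Logging each wait alongside the request gives the collector its own record that it kept its load modest, the counterpart to the operator's degradation logs.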
Misappropriation, descended from the Supreme Court's 1918 "hot news" decision in International News Service v. Associated Press, 248 U.S. 215 (1918), occasionally protects time-sensitive factual content—stock quotes, breaking news, sports scores—against free-riding competitors even though the underlying facts are not copyrightable under Feist. The doctrine is narrow, varies by state, and the Second Circuit cabined it tightly in NBA v. Motorola, Inc., 105 F.3d 841 (2d Cir. 1997), which preserved a hot-news claim only for a thin set of cases involving time-sensitive information, free-riding, and a direct threat to the plaintiff's incentive to produce the information. Outside that narrow channel, federal copyright law often preempts state misappropriation claims, so operators cannot use misappropriation as an all-purpose end-run around Feist. Unjust enrichment claims face the structural difficulty that public content is, by definition, freely accessible—it is hard to call it "unjust" to take what was offered to the world—but such claims can add settlement pressure when pleaded alongside stronger theories. And where the scraped content includes confidential or proprietary material rather than public pages, an operator may also reach for trade secret misappropriation, a theory we develop in building a trade secret protection program from scratch and protection of trade secrets.
Practical Guidance: For Data Collectors
The case law translates into a usable risk spectrum, and Atlas can keep itself toward the safe end by internalizing a handful of rules. The single most important factor is the public-versus-restricted line: scrape only data that is genuinely public—available to any visitor without logging in—because content behind authentication carries far higher risk under every doctrine at once. Never use fake accounts; doing so transformed hiQ's defensible operation into a clear breach of contract and helped sink it, and it can supply the gate that brings the CFAA back into play. Respect technical barriers: while robots.txt alone may not bind, treating barriers as meaningful demonstrates good faith, supports a fair-use posture, and avoids the bad-faith exhibits that make any eventual case worse—and one should never circumvent CAPTCHAs, defeat IP blocks, rotate residential proxies to evade detection, or bypass authentication, all of which invite CFAA and DMCA § 1201 exposure. Read the terms of service carefully and decide deliberately whether to register, because the Bright Data outcome turned on what the terms actually prohibited and to whom they applied—but remember that the moment Atlas creates an account, it has likely agreed to terms that ban scraping, converting a logged-out gray area into a logged-in breach. Implement aggressive rate limiting, since scraping that degrades a site creates trespass exposure under Intel v. Hamidi and can satisfy the CFAA's damage provisions. Prefer facts over expression: collecting prices, specifications, and other Feist-unprotected data is far safer than reproducing articles, photographs, reviews, or code. Document legitimate, ideally transformative, purposes, because fair use rewards research and genuine transformation over naked commercial substitution. 
Scrutinize the source of content, since Bartz makes clear that acquiring data through piracy independently undermines fair use—so never scrape pirate sites or obtain content through obviously illegitimate channels, no matter how transformative the planned use. And given how fast the law is moving, maintain ongoing legal review of target selection, methods, jurisdictions, and data use; a compliance posture that was safe in January may not be in June. For collectors operating internationally, our note on international data transfers after Schrems II and the discussion of foreign text-and-data-mining regimes below are essential reading, as is biometric data privacy laws and their impact on AI development wherever the scraped data includes faces, voices, or other biometrics.
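The rules above can be sketched as a simple pre-collection checklist. The Python below is a bookkeeping aid, not a legal test: the CollectionPlan fields and the risk_flags function are hypothetical names invented for this illustration, and each flag merely restates a risk factor already identified in this section. It assigns no scores and draws no conclusions, since how the factors combine depends on the facts of each case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CollectionPlan:
    """Hypothetical description of one planned scraping job."""
    requires_login: bool
    uses_fake_accounts: bool
    circumvents_barriers: bool   # CAPTCHAs, IP blocks, proxy rotation, auth bypass
    rate_limited: bool
    copies_expression: bool      # articles, photos, reviews, code (vs. bare facts)
    source_is_pirated: bool


def risk_flags(plan: CollectionPlan) -> list[str]:
    """Return the risk factors from this section that the plan triggers."""
    flags = []
    if plan.requires_login:
        flags.append("content behind authentication: contract and CFAA exposure")
    if plan.uses_fake_accounts:
        flags.append("fake accounts: breach of contract and evidence of bad faith")
    if plan.circumvents_barriers:
        flags.append("circumvention: CFAA and DMCA section 1201 exposure")
    if not plan.rate_limited:
        flags.append("no rate limiting: trespass exposure if the site is degraded")
    if plan.copies_expression:
        flags.append("expressive content: needs a fair-use defense")
    if plan.source_is_pirated:
        flags.append("pirated source: undermines fair use regardless of purpose")
    return flags


# Hypothetical low-risk plan: public, logged out, rate limited, facts only.
public_prices = CollectionPlan(False, False, False, True, False, False)
print(risk_flags(public_prices))  # prints []
```

A plan that returns an empty list sits toward the safe end of the spectrum described above; any flag is a prompt for legal review before collection starts, not a verdict.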
Practical Guidance: For Website Operators
Operators protecting their content should layer technical, contractual, and enforcement measures, because each individual measure has a known weakness that another covers. The most effective single legal protection is to place valuable content behind authentication, because the Van Buren framework consistently affords gated content far stronger protection than open content—the gate is what makes the CFAA, and a clean trespass and contract theory, available. Draft terms of service that actually bind the people you mean to reach: the Bright Data cases exposed the fatal weakness of terms that restrict only "users," so consider terms that clearly apply to all visitors, that are presented as clickwrap requiring affirmative acceptance rather than buried browsewrap, and that specifically prohibit automated collection, scraping, and the use of the content for AI training. Remember Meta's self-inflicted wound: it had once written "by accessing Facebook you agree" and removed it. Words in terms of service are evidence; draft them as though a judge will parse every one. Publish a comprehensive robots.txt, which—though not independently enforceable—establishes notice and may support contract or tort theories and increasingly serves as the locus of "AI opt-out" signals. Deploy technical anti-bot measures—CAPTCHAs, rate limiting, device fingerprinting, behavioral analysis—which deter casual scrapers, create the evidence of circumvention that a § 1201 or CFAA claim needs, and establish the very "gates" the post-Van Buren framework requires. Monitor for scraping, because rights cannot be enforced against violations that go undetected, and because logs of server degradation supply the actual harm element that Intel v. Hamidi demands of a trespass claim. 
Send cease-and-desist letters strategically, recognizing that—as hiQ, Power Ventures, and Ticketmaster together show—such letters do not convert public access into a CFAA violation by themselves, but a well-drafted letter that expressly revokes access and deploys an IP block can do real work, and at minimum establishes the notice that makes browsewrap and trespass theories viable. And register copyrights in valuable expressive content, since registration is a precondition to suit and unlocks statutory damages and attorneys' fees (our how to register a copyright with the U.S. Copyright Office and copyright registration of websites and website content guides walk through the mechanics). Keep in mind Feist's ceiling, though: registration protects original expression, not the underlying facts, so a site whose value is mostly factual data should lean on contract and technical measures rather than expecting copyright to do the heavy lifting.
A final strategic point ties these tactics together: the strongest protection is layered, because each individual measure has a predictable gap. Authentication defeats the CFAA problem for outside scrapers but does nothing about a logged-in insider; a clickwrap contract binds registered users but not anonymous bots; robots.txt provides notice but no enforcement; anti-bot technology deters but can be circumvented; copyright registration enables strong remedies but only for protectable expression, not facts. An operator that relies on a single mechanism leaves a predictable hole. An operator that combines authentication, a properly formed clickwrap agreement, a published robots.txt, technical anti-bot measures, monitoring, and copyright registration forces a would-be scraper to overcome several independent barriers at once—and in doing so generates, almost as a byproduct, the evidence of circumvention, breach, and bad faith that makes any eventual enforcement action far stronger. The same layering logic that protects trade secrets and software protects web content: no single wall is impregnable, but several walls together are what actually hold. For operators worried about hostile collection during a breach rather than ordinary scraping, cybersecurity incident response and IP protection addresses the harder case.
Looking Ahead: AI Licensing, Legislation, and the Global Patchwork
The framework remains in flux, and several forces will shape it. The wave of AI-training copyright suits will set precedent on fair use that could validate or upend the practices fueling the current AI boom—and the Third Circuit's forthcoming decision in Thomson Reuters v. ROSS, as the first appellate word on AI-training fair use, looms especially large; it could either endorse the developer-friendly transformativeness reasoning of Bartz and Kadrey or entrench the ROSS trial court's market-harm skepticism. Legislative developments could create specific rules for AI training data, either authorizing broad scraping (perhaps with statutory compensation, on the model of compulsory music licenses) or imposing new licensing or disclosure requirements; bills to require AI developers to disclose their training data have circulated in several states and Congress. The Copyright Office's 2025 report, by rejecting both blanket authorization and blanket prohibition, points courts toward fact-intensive analysis rather than bright lines—which is good for justice and bad for predictability.
Industry practice is itself shifting the legal calculus in a way that should worry developers who plan to rely on fair use. As major publishers, news organizations, image libraries, and book publishers increasingly license content for AI training—and as AI developers sign nine- and ten-figure deals to do so—the existence of a robust licensing market makes it progressively harder for any developer to argue that no such market exists. That argument was the linchpin of Kadrey's fourth-factor analysis, and every new licensing deal chips away at it. The market that Meta said did not yet exist is being built in real time, and the fourth factor moves with it.
For a company like Atlas operating across borders, the international patchwork is not an abstraction but a planning constraint. The European Union's Digital Single Market Directive created text-and-data-mining exceptions that permit certain copying for analysis—broad for research institutions, narrower for commercial actors, and subject to an "opt-out" mechanism that lets rightsholders reserve their works from commercial mining, which sophisticated publishers increasingly invoke (and which the EU AI Act reinforces). Japan has gone further toward permissiveness, with statutory provisions widely read to allow copying for information analysis including AI training, making it, for a time, an unusually attractive jurisdiction for model development. The United States, by contrast, channels the same questions through the open-ended, unpredictable fair-use doctrine, where the answer depends on a four-factor balance that no one can call in advance. The consequence is that the same training run can be lawful in one jurisdiction and infringing in another, and a model trained on data collected under a permissive foreign regime may still face infringement exposure when deployed to users in a stricter one. Atlas cannot assume that compliance in its home jurisdiction travels with the data; it has to map the rules of every place where it collects and every place where it deploys. The divergence among regimes is itself a reason to favor licensed, clearly sourced data over scraped content of uncertain provenance—the provenance problem follows the data everywhere it goes.
Conclusion
The web-scraping landscape has reached partial clarity atop persistent uncertainty. The CFAA does not bar scraping publicly accessible data from sites that require no authentication: Van Buren and hiQ establish that the statute needs a technological gate, not a policy. But the CFAA is not the only source of liability, and after hiQ it is often the weakest. Contract claims can succeed where a scraper agreed to terms by creating an account or clicking through, and the clickwrap/browsewrap line decides whether assent ever formed. Copyright sorts scraped content into two worlds: the Feist-free zone of facts, prices, and data, which a scraper may take freely, and the contested zone of expressive works, where fair use is a real but unsettled defense and the manner of acquisition (piracy versus lawful access) can be decisive regardless of how transformative the end use is. Common-law torts like trespass to chattels fill the gaps where scraping harms a site's systems, but only on proof of actual harm. And technical measures matter legally as well as practically: content behind authentication enjoys stronger protection, and circumventing barriers can stack DMCA § 1201 liability on top of CFAA and contract exposure.
For everyone in this space, the organizing insight is that risk runs along a spectrum, and a handful of facts move a scraper along it. Atlas scraping truly public data, logged out, gently rate-limited, taking facts rather than expression, respecting technical barriers, and never forming a contract sits at the low-risk end. The moment it logs in, fakes an account, circumvents a barrier, copies expressive content wholesale, or sources from pirate channels, it slides sharply toward the high-risk end—and into the territory where hiQ, ROSS, and Anthropic each learned, expensively, that the law had more teeth than they assumed. The cases of the past few years have settled important principles, but the final chapter—above all on AI-training fair use, now pending before the Third Circuit—remains unwritten, which makes close attention and adaptable compliance the only durable strategy.
For assistance with data-collection compliance, terms-of-service drafting, AI-training data strategy, or defending against a scraping claim, contact our intellectual property and technology practice.
Frequently Asked Questions
Is web scraping legal? There is no single answer, which is the whole point of this article. Scraping genuinely public data while logged out, without circumventing any technical barriers and without forming a contract, is generally lawful under the CFAA after Van Buren and hiQ—but the same conduct can still breach an enforceable contract, infringe copyright in expressive content, or commit a trespass if it harms the site's servers. The legality of any given scrape depends on what you collect, whose site it is, how you get in, and what you do with the data afterward.
Does violating a website's terms of service make scraping a federal crime? No, not by itself. Van Buren held that the CFAA's "exceeds authorized access" clause is about going around a technological gate, not about violating a use policy, and hiQ extended the logic to "without authorization" for public sites. A bare terms-of-service violation is therefore not a CFAA crime. But it may be a breach of contract—if you actually agreed to the terms—and that is exactly how hiQ ultimately lost.
Can a website bind me to its terms of service if I never created an account? Usually not. The Meta v. Bright Data and X Corp. v. Bright Data decisions held that terms restricting "users" do not bind anonymous, logged-out scrapers who never registered, because there was no assent and the scraper was not the kind of "user" the terms addressed. Browsewrap terms presented in a footer link, with no affirmative acceptance, are frequently unenforceable for the same reason (see Nguyen v. Barnes & Noble). The picture changes the instant you register an account and click "I agree" (a clickwrap agreement), at which point you are very likely bound by whatever it says, including a no-scraping clause.
Is the data I scrape protected by copyright? Often not. Under Feist v. Rural Telephone, facts are not copyrightable—prices, addresses, scores, specifications, and similar data are free to take. A database can have a "thin" copyright in its original selection and arrangement, but not in the underlying facts. What is protected is original expression: articles, photographs, reviews, code, and books. Copy those wholesale and you need a fair-use defense.
Is using scraped data to train an AI model fair use? Sometimes; the question is genuinely unsettled. Courts found AI training "transformative" in Kadrey v. Meta and Bartz v. Anthropic, but the trial court in Thomson Reuters v. ROSS rejected the defense, and that case is now on appeal to the Third Circuit. Two themes recur: a model that can reproduce its training data in outputs (as the New York Times alleges of OpenAI) is harder to call transformative, and the growth of an AI-training licensing market increasingly cuts against fair use on the fourth factor. The U.S. Copyright Office's 2025 report frames the question as fact-specific, which is also how the courts are treating it.
Does it matter how I obtained the training data? Yes, decisively. Bartz v. Anthropic held that downloading and keeping pirated books was not fair use even though the eventual training use was transformative, and Anthropic settled for $1.5 billion over the piracy. The manner of acquisition is analyzed independently of the end-use: a transformative purpose does not launder data obtained through illegitimate channels. Lawful provenance is not a technicality; it is often what decides the case.
Can I scrape data that sits behind a login if I create an account to get in? This is the high-risk move. Creating an account almost always means accepting terms of service (a likely breach of contract if they ban scraping), and logging in puts the content behind a "gate," which brings the CFAA back into play and can support a trespass claim. Using fake accounts, as hiQ did, is worse still and supplies powerful evidence of bad faith. The safest posture is to collect only what is visible without logging in.
What can a website operator actually do to stop scrapers? Layer the defenses, because each has a gap. Put valuable content behind authentication; use clickwrap terms that bind all visitors and ban automated collection and AI training; publish a robots.txt; deploy CAPTCHAs, rate limiting, and bot detection; monitor server logs for harmful load (which supplies the "actual harm" element a trespass claim needs under Intel v. Hamidi); send well-drafted cease-and-desist letters that expressly revoke access; and register copyrights in original expressive content to unlock statutory damages. No single measure is sufficient, but together they force a scraper over several independent barriers and build the record for any eventual suit.
Does robots.txt have any legal force? Not on its own—it is advisory, and ignoring it is not by itself a CFAA violation. But it is not worthless: it can establish notice, which supports contract and trespass theories, and it is increasingly where sites place "no AI training" opt-out signals. A scraper that reads "no bots" and proceeds anyway hands its adversary an exhibit on bad faith.
Do foreign laws change the analysis? Significantly. The EU's text-and-data-mining exceptions (with a rightsholder opt-out) and Japan's permissive information-analysis provisions treat training-related copying more leniently than U.S. fair use does in many cases. But the same training run can be lawful in one country and infringing in another, and deploying a model to users in a stricter jurisdiction can create exposure there regardless of where the data was collected. Cross-border operations have to map both collection and deployment jurisdictions.
Selected Authorities
Computer Fraud and Abuse Act, 18 U.S.C. § 1030; Copyright Act, 17 U.S.C. §§ 101, 102, 103, 107; Digital Millennium Copyright Act, 17 U.S.C. § 1201. Van Buren v. United States, 593 U.S. 374 (2021); hiQ Labs, Inc. v. LinkedIn Corp., 938 F.3d 985 (9th Cir. 2019), vacated and remanded, 141 S. Ct. 2752 (2021), reaffirmed, 31 F.4th 1180 (9th Cir. 2022); Facebook, Inc. v. Power Ventures, Inc., 844 F.3d 1058 (9th Cir. 2016); Craigslist Inc. v. 3Taps Inc., 964 F. Supp. 2d 1178 (N.D. Cal. 2013); Feist Publications, Inc. v. Rural Telephone Service Co., 499 U.S. 340 (1991); Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569 (1994); Authors Guild, Inc. v. Google, Inc., 804 F.3d 202 (2d Cir. 2015); Perfect 10, Inc. v. Amazon.com, Inc., 508 F.3d 1146 (9th Cir. 2007); Nguyen v. Barnes & Noble Inc., 763 F.3d 1171 (9th Cir. 2014); Meta Platforms, Inc. v. Bright Data Ltd. (N.D. Cal. 2024); X Corp. v. Bright Data Ltd. (Cal. & N.D. Cal. 2024); Thomson Reuters Enterprise Centre GmbH v. ROSS Intelligence Inc. (D. Del. 2025; on interlocutory appeal, 3d Cir., argued June 11, 2026); Kadrey v. Meta Platforms Inc. (N.D. Cal. 2025); Bartz v. Anthropic PBC (N.D. Cal. 2025; settlement preliminarily approved); Andersen v. Stability AI Ltd. (N.D. Cal.); The New York Times Co. v. Microsoft Corp. (S.D.N.Y.); Intel Corp. v. Hamidi, 71 P.3d 296 (Cal. 2003); eBay, Inc. v. Bidder's Edge, Inc., 100 F. Supp. 2d 1058 (N.D. Cal. 2000); International News Service v. Associated Press, 248 U.S. 215 (1918); NBA v. Motorola, Inc., 105 F.3d 841 (2d Cir. 1997). U.S. Copyright Office, Copyright and Artificial Intelligence, Part 3: Generative AI Training (2025).
This article is for general informational purposes only and does not constitute legal advice, nor does it create an attorney-client relationship. The law governing web scraping and AI-training data is evolving rapidly, several key cases remain on appeal or pending final approval, and outcomes are highly fact-specific; consult qualified counsel about your particular situation before acting.