A number that can win or lose a case
Picture two trademark cases, identical in every legally relevant respect. Same marks, same goods, same shelf, same shoppers. In the first, the plaintiff's survey expert takes the stand and reports that 28% of relevant consumers, net of a control, were confused about who made the defendant's product. The jury believes her. In the second, the defendant's expert reports that confusion was a statistically trivial 3%, and the judge tosses the plaintiff's competing survey out of the case entirely before trial. Same dispute, opposite outcomes.
The difference is not the consumers. It is the methodology—the chain of design choices that begins with whom you ask and runs through what you show them, how you ask, what you compare against, and how you count the answers. A consumer survey is, at bottom, a scientific instrument pointed at a question the law cannot otherwise answer: are ordinary buyers likely to be confused about the source of a product? There is no way to peer directly into the collective mind of the buying public, so the law accepts the next best thing—a controlled measurement of a representative slice of it. Built well, that instrument produces a defensible number: this share of the relevant public was confused, net of noise. Built badly, it is a funhouse mirror that reflects back whatever the surveyor hoped to see.
This guide is a practical tour of how that instrument is engineered and read, written for litigators, in-house counsel, and the business owners who pay for the survey and live with the result. It is the design-and-reading companion to our article on Daubert challenges to consumer survey experts, which focuses on the admissibility gateway. Here we focus on the instrument itself—how a sound survey is constructed and interpreted, and exactly where unsound ones go wrong. Surveys feed directly into the substantive confusion analysis we examine in our article on the Polaroid factors on summary judgment in the Second Circuit, and they show up in dilution, secondary-meaning, and genericness fights as well. By the end you will be able to read a survey report the way a seasoned cross-examiner does: in the order that the design choices actually matter.
The authorities we lean on throughout are the ones courts and experts actually use. The Federal Judicial Center's Reference Manual on Scientific Evidence contains a Reference Guide on Survey Research—authored in its leading editions by Professor Shari Seidman Diamond—that sets out the accepted criteria for a trustworthy survey, and federal courts cite it constantly. McCarthy on Trademarks and Unfair Competition (§§ 32:158 et seq.) is the standard treatise on the subject. Practitioner bodies such as the International Trademark Association (INTA) publish survey guidelines, and the case law—Union Carbide Corp. v. Ever-Ready, Inc., 531 F.2d 366 (7th Cir. 1976), SquirtCo v. Seven-Up Co., 628 F.2d 1086 (8th Cir. 1980), and their progeny—supplies the working vocabulary. Measure any survey against these, and its strengths and weaknesses come into sharp focus.
The foundational choice: defining the universe
Every survey begins with a single decision that, more than any other, determines whether it measures anything the law cares about: the definition of the universe. The universe is the population whose perceptions are legally relevant—the group whose confusion, if it exists, establishes a likelihood of confusion. Get the universe wrong and nothing downstream can save the survey, because you measured the wrong people. As the Reference Guide on Survey Research puts it, the first question to ask of any survey is whether it sampled the appropriate population; an elegant instrument trained on the wrong crowd is worthless.
The correct universe depends on the theory of confusion, and getting that theory straight is a prerequisite to everything else. In the ordinary case of forward confusion—where consumers might mistakenly believe the junior user's goods come from the senior user—the relevant universe is the prospective purchasers of the junior user's (the defendant's) goods or services. The legal worry is whether those buyers, encountering the junior mark, will be confused about source, so it is their state of mind that must be measured. Surveying the senior user's existing customers, or the general public when the product is specialized, measures the wrong thing entirely. In a reverse confusion case—where a larger junior user saturates the market and consumers come to believe the smaller senior user's goods come from the junior user—the universe flips to the prospective purchasers of the senior user's goods. (We unpack that mirror-image theory in Reverse confusion in trademark law; the survey universe is one of the places where reverse confusion most often trips up litigants who fail to notice that they have a reverse-confusion case.)
Two recurring errors deserve names because experts on both sides hunt for them. An over-inclusive universe sweeps in people who are not relevant purchasers—diluting the sample with respondents whose perceptions do not bear on the legal question and typically distorting the result. An under-inclusive universe excludes relevant purchasers, producing a sample that cannot fairly be generalized back to the population the law cares about. Both errors corrode probative value, and the Second Circuit has repeatedly stressed that defining the proper universe is fundamental. See Bristol-Myers Squibb Co. v. McNeil-P.P.C., Inc., 973 F.2d 1033 (2d Cir. 1992) (treating the proper universe as central to a survey's probative value in a dispute over the analgesic brands EXCEDRIN PM and TYLENOL PM). District courts in the Circuit have discounted or rejected surveys built on an improperly defined universe. See Hutchinson v. Essence Communications, Inc., 769 F. Supp. 541 (S.D.N.Y. 1991) (addressing the consequences of a flawed survey universe).
Here is why a universe error is so often fatal rather than merely damaging: it cannot be cured by cross-examination or argument. A survey that asked the wrong people simply has no probative answer to give, no matter how graceful the rest of its design. You cannot argue your way back to relevance. Contrast that with, say, a slightly leading question or an arguably small sample—defects a court may treat as going to weight, leaving the jury to discount the result. The universe is the load-bearing wall; the rest is finish carpentry.
The universe is operationalized through screening (filter) questions at the start of the survey, which qualify each respondent as a member of the relevant population—confirming, for example, that the respondent has bought or is likely to buy products of the relevant kind within a relevant timeframe and price range. Well-drafted screeners are what translate an abstract universe definition into an actual, defensible sample. Poorly drafted screeners are a back door through which the universe problem walks right back in: a screener that is too loose admits the wrong respondents; one that is too tight quietly excludes the right ones. A favorite cross-examination move is to show that the screeners did not actually capture the universe the expert claims to have sampled—that the "prospective purchasers of premium running shoes" were, in fact, anyone who had bought any footwear in the past year.
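To make the loose-versus-tight problem concrete, here is a minimal Python sketch of a screener applied to a respondent's answers. Everything in it is hypothetical: the field names, the twelve-month window, and the $100 price floor are illustrative assumptions built around the running-shoe example, not criteria drawn from any case or guideline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreenerAnswers:
    """Hypothetical answers to the filter questions at the start of a survey."""
    bought_running_shoes_past_12_months: bool
    likely_to_buy_running_shoes_next_12_months: bool
    bought_any_footwear_past_12_months: bool
    expected_price_usd: float  # what the respondent paid or expects to pay per pair


def loose_screener(a: ScreenerAnswers) -> bool:
    # Too loose: admits anyone who bought any footwear at all.
    return a.bought_any_footwear_past_12_months


def premium_running_shoe_screener(a: ScreenerAnswers, min_price_usd: float = 100.0) -> bool:
    # Tracks the claimed universe: past or prospective buyers of the relevant
    # product, within the relevant price range (the floor is an assumed figure).
    in_market = (a.bought_running_shoes_past_12_months
                 or a.likely_to_buy_running_shoes_next_12_months)
    return in_market and a.expected_price_usd >= min_price_usd


sandal_buyer = ScreenerAnswers(False, False, True, 25.0)
marathoner = ScreenerAnswers(True, True, True, 160.0)

for name, person in [("sandal buyer", sandal_buyer), ("marathoner", marathoner)]:
    print(name, loose_screener(person), premium_running_shoe_screener(person))
```

The sandal buyer passes the loose screener but fails the tighter one. That gap between the universe claimed and the universe actually sampled is what a cross-examiner looks for.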
One persistent confusion is worth clearing up at the outset, because it surfaces in nearly every survey fight: the format does not change the universe. Whether an expert uses the Eveready format or the Squirt format (both described below), a forward-confusion survey must still sample prospective purchasers of the junior user's goods. The format governs how respondents are shown the marks and what they are asked—not who should be asked. A Squirt survey run on the wrong universe is exactly as defective as an Eveready survey run on the wrong universe. The universe is logically prior to, and entirely independent of, the format choice. Treating it otherwise is one of the most common analytical errors litigants make.
The two confusion-survey formats
Trademark confusion surveys come in two principal formats, each suited to different circumstances, and choosing the right one for the mark and market at issue is itself a core methodological decision. Using the wrong format is a recognized vulnerability that a competent opponent will exploit.
The Eveready format: for strong, famous marks
The Eveready format takes its name from Union Carbide Corp. v. Ever-Ready, Inc., 531 F.2d 366 (7th Cir. 1976), where Union Carbide—owner of the famous EVEREADY battery brand—used a survey to show confusion with a junior user's "Ever-Ready" lamps. (The Seventh Circuit reversed a judgment for the defendant in part on the strength of that survey, and the format has carried its name ever since.) Its defining feature is what it withholds: respondents are shown only the junior user's product or mark—never the senior mark—and then asked open-ended questions. Who do you think puts out this product? What makes you say that? What other products do you think come from the same source? Do you think the company needs anyone's permission or approval to make this?
Because it never displays the senior mark, the Eveready format relies entirely on the respondent's pre-existing knowledge. If a meaningful share of respondents, shown only the junior product, spontaneously name or describe the senior user as the source, that is strong, non-leading evidence that the junior mark genuinely calls the senior to mind in the real world. The format's great virtue is that it minimizes suggestion: it does not put the answer in front of the respondent and wait for a nod. For that reason its results are highly credible, and many commentators and courts regard Eveready as the gold standard for confusion surveys.
Its limitation is the flip side of its virtue. Eveready works only when the senior mark is famous enough that consumers carry it around in their heads. For an obscure senior mark, respondents simply will not spontaneously name it—not because confusion is unlikely, but because they have never heard of the plaintiff. Run an Eveready survey on a little-known mark and you will measure something close to zero, even where real-world confusion is genuinely likely. That mismatch is a methodological own-goal, and a careful expert avoids it.
The Squirt format: for weaker marks or marks seen together
The Squirt format, named for SquirtCo v. Seven-Up Co., 628 F.2d 1086 (8th Cir. 1980)—a dispute between the soft-drink marks SQUIRT and QUIRST—is built for the opposite situation: less-famous marks, or markets where consumers genuinely encounter both marks together. Its defining feature is that it shows respondents both marks, senior and junior, typically in a sequence or array, and then asks whether the respondent believes the products come from the same source or are affiliated, connected, sponsored, or approved.
The Squirt format is the right tool when the senior mark is not strong enough for respondents to summon from memory, or when the real marketplace actually presents the two marks in proximity—products advertised in the same channels, displayed on the same shelf, or sold side by side. In those settings, refusing to show the senior mark would not replicate the marketplace; it would distort it.
But the Squirt format carries a built-in hazard: suggestion. The moment you place two marks in front of a respondent and ask whether they might be related, you have nudged the respondent to go looking for a relationship, and people are exquisitely good at finding patterns when invited to. A Squirt survey can manufacture confusion by juxtaposition that no shopper would ever experience in the wild. The standard countermeasures are (1) building a lineup or array that mixes the test marks with other, unrelated marks so the comparison is not a naked two-way face-off, (2) writing a scrupulously neutral same-source question, and (3) most important, running a robust control (more on this below). Courts scrutinize Squirt surveys precisely for whether the juxtaposition itself produced the confusion the survey claims to have found, and the Eastern District of New York has examined Squirt-format surveys with exactly this concern. See Cumberland Packing Corp. v. Monsanto Co., 32 F. Supp. 2d 561 (E.D.N.Y. 1999) (scrutinizing a Squirt-format survey's design and universe in a trade-dress dispute over SWEET'N LOW packaging and a competing sweetener).
Choosing between them
The choice between formats is a function of the senior mark's strength and the conditions of the marketplace. A strong, recognizable senior mark generally calls for Eveready, whose memory-reliant design yields the most credible results and sidesteps the suggestion problem. A weaker senior mark, or a market in which consumers really do meet both marks together, may call for Squirt—but only with the heightened controls that format demands. The mismatches are the traps: Eveready for a mark too obscure to be remembered will under-count confusion, and Squirt deployed against a famous mark in a needlessly suggestive way will over-count it. Either way, the opposing expert will pounce, and the careful expert chooses the format the way a golfer chooses a club—by the lie of the ball, not by habit.
A practical note for litigators reading the other side's report: the format choice is often the first thing worth interrogating. If the plaintiff's mark is plainly famous and its expert nonetheless chose Squirt, ask why the more conservative Eveready approach was passed over—the answer is sometimes that Eveready showed too little confusion to be useful.
Controls and net confusion: the most important number
If the universe is the most important design choice, the control is the most important interpretive one, because without it a reported confusion percentage is essentially uninterpretable. A control isolates the confusion attributable to the challenged element from the background noise that infects every survey: respondents guessing, respondents importing pre-existing assumptions, and artifacts of the survey instrument itself. Some baseline level of "confusion" appears in any survey, even when there is nothing to be confused about—people answer questions, and some of those answers will, by chance or habit, look like confusion.
A control works by running a second cell identical to the test cell in every respect except the allegedly infringing feature. If the test cell shows respondents the junior product bearing the challenged mark, the control cell shows them an otherwise-identical product bearing a different, plainly non-infringing mark (or with the challenged feature altered or removed). Whatever "confusion" the control cell registers is, by construction, noise—because there is nothing infringing there for respondents to be confused by. The legally meaningful figure is therefore the net confusion: the confusion rate in the test cell minus the confusion rate in the control cell.
The Reference Guide on Survey Research treats a well-designed control as the single most powerful tool for distinguishing genuine confusion from background noise, and McCarthy is to the same effect. A survey that reports a raw test-cell figure with no control overstates confusion—sometimes dramatically—because it counts noise as if it were genuine confusion caused by the defendant's mark.
Consider a worked illustration. Suppose a test cell reports 33% confusion. Impressive, until the control cell—identical product, made-up non-infringing name—reports 18%. The net confusion is 33% − 18% = 15%, a respectable but far more modest figure. The 18% was always going to appear no matter what name sat on the package; it tells you how much "confusion" the survey instrument itself generates. Now imagine the same survey with no control at all: the report would trumpet 33%, and a careless reader would believe it. This is why the first question a sophisticated reader asks of any confusion survey is not "what was the confusion rate?" but "what was the control, and what is the net?"
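The same arithmetic can be sketched in a few lines of Python. The respondent counts below are hypothetical, chosen only so that the cells reproduce the 33% and 18% figures in the illustration; the class and function names are likewise invented for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """One survey cell: qualified respondents and how many were coded as confused."""
    respondents: int
    confused: int

    @property
    def rate(self) -> float:
        if self.respondents <= 0:
            raise ValueError("a cell needs at least one respondent")
        return self.confused / self.respondents


def net_confusion(test: Cell, control: Cell) -> float:
    """Net confusion is the test-cell rate minus the control-cell rate."""
    return test.rate - control.rate


# Hypothetical counts: 66 of 200 is 33%, 36 of 200 is 18%.
test_cell = Cell(respondents=200, confused=66)
control_cell = Cell(respondents=200, confused=36)

print(f"test {test_cell.rate:.0%}, control {control_cell.rate:.0%}, "
      f"net {net_confusion(test_cell, control_cell):.0%}")
```

A report that supplies only the first of those three numbers has left out the two a reader most needs.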
Controls matter for both formats, but they are especially indispensable for Squirt surveys, where the juxtaposition of the two marks tends to inflate raw confusion. A Squirt survey reporting, say, 35% raw confusion may collapse to a far smaller—or even negligible—net figure once a proper control reveals how much of that 35% was juxtaposition-driven noise. A Squirt survey without a well-designed control is therefore one of the most exploitable artifacts in trademark litigation; an experienced cross-examiner can sometimes reduce its headline number to rubble by simply asking what the result would have been against a control that the expert never built.
Question design, demand effects, and marketplace realism
Beyond universe and controls, a survey's quality turns on the questions and the stimulus—how respondents are asked, and what they are shown.
Open-ended, non-leading questions
Sound question design favors open-ended, non-leading questions that elicit a respondent's genuine perception without supplying the answer. "Who do you think puts out this product?" invites a spontaneous response. "Do you think this product comes from the same company that makes [senior brand]?" supplies the connection and all but begs for a yes. Leading questions and demand effects—any cue, verbal or contextual, that telegraphs the desired answer—are among the most common and most damaging defects, because they generate confusion that exists only in the instrument, not in the marketplace. A survey is supposed to be a measurement, not a coaching session.
Good surveys also offer respondents a genuine "don't know / no opinion" option, so that honest uncertainty is not forced into a guess that later gets coded as confusion. They rotate the order of stimuli and answer choices to neutralize order effects (people disproportionately pick the first or last item). And they probe spontaneous answers neutrally—"What makes you say that?"—so that a respondent who names the senior brand for an irrelevant reason (because both products are green, say) is not miscounted as confused about source.
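One common way to implement rotation is to randomize the order independently for each respondent. The sketch below is an illustration under stated assumptions: it uses Python's standard random module, a hypothetical study label and respondent ID as the seed so the order each respondent saw can be reconstructed for audit, and invented placeholder brand names.

```python
import random


def presentation_order(items: list[str], respondent_id: int,
                       study_seed: str = "study-001") -> list[str]:
    """Return a per-respondent random ordering of stimuli or answer choices."""
    # Seeding with the study label and respondent ID makes the order
    # reproducible, so the report can document what each respondent saw.
    rng = random.Random(f"{study_seed}:{respondent_id}")
    order = list(items)  # copy so the master list is never mutated
    rng.shuffle(order)
    return order


answer_choices = ["Brand A", "Brand B", "Brand C", "Brand D"]  # placeholder names
for respondent_id in range(1, 4):
    print(respondent_id, presentation_order(answer_choices, respondent_id))
```

Across many respondents each item appears in each position about equally often, so any first-item or last-item preference is spread evenly across the choices instead of favoring one of them.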
Double-blind administration
A survey should be administered double-blind: neither the interviewer nor the respondent knows the survey's sponsor or its purpose. Single-blind is not enough. An interviewer who knows the "right" answer can, without any conscious bad faith, signal it through tone, pacing, or follow-up; a respondent who guesses the sponsor's identity may try to be helpful. Double-blind administration severs both channels of bias and is a hallmark of a serious survey. The absence of double-blind controls is a recurring point of attack, and rightly so—it is cheap to do correctly and conspicuous when skipped.
Marketplace realism
Equally important is marketplace realism: the stimulus shown to respondents should resemble how consumers actually encounter the marks. Real buyers see products with their full trade dress—house brands, logos, packaging, color, price, and point-of-sale context—not bare word marks floating on a white screen. A survey that strips away that context, or that stages an artificial side-by-side comparison no shopper would ever see, measures something other than real-world confusion and invites the criticism that it manufactured its result. The principle has special bite in trade dress cases, where the look and feel of packaging is the asserted mark and the stimulus must capture it faithfully. The closer the stimulus and the purchasing scenario come to genuine marketplace conditions, the more credible the survey—and the harder it is to dismiss as a laboratory artifact.
Objective coding
Finally, open-ended responses must be coded objectively—ideally by coders blind to the survey's purpose—rather than interpreted to favor the sponsor. Coding is where a lot of mischief hides: a respondent who says "I guess maybe it's related, I'm not really sure" can be coded as confused or as uncertain depending on who is holding the pen. Reliable surveys use a written coding protocol, blind coders, and an inter-coder reliability check, and they preserve the verbatim responses so the other side can audit the coding. A report that will not produce its verbatims is signaling something.
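An inter-coder reliability check can take several forms. One widely used statistic is Cohen's kappa, which compares how often two coders agree with how often they would agree by chance. The sketch below assumes two blind coders applying the same labels to the same verbatims; the labels and the sample codes are hypothetical, and nothing here implies that any particular kappa value is required by a court.

```python
from collections import Counter


def cohens_kappa(coder_a: list[str], coder_b: list[str]) -> float:
    """Chance-corrected agreement between two coders over the same responses."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must code the same non-empty set of responses")
    n = len(coder_a)
    # Observed agreement: share of responses given the same code by both coders.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected agreement if each coder assigned labels at their own base rates.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical codes for four verbatim responses.
coder_a = ["confused", "confused", "not confused", "not confused"]
coder_b = ["confused", "not confused", "not confused", "not confused"]
print(round(cohens_kappa(coder_a, coder_b), 2))
```

A kappa near 1 indicates the coding protocol is being applied consistently. A low value suggests the coding turned on who held the pen, which is exactly the concern the verbatims let the other side test.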
Sampling and sample size
A survey's credibility depends not only on whom it targets but also on how it draws its sample from that universe and how large that sample is.
The gold standard is probability sampling, in which every member of the relevant population has a known, non-zero chance of selection. Only probability samples permit valid statistical generalization from the sample to the universe with a calculable margin of error. In the real world, true probability sampling of consumers is difficult and expensive, and most modern trademark surveys use non-probability methods—online opt-in panels or mall-intercept interviews—that approximate representativeness. These methods are widely accepted when properly executed, but they carry trade-offs a competent expert must address candidly rather than paper over. Online panels raise questions about who joins panels and whether panelists mirror the relevant purchasers. Mall intercepts raise questions about geographic and demographic representativeness and about the conditions under which harried shoppers answer. The expert should disclose the sampling method, explain why it yields a sample representative of the universe, and report the limitations honestly. (For a flavor of how courts probe these methods on cross, see our companion piece on Daubert challenges to consumer survey experts.)
Sample size matters because it drives statistical precision. Too small a sample produces a confusion estimate with a margin of error so wide that the number is nearly meaningless—a reported 15% that could really be anywhere from 5% to 25% proves little. A respectable trademark survey interviews enough qualified respondents in each cell (test and control) to yield a usable margin of error, and the expert should report the cell sizes, the margin of error, and the confidence level (commonly 95%). A survey that buries a tiny sample, omits the margin of error, or fails to report the net figure invites the obvious inference: that the design was chosen to obscure rather than to measure. Conversely, a transparent report—clear universe, disclosed sampling method, adequate cell sizes, stated margins of error, and a netted result—signals rigor and is far more persuasive to judge and jury alike. Numbers presented with their uncertainty are trustworthy; numbers presented as if they were exact are not.
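The relationship between cell size and precision can be sketched with the textbook normal-approximation formula for a proportion. This is a simplification under loud assumptions: it treats each cell as a simple random sample, which opt-in panels and mall intercepts do not strictly satisfy, and the function names and sample sizes are invented for illustration.

```python
import math
from statistics import NormalDist


def z_value(confidence: float = 0.95) -> float:
    """Two-sided critical value for the given confidence level."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)


def margin_of_error(rate: float, n: int, confidence: float = 0.95) -> float:
    """Half-width of the approximate interval around one cell's confusion rate."""
    return z_value(confidence) * math.sqrt(rate * (1 - rate) / n)


def net_margin_of_error(test_rate: float, test_n: int,
                        control_rate: float, control_n: int,
                        confidence: float = 0.95) -> float:
    """Half-width for the difference between two independent cells (the net figure)."""
    variance = (test_rate * (1 - test_rate) / test_n
                + control_rate * (1 - control_rate) / control_n)
    return z_value(confidence) * math.sqrt(variance)


# Hypothetical cell sizes showing how the margin shrinks as the sample grows.
for n in (49, 100, 400):
    print(n, f"+/- {margin_of_error(0.15, n):.1%}")

print("net:", f"+/- {net_margin_of_error(0.33, 200, 0.18, 200):.1%}")
```

With only 49 respondents, the margin on a 15% estimate is roughly ten points, which is the 5%-to-25% spread described above. Note also that the margin on a net figure is wider than the margin on either cell alone, because it carries the uncertainty of both.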
Whether to commission a survey at all
Not every case warrants a survey, and the threshold decision to commission one is itself strategic. Surveys are expensive and slow to design, field, and defend—often a five- or six-figure undertaking with months of lead time—and a poorly conceived survey can do more harm than good. A bad survey hands the opponent a Daubert target and, worse, can produce a number that undercuts the very party that paid for it. A litigant should commission a survey only when (1) the issue genuinely turns on consumer perception, (2) the budget and schedule permit a rigorous design, and (3) a competent expert believes a defensible survey can actually be built for the marks and market at issue. If a credible expert cannot in good conscience design a survey that will find what you hope it finds, that is itself useful information about your case.
Importantly, a survey is not mandatory. Likelihood of confusion can be proven through the full range of evidence—the similarity of the marks, the proximity of the goods, channels of trade, the defendant's intent, and above all real-world evidence of actual confusion—without any survey at all. While some older authority flirted with drawing an adverse inference from a party's failure to produce a survey, the better and prevailing view is that no such inference is required: a party is free to prove confusion by other means, and the absence of a survey is not, standing alone, a concession that confusion is unlikely. (Survey evidence is often hardest to obtain for small businesses with limited budgets, which makes the other Polaroid factors and anecdotal evidence of actual confusion all the more important; see Navigating the maze of trademark confusion.)
That said, in a closely contested case a strong survey can be decisive, and its absence may leave a gap the other factors must fill. The calculus is practical: commission a survey when it can be done well and would meaningfully strengthen the case; rely on other evidence when a defensible survey is not feasible or not worth the risk. The blunt rule of thumb among trademark litigators is that a mediocre survey is often worse than no survey at all—because it gives the other side a number to attack and a victory to claim when it falls.
How courts evaluate methodology under Daubert and Rule 702
Because surveys are expert testimony, their methodology is tested through the gatekeeping framework we examine at length in our companion article on Daubert challenges to consumer survey experts. The essentials bear repeating here in the methodology context, because the design choices in this guide are precisely the criteria on which admissibility turns.
The Rules: 702 and 703
Two Federal Rules of Evidence do the heavy lifting. Federal Rule of Evidence 702 governs the admissibility of expert testimony: a qualified expert may testify if the testimony will help the trier of fact, is based on sufficient facts or data, is the product of reliable principles and methods, and reflects a reliable application of those methods to the facts of the case. The 2023 amendment to Rule 702 sharpened two points that matter enormously for surveys. First, it made explicit that the proponent must establish each of these requirements by a preponderance of the evidence—reliability is a threshold the proponent must clear, not a presumption the opponent must rebut. Second, it underscored that the expert's opinion must reflect a reliable application of the methodology, directing courts to scrutinize whether the conclusions actually follow from the data rather than waving every flaw through to the jury as "weight." The Advisory Committee was candid that the amendment responded to courts that had been too permissive.
Federal Rule of Evidence 703 governs the data an expert may rely on. It allows an expert to base an opinion on facts or data that need not themselves be admissible, if experts in the field would reasonably rely on such data. For survey experts, Rule 703 is the doorway through which the underlying survey responses—classic out-of-court statements—come into evidence: courts have long held that a properly conducted survey is the kind of data survey experts reasonably rely upon, which is also why a survey that departs from accepted methodology can be challenged as resting on data no competent expert would trust. Rule 703 thus loops back to methodology: reliance is reasonable only if the survey was sound.
The Daubert/Kumho gatekeeping standard
Under the Daubert trilogy—Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579 (1993), General Electric Co. v. Joiner, 522 U.S. 136 (1997), and Kumho Tire Co. v. Carmichael, 526 U.S. 137 (1999)—the trial judge serves as gatekeeper, admitting expert testimony only if it is reliable and "fits" the issues. Kumho Tire supplies the key move for surveys: the gatekeeping obligation extends beyond hard science to all expert testimony, including social-science evidence like surveys, but the yardstick of reliability is the body of accepted survey-research standards—the Reference Manual's criteria and the survey case law—rather than the literal laboratory factors (testability, error rate, peer review) that fit a chemistry experiment better than a consumer poll. A survey that conforms to accepted methodology is reliable; one that departs materially from it is not. This is why the Reference Guide on Survey Research functions, in practice, as the checklist a federal judge runs through.
Weight versus admissibility
Courts have long debated whether survey flaws go to weight (admit the survey, let the jury discount it) or admissibility (exclude it). The traditional default treated most methodological imperfections as weight—a generous posture rooted in the idea that cross-examination and competing experts could expose flaws for the jury. But the line was always that fundamental defects—wrong universe, leading questions, no control, an unrealistic stimulus—can render a survey so unreliable as to warrant exclusion or no weight. The 2023 Rule 702 amendment has pushed courts toward more genuine gatekeeping of serious defects, and away from the reflexive "goes to weight" disposition that once let badly flawed surveys reach juries.
The practical lesson for methodology is bracing: the design choices described in this guide are not academic niceties. They are the exact criteria on which admissibility and weight turn. A survey built in conformity with accepted practice will generally be admitted, with its remaining quibbles left to the jury. A survey with a foundational defect faces real exclusion risk—and even a survey that squeaks past the gate can be so battered on cross that its number is worthless. The Second Circuit and its district courts, including the Eastern District of New York, apply this framework with methodological sophistication, willing to admit sound surveys and to dissect and exclude fundamentally flawed ones. A flaw serious enough to exclude under Rule 702 will often also be excludable under Rule 403, whose balancing of probative value against unfair prejudice we address in Federal Rule of Evidence 403 and unfair prejudice—a misleading survey can confuse the jury and waste time, and judges sometimes reach for 403 as a backstop to 702.
What level of confusion is meaningful?
Once a survey is admitted, the question becomes what its number actually proves, and clients and lawyers alike want a threshold: how much confusion is "enough"? There is no bright-line rule, but the case law and treatises supply useful guideposts—subject to two essential caveats.
The first caveat is that the meaningful figure is the net confusion (test minus control), not the raw test-cell percentage. A raw figure unadjusted for noise overstates confusion and should be discounted accordingly; the netted figure is the one courts credit. Anyone quoting a raw number as if it were the result is either careless or hoping you are.
With that caveat, the general guideposts are these. Net confusion rates of roughly 15% and above are commonly treated as substantial evidence supporting a likelihood of confusion, with higher figures correspondingly more persuasive; numbers in the 20s and 30s are strong. Net rates below about 10% are frequently regarded as weak and are often cited as tending to negate a likelihood of confusion. The range in between—roughly 10% to 15%—is a genuine gray zone, where the figure's weight depends heavily on the survey's quality and the strength of the other Polaroid (or Sleekcraft, in the Ninth Circuit) factors. These are heuristics drawn from the body of decisions collected in the treatises, not statutory thresholds, and courts vary considerably; a methodologically pristine survey at the low end may carry more weight than a sloppy survey at the high end. Treat the percentages as a thermometer reading, not a verdict.
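For readers who think in code, the guideposts can be written as a small lookup. The cutoffs below simply restate the rough 10% and 15% heuristics from the paragraph above; they are not legal thresholds, the function name is invented, and the output says nothing about survey quality or the other factors.

```python
def guidepost(net_confusion: float) -> str:
    """Map a net confusion rate (as a fraction, e.g. 0.15) to a rough guidepost band."""
    if not -1.0 <= net_confusion <= 1.0:
        raise ValueError("net confusion must be a fraction between -1 and 1")
    if net_confusion >= 0.15:
        return "commonly treated as substantial"
    if net_confusion < 0.10:
        return "frequently regarded as weak"
    return "gray zone: weight depends on survey quality and the other factors"


# Hypothetical net figures, already netted against a control.
for net in (0.28, 0.12, 0.03):
    print(f"{net:.0%}: {guidepost(net)}")
```

The input must be the net figure. Feeding the function a raw test-cell percentage reproduces the error described in the first caveat.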
The second caveat is that survey confusion is only one of the likelihood-of-confusion factors, and courts weigh it alongside the others rather than treating any percentage as dispositive. Indeed, courts have rejected surveys—and the confusion levels they reported—where the methodology did not support the number. In Universal City Studios, Inc. v. Nintendo Co., 746 F.2d 112 (2d Cir. 1984)—the storied fight over whether Nintendo's "Donkey Kong" infringed Universal's "King Kong"—the Second Circuit scrutinized survey evidence offered to show confusion and found it unpersuasive, illustrating that a reported percentage means nothing if the methodology behind it does not hold up. The number is only as meaningful as the survey that produced it. A high confusion figure from a flawed instrument is worth less than a modest figure from a rigorous one, and a good cross-examiner will spend more time on how the number was generated than on the number itself.
Surveys for other issues: secondary meaning, fame, and genericness
Although likelihood of confusion is the most common subject of trademark surveys, the same methodological discipline governs surveys offered on other issues, and a brief map is useful—because each issue has its own recognized formats but the same underlying rules of rigor.
Secondary meaning
Secondary meaning surveys test whether a descriptive term has come to identify a single source in the minds of the relevant public—the showing required for a descriptive mark to be protectable. Such surveys typically ask whether respondents associate the term with one company or many, or whether they recognize it as a brand, and they rise or fall on the same universe, control, and question-design principles as confusion surveys. A secondary-meaning survey that fails to control for respondents who would associate any term with a single source proves little.
Fame
Fame surveys, offered in dilution cases under the Trademark Dilution Revision Act, measure the breadth of public recognition of a mark. Because dilution protects only marks that are famous to the general consuming public of the United States, the universe for a fame survey is broad—the general public—not merely the purchasers of a niche product. A fame survey that samples only a specialized buyer pool has measured the wrong universe for the legal question, a recurring error that mirrors the universe problems discussed above.
Genericness: the Teflon and Thermos formats
Genericness surveys—offered to prove or disprove that a term has become the common name for a product (the fate that befell "escalator," "aspirin," and "cellophane")—use two specialized, court-recognized formats. (We treat the underlying doctrine in Trademark genericness and genericide.)
The Teflon format takes its name from E. I. DuPont de Nemours & Co. v. Yoshida International, Inc., 393 F. Supp. 502 (E.D.N.Y. 1975), where DuPont defended its TEFLON mark. It is, in essence, a mini-quiz. The interviewer first gives respondents a short tutorial on the difference between a brand name (the name of a product made by one company, like "Chevrolet") and a common name (the name for a kind of product, like "automobile"), often testing comprehension with practice items. Then respondents are asked to classify a series of terms as one or the other; the term at issue is salted among genuine brand names and genuine common names. The result measures whether the public treats the term as a source identifier or a generic category name. Courts and the TTAB generally prefer the Teflon format because its structured classification task tends to produce cleaner, more interpretable data than the open-ended alternative.
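How the classification answers for a single term are tallied can be sketched as follows. The response counts are invented for illustration, and the function name is hypothetical; a real report would also present the same tallies for the genuine brand names and common names in the list, so the reader can see whether respondents understood the task.

```python
from collections import Counter


def teflon_shares(answers: list[str]) -> dict[str, float]:
    """Percentage of respondents giving each classification for one term."""
    counts = Counter(answers)
    total = len(answers)
    return {label: 100.0 * n / total for label, n in counts.items()}


# Hypothetical answers for the term at issue (100 respondents).
answers = ["brand"] * 68 + ["common"] * 25 + ["don't know"] * 7
shares = teflon_shares(answers)
print(shares)
```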
The Thermos format, named for the litigation that turned THERMOS into a generic term—American Thermos Products Co. v. Aladdin Industries, Inc., 207 F. Supp. 9 (D. Conn. 1962), aff'd sub nom. King-Seeley Thermos Co. v. Aladdin Industries, Inc., 321 F.2d 577 (2d Cir. 1963)—takes the opposite tack. Rather than asking respondents to classify a term, it asks them what they would call the product: "If you were going to buy one of these, what would you ask for?" If respondents spontaneously use the term as the name of the product ("I'd ask for a thermos"), that is evidence of generic use. The Thermos format has the virtue of capturing spontaneous, real-world language, but it is harder to design well and more prone to ambiguity, which is why courts often prefer Teflon when both are available. Each format must satisfy the same foundational requirements—proper universe, neutral questions, sound execution, and objective coding—as any confusion survey.
The unifying point is that survey methodology is a single discipline applied to different questions. Whatever the issue, the survey must define the right universe, ask neutral questions, control for noise where applicable, replicate real conditions, and be executed and analyzed objectively. A survey that honors those principles is credible regardless of the issue; one that violates them is vulnerable regardless of the issue.
Reading a survey report critically: a checklist that mirrors the design
Whether you are evaluating your own expert's work or dissecting the opponent's, a survey report should be read in a consistent order that tracks the design choices that matter most. Reading in this order turns a dense expert report into a structured assessment and points straight at the soft spots.
- Universe. How is the relevant population defined? Do the screening questions actually capture it? Is it the right universe for the theory of confusion—junior user's purchasers for forward confusion, senior user's for reverse? This is where the most fatal flaws hide.
- Format and stimulus. Is the chosen format (Eveready or Squirt) appropriate to the strength of the mark? Does the stimulus replicate how consumers really encounter the marks, or is it an artificial side-by-side that no shopper would ever see?
- Questions. Are they open-ended and neutral, or do they lead the respondent toward a connection? Is there a genuine "don't know" option? Were spontaneous answers neutrally probed for why?
- Control and net. Is there a control? Is it designed correctly—identical but for the challenged feature? And, decisively, what is the net confusion after subtracting it?
- Administration. Was the survey double-blind? Who fielded it, and did they know the sponsor or the answer the sponsor wanted?
- Sample. How was it drawn? Is it representative? Is it large enough? Are the cell sizes, margin of error, and confidence level reported?
- Coding and analysis. Were open-ended responses coded objectively, ideally by blind coders against a written protocol? Are the verbatims preserved and produced? Does the reported conclusion actually follow from the data?
A report that answers each of these transparently and favorably is a strong one. A report that is silent on the control, vague about the universe, or coy about sample size is, by its silences, telling you exactly where its weaknesses lie. The questions a report refuses to answer are usually the questions you should press hardest at deposition.
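On the sample item in particular, a reader can sanity-check a reported margin of error with the standard normal approximation for a proportion. The sketch below assumes a simple random sample, which many survey panels are not, and uses a hypothetical cell of 200 respondents; treat the result as a rough cross-check rather than the figure the expert should have reported.

```python
import math


def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Approximate margin of error for a proportion p observed in n respondents.

    Normal approximation; z = 1.96 corresponds to roughly 95% confidence.
    """
    return z * math.sqrt(p * (1.0 - p) / n)


# Hypothetical cell: 27% of 200 respondents gave a "confused" answer.
moe = margin_of_error(p=0.27, n=200)
print(f"27% plus or minus {100 * moe:.1f} points")
```

Because the margin shrinks only with the square root of the cell size, a report that omits cell sizes leaves the reader unable to judge how firm its percentages are.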
A worked example
The following dispute is hypothetical, invented to show how methodology drives the result.
"Lumière," a long-established and heavily advertised maker of premium kitchen knives, sues "Lumina," a newer company selling kitchen knives under a similar name, for trademark infringement. Each side commissions a confusion survey, and the two reach opposite conclusions—not because consumers disagree, but because the surveys were built differently.
Lumière's expert recognizes that LUMIÈRE is a strong, well-known mark and chooses the Eveready format. The universe is prospective purchasers of Lumina's (the junior user's) knives, qualified by screeners confirming that respondents buy or plan to buy premium kitchen knives in the relevant price range. Respondents see only Lumina's product, with its actual packaging, and answer open-ended questions: who puts out this product, and what makes you say so? A meaningful share spontaneously name or describe Lumière as the source. A control cell, showing an otherwise-identical knife under a plainly different, invented name (say, "Solano"), registers low background noise of about 5%. The test cell comes in at 27%. The net confusion is 27% − 5% = 22%. Because the format is memory-reliant and non-suggestive, the universe is correct, the administration is double-blind, and the figure is netted against a real control, the survey is methodologically sound, likely admissible, and persuasive—a credible 22% net.
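The netting step in this example, together with a rough confidence interval around the net figure, can be sketched as below. The 27% and 5% rates come from the hypothetical above; the cell sizes of 200 respondents each are an added assumption, as is the use of the normal approximation for the difference between two independent proportions.

```python
import math


def net_with_interval(p_test, n_test, p_control, n_control, z=1.96):
    """Net confusion (test minus control) with an approximate interval.

    Treats the two cells as independent samples; z = 1.96 is roughly 95%.
    """
    net = p_test - p_control
    se = math.sqrt(
        p_test * (1.0 - p_test) / n_test
        + p_control * (1.0 - p_control) / n_control
    )
    return net, net - z * se, net + z * se


# Rates from the worked example; cell sizes are assumed for illustration.
net, low, high = net_with_interval(0.27, 200, 0.05, 200)
print(f"net {100 * net:.0f}%, interval {100 * low:.1f}% to {100 * high:.1f}%")
```

Under these assumptions even the low end of the interval sits at about the 15% guidepost, which is part of why a figure like this reads as persuasive rather than marginal.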
Lumina's expert, by contrast, builds a survey designed (whether by intention or by error) to find little confusion. It uses an awkward variant that shows respondents only Lumière's product and asks whether they recognize the brand—measuring brand recognition, not confusion about Lumina's source—and it samples a broad general-public universe rather than prospective premium-knife purchasers. It uses no control. Unsurprisingly, it reports negligible "confusion." On a Daubert challenge, this survey is exposed on three independent grounds: the wrong universe (general public, and arguably the wrong product shown), a question that does not test source confusion at all, and no control. It measures the wrong thing with the wrong tool and is a strong candidate for exclusion or no weight. Lumière's survey, meanwhile, sails through. The dispute was a coin flip on the merits; the methodology decided it.
Now change one fact. Suppose the marks were weak rather than strong—two obscure regional brands consumers could not summon from memory. The Eveready format would now be inappropriate, because respondents cannot spontaneously name a senior brand they have never heard of, and an Eveready survey would show near-zero confusion even if confusion were genuinely likely. The right tool becomes a properly designed Squirt survey: sample prospective purchasers of the junior product, show both marks in a non-suggestive array salted with other marks, ask a neutral same-source question, and—critically—run a robust control to strip out the juxtaposition noise the Squirt format invites. Done this way, a Squirt survey yields a credible net figure. The very same Squirt survey without a control, with a leading same-source question, on a general-public universe, would inflate confusion through suggestion and collapse under scrutiny. The dispute is identical; the methodology determines the answer. That is the whole lesson of this guide in one paragraph.
Frequently asked questions
Is a consumer survey required to win a trademark infringement case? No. Likelihood of confusion can be proven through the similarity of the marks, the proximity of the goods, the channels of trade, the defendant's intent, and—often most powerfully—real-world evidence of actual confusion, all without a survey. The better view is that no adverse inference arises merely because a party did not commission a survey. But in a close case, a strong survey can be decisive, and its absence can leave a gap.
What is "net confusion," and why does everyone keep mentioning it? Net confusion is the confusion rate in the test cell minus the confusion rate in the control cell. The control measures the background "noise" that any survey generates even when there is nothing infringing to be confused about; subtracting it isolates the confusion actually caused by the defendant's mark. A raw test-cell figure with no control overstates confusion and is, for legal purposes, close to meaningless. Always ask for the net.
When should I use Eveready versus Squirt? Use Eveready—which shows only the junior mark and asks open-ended questions—when the senior mark is strong and famous enough that consumers carry it in their heads; it is the gold standard because it minimizes suggestion. Use Squirt—which shows both marks—when the senior mark is weaker (so respondents cannot summon it from memory) or when consumers genuinely encounter both marks together in the marketplace. Squirt demands especially careful controls because juxtaposing the marks can manufacture confusion.
How much confusion is "enough"? There is no bright line, but as a rough heuristic, net confusion of roughly 15% and above is commonly treated as substantial support for a likelihood of confusion, below about 10% as weak (and often as negating confusion), and 10–15% as a gray zone whose weight depends on survey quality and the other factors. These are guideposts from the case law, not statutory thresholds, and a rigorous low-end survey can outweigh a sloppy high-end one.
What is the single most common fatal flaw in a trademark survey? The wrong universe. A survey that asks the wrong population has no probative answer to give and usually cannot be rehabilitated by cross-examination or argument. For forward confusion, the universe is prospective purchasers of the junior (defendant's) goods; for reverse confusion, it flips to purchasers of the senior (plaintiff's) goods.
Does the 2023 amendment to Rule 702 change how courts treat surveys? Yes, at the margins but meaningfully. It makes explicit that the proponent must establish reliability by a preponderance of the evidence and that the expert's opinion must reflect a reliable application of the methodology. In practice it has nudged courts away from reflexively admitting flawed surveys as going merely to "weight" and toward genuine gatekeeping of fundamental defects—wrong universe, no control, leading questions, unrealistic stimulus.
What is the difference between a Teflon and a Thermos survey? Both test genericness. A Teflon survey gives respondents a tutorial on brand names versus common names and then asks them to classify a list of terms; courts generally prefer it for producing cleaner data. A Thermos survey asks respondents what they would call the product, capturing spontaneous generic use. Each must satisfy the same universe, neutrality, and execution standards as any other survey.
Practical takeaways
For the proponent building a survey (offense), methodology is destiny, and it must be right from the first decision. Define the universe correctly for the theory of confusion—prospective purchasers of the junior user's goods for forward confusion, the senior user's for reverse—and operationalize it with careful screeners. Choose the format that fits the mark: Eveready for strong, memorable marks; Squirt, with rigorous controls, for weaker marks or marks genuinely encountered together. Ask open-ended, non-leading questions, offer a real "don't know" option, rotate to neutralize order effects, and administer double-blind. Replicate marketplace conditions in the stimulus. Always run a control and report the net confusion, not the raw figure. Retain an experienced survey expert early, document the design rationale in writing, preserve and produce the verbatims, and build the survey to withstand a Daubert challenge from day one—because a survey designed defensively is a survey that survives.
For the challenger attacking a survey (defense), the methodology is your map to the flaw. Start with the universe: was the right population sampled, and were the screeners sound? Examine the format: was it appropriate for the strength of the mark, and did it avoid (for Squirt) needless suggestion? Find the control: is there one, and what is the net confusion after subtracting it? Scrutinize the questions for leading language and demand effects, the administration for double-blinding, the stimulus for marketplace realism, and the sampling and coding for representativeness and objectivity. Frame the fundamental defects—wrong universe, no control, leading questions, artificial stimulus—as reliability and fit failures under amended Rule 702 (with Rule 403 as a backstop), and reserve the minor quibbles for cross-examination so you do not dilute your best points. A focused attack on a foundational flaw is what wins exclusion or strips a survey of weight; a scattershot list of nitpicks rarely does.
For both sides, the unifying truth is that a trademark survey is a scientific instrument whose value lives entirely in its methodology. The universe must be right, the format must fit, the questions must be neutral, the administration must be blind, the conditions must be realistic, and the confusion must be netted against a control before any percentage means anything at all. Courts in the Second Circuit, including the Eastern District of New York, read surveys with exactly this discipline. Build a survey that honors these principles and it becomes a decisive asset; build one that violates them and it becomes a decisive liability. Either way, the side that understands the methodology controls the outcome.
Related Articles
- Daubert Challenges to Consumer Survey Experts in Trademark Litigation — the admissibility gateway through which survey methodology is tested.
- Polaroid Factors on Summary Judgment in the Second Circuit — how survey confusion evidence feeds the likelihood-of-confusion analysis.
- Navigating the Maze of Trademark Confusion: Key Considerations for Brand Owners — the broader confusion landscape into which surveys fit.
- Federal Rule of Evidence 403 and Unfair Prejudice — the balancing test that can exclude a misleading survey even when Rule 702 is a close call.
- Reverse Confusion in Trademark Law — the theory that flips the proper survey universe.
- Trademark Genericness and Genericide — where Teflon and Thermos genericness surveys come into play.
- Secondary Meaning and Acquired Distinctiveness — the issue that secondary-meaning surveys are designed to prove.
- The Intricate World of Trade Dress Protection — where marketplace-realistic stimuli matter most.
This article is provided for general informational purposes and does not constitute legal advice. Survey design and admissibility are fact-specific and depend on the marks, market, and theory of confusion at issue; consult qualified trademark litigation counsel and a qualified survey expert about any particular matter.