8 min read
False positives in email A/B tests: why half of winning tests don't actually win
An A/B test flags a winner with p<0.05. The lift looks real. The team ships it. But the test was one of many, the effect doesn't replicate, and the "win" quietly underperforms over the next quarter. This is the false-positive trap, and it's everywhere in email testing. At p=0.05, the statistics guarantee roughly one fake winner for every twenty tests with no real effect — and most programs don't even track how many tests they've run. Here's how to prevent it.

By Justin Williames
Founder, Orbit · 10+ years in lifecycle marketing
The fake winner problem — why noise looks like a win
Picture the lifecycle lead on a Monday standup. "Subject line test came back significant — variant B lifted opens 6%, p=0.04. Shipping it as the new template default." Nods around the room. The variant goes live. Three months later, opens haven't moved. Nobody checks. The "learning" sits in the playbook.
That story is the false-positive trap, and the maths makes it inevitable. The significance threshold p=0.05 — the number marketing platforms paint green when a test "wins" — means "the observed result would happen by chance less than 5% of the time if there were no real effect". Translated: 1 in 20 tests will show a significant result from pure noise, even when the variants are literally identical.
Run 20 subject line tests in a year. Statistically, expect 1 false winner even if none of them had any real effect. Run 40, expect 2. Most programs don't track their test count or the distribution of p-values they've seen, so the fakes quietly accumulate in the team's "proven best practices" folder until somebody runs an audit and finds that nothing has actually compounded.
A program running 50 A/B tests a year at p=0.05 should expect 2–3 false-positive winners. A program that ships every winner will have 2–3 imaginary improvements in its learnings and won't know which ones.
The thing worth internalising: p=0.05 is not "we're 95% sure the effect is real". It's a cap on the false-positive rate per test — a ceiling on how often pure noise crosses the line when there is no real effect, nothing more. Once that distinction lands, most of the rest of this guide becomes obvious.
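To see what that cap means in practice, here's a minimal A/A simulation: both "variants" share the same true open rate, yet roughly 5% of tests still come back significant at p<0.05. The open rate, sample size, and test count are illustrative assumptions, not figures from any real program.

```python
# A/A simulation: identical variants, yet ~5% of tests "win" at p < 0.05.
# All numbers below are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
true_rate = 0.22      # the same underlying open rate for A and B
n_per_arm = 10_000    # recipients per variant
n_tests = 1_000       # repeat the test many times to see the long-run rate

false_wins = 0
for _ in range(n_tests):
    opens_a = rng.binomial(n_per_arm, true_rate)
    opens_b = rng.binomial(n_per_arm, true_rate)
    # two-proportion z-test with a pooled standard error
    p_pool = (opens_a + opens_b) / (2 * n_per_arm)
    se = np.sqrt(2 * p_pool * (1 - p_pool) / n_per_arm)
    z = (opens_b - opens_a) / n_per_arm / se
    if 2 * stats.norm.sf(abs(z)) < 0.05:
        false_wins += 1

print(f"'significant' results from identical variants: {false_wins / n_tests:.1%}")
# expect something close to 5%, i.e. about one fake winner per 20 tests
```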
Three ways your tests lie to you
Three habits inflate the false-positive rate well above the 5% the textbook promises. Each one is common. Each one is fixable. Two of them are the default behaviour of mainstream testing tools, which is why they show up everywhere.
1. Stopping early. Running a test, peeking at the dashboard each morning, calling it the moment it crosses significance. Every peek is another roll of the dice — and the more rolls, the more chances noise has to sneak above the line. A test designed for n=20,000 (the planned sample size) but stopped at n=8,000 "because it looked significant" can have a real false-positive rate of 15–25% depending on how many peeks happened. That's not 5% any more. That's a quarter of your "wins" being noise. (The first sketch after this list simulates exactly this habit.)
2. Multiple comparisons without correction. Testing subject + preheader + CTA + send time in the same test and reporting whichever metric crossed p<0.05. Each comparison gets its own 5% chance of a false positive — stack four of them, and the family-wise rate (the chance that at least one of them is a fake) climbs to about 18%. The crude fix is Bonferroni correction: divide your significance threshold by the number of comparisons. It over-corrects, but over-correcting beats no correction, which is what most programs have. (The second sketch after this list walks through the arithmetic.)
3. Re-running until significant. "This test didn't quite hit significance — let's run it again with more users." Keep extending the runtime and you'll get significance eventually, from noise rather than a real effect. Pre-commit to a sample size and stick to it. The sample size guide covers how to pick the number up front so you don't have this conversation after the fact.
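Here's a rough simulation of the peeking habit, under assumed numbers (a 22% true open rate, 1,500 recipients per arm per day, 14 daily checks). The variants are identical, yet stopping at the first significant peek pushes the false-positive rate to several times the nominal 5%.

```python
# Peeking simulation: an A/A test checked every "day" and stopped the first
# time p < 0.05. Rates, daily volume, and duration are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
true_rate, daily_n, days, n_tests = 0.22, 1_500, 14, 2_000

def p_value(opens_a, opens_b, n_per_arm):
    p_pool = (opens_a + opens_b) / (2 * n_per_arm)
    se = np.sqrt(2 * p_pool * (1 - p_pool) / n_per_arm)
    z = (opens_b - opens_a) / n_per_arm / se
    return 2 * stats.norm.sf(abs(z))

stopped_on_noise = 0
for _ in range(n_tests):
    a = b = n = 0
    for _day in range(days):
        a += rng.binomial(daily_n, true_rate)
        b += rng.binomial(daily_n, true_rate)
        n += daily_n
        if p_value(a, b, n) < 0.05:   # the morning peek
            stopped_on_noise += 1
            break

print(f"false-positive rate with daily peeks: {stopped_on_noise / n_tests:.1%}")
# typically around 20%, versus ~5% when you look once at the full sample size
```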
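And the multiple-comparisons arithmetic, as a few lines you can check yourself: the family-wise rate for k independent comparisons at threshold alpha is 1 - (1 - alpha)^k, and Bonferroni simply divides alpha by k.

```python
# Family-wise false-positive rate for k independent comparisons at alpha,
# plus the Bonferroni-adjusted per-comparison threshold.
alpha = 0.05
for k in (1, 4, 5, 10):
    family_wise = 1 - (1 - alpha) ** k
    bonferroni = alpha / k
    print(f"{k:>2} comparisons: chance of at least one fake win = {family_wise:.1%}, "
          f"Bonferroni threshold = {bonferroni:.4f}")
# 4 comparisons -> ~18.5%, 5 -> ~22.6%, 10 -> ~40%
```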
How to spot a fake before you ship it
Three sanity checks worth running on every "winner" before it goes into the playbook. Each one catches a different flavour of fake.
1. The effect-size sniff test. A subject line test showing a 25% open-rate lift? Be suspicious. Real subject line effects rarely clear 10–15%. Huge effects are more often noise than breakthroughs — small but replicable effects are far more trustworthy than large one-shot ones. Type-realistic ranges worth committing to memory: subject lines 2–8%, content 3–10%, send time 1–5%. Anything well above these ranges is usually noise or a methodology issue — small sample, MPP inflation (Apple Mail Privacy Protection, which auto-opens emails and pollutes the open-rate signal), peeking, or all three.
2. Replicate before banking. A real winner should hold up in the next send. Re-run the winning variant against a fresh control on the next campaign. If the effect disappears, it was noise. Programs that replicate every significant test before declaring a learning end up with a far cleaner playbook than programs that run each test once and move on.
3. Holdout validation. The portfolio-level check. Run a small (5–10%) holdout — a randomly-selected slice of your audience that receives none of your claimed winning treatments. If the sent group outperforms the holdout meaningfully on the metric you actually care about (revenue, retention, conversion), your collected wins are probably real. If the gap is small, many of your wins are noise and your "learnings" document is partially fiction.
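A minimal sketch of that portfolio-level check, with stand-in data: compare revenue per recipient between the treated audience and the holdout, then ask whether the gap is both meaningful and statistically distinguishable. The synthetic revenue draws below are purely illustrative; swap in your own per-recipient export.

```python
# Holdout check sketch: treated audience vs a ~10% random holdout on revenue
# per recipient. The data here is synthetic stand-in data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
treated_rev = rng.exponential(scale=1.8, size=90_000)   # revenue per treated user
holdout_rev = rng.exponential(scale=1.7, size=10_000)   # revenue per holdout user

lift = treated_rev.mean() / holdout_rev.mean() - 1
t_stat, p_val = stats.ttest_ind(treated_rev, holdout_rev, equal_var=False)

print(f"revenue per recipient: treated {treated_rev.mean():.2f} "
      f"vs holdout {holdout_rev.mean():.2f} ({lift:+.1%}), p = {p_val:.4f}")
# A meaningful, significant gap suggests the accumulated wins are real;
# a negligible gap suggests many of them were noise.
```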
Habits that stop fakes getting in
Detection catches false positives after the fact. Prevention stops them entering the program at all. Four practices, in rough order of impact.
1. Pre-register every test. Before running, write down: hypothesis, primary metric, sample size, stop criterion, what "winning" means. This kills the "motivated reasoning" move where, after the test, you highlight whichever metric happened to win. A shared doc or a basic test-registry tool is enough. It doesn't have to be fancy. It has to be written down before you see the data. (A minimal template follows this list.)
2. Tighter significance threshold. p=0.01 instead of p=0.05 cuts the false-positive rate roughly 5x. Trade-off: you need samples about 1.5x larger. For high-volume programs running many tests a year, the tighter threshold pays for itself in avoided false winners and a cleaner playbook. For lower-volume programs, p=0.05 is fine — as long as you replicate any winner before treating it as established. (The sample-size sketch after this list shows the trade-off.)
3. Primary metric only. Pick one metric before the test — usually click rate or revenue per recipient. Don't report wins on secondary metrics. If the primary didn't win, the test is a null result, full stop. "Primary didn't win but secondary did" is how false positives enter the learnings document and never leave.
4. Sequential testing, done correctly. If you genuinely must peek, use a platform with built-in sequential analysis — the kind of statistics that adjusts p-values for repeated checks (Optimizely's sequential stats, VWO's SmartStats). Don't roll your own correction. The chances of getting it right are low and the consequences of getting it wrong look exactly like "I shipped a winner".
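What a pre-registration entry can look like in practice. A shared doc works just as well; every field name and value here is only a suggested template, not a prescribed schema.

```python
# A minimal pre-registration record, written down before the test launches.
# Field names and values are a suggested template only.
from datetime import date

test_registry_entry = {
    "test_id": "subject-urgency-01",               # hypothetical test id
    "registered_on": date.today().isoformat(),
    "hypothesis": "Urgency framing in the subject lifts clicks vs control",
    "primary_metric": "click rate",                # the only metric that can win
    "baseline_rate": 0.032,
    "mde": 0.10,                                   # +10% relative lift we care about
    "alpha": 0.05,
    "sample_size_per_arm": 50_000,                 # pre-computed, not eyeballed
    "stop_rule": "analyse once, at the full pre-computed sample size",
    "decision_rule": "ship the variant only if the primary metric wins at alpha",
}
```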
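And the sample-size trade-off from habit 2, using the standard two-proportion approximation with an assumed 3.2% baseline click rate and a +10% relative MDE: tightening the threshold from p=0.05 to p=0.01 needs roughly 1.5x the recipients per arm.

```python
# How much larger the sample gets when alpha tightens from 0.05 to 0.01
# (two-sided, 80% power). Baseline rate and MDE are illustrative assumptions.
from scipy import stats

def n_per_arm(baseline, rel_lift, alpha, power=0.80):
    """Standard two-proportion sample size, normal approximation."""
    p2 = baseline * (1 + rel_lift)
    z = stats.norm.ppf(1 - alpha / 2) + stats.norm.ppf(power)
    variance = baseline * (1 - baseline) + p2 * (1 - p2)
    return z ** 2 * variance / (p2 - baseline) ** 2

loose = n_per_arm(0.032, 0.10, alpha=0.05)
tight = n_per_arm(0.032, 0.10, alpha=0.01)
print(f"per arm at p<0.05: {loose:,.0f}")
print(f"per arm at p<0.01: {tight:,.0f}  (about {tight / loose:.1f}x larger)")
```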
When something feels off — what to do
A test shows a surprisingly large effect. Or the same variant type has won three times in a row. The instinct is to bank the learning. The discipline is to slow down for forty-eight hours.
1. Re-run the test with a fresh audience split. See if it replicates at a similar effect size.
2. Sanity-check the effect against type-realistic ranges. A subject line showing 20% lift is suspicious. 4% is believable.
3. If it replicates at a similar size, it's probably real. If it disappears or collapses, the original was noise.
4. Update the learnings accordingly. False positives are especially expensive when they calcify into "known best practice" — roll those back aggressively when discovered, even when someone on the team would prefer you didn't.
And null results — tests that come back "no significant difference" — are informative, not failures. A null result tells you that any real effect is probably smaller than the test's MDE (the minimum detectable effect: the smallest lift the test was statistically powered to find). That's real information. Run a retrospective power calculation, publish null results internally alongside the significant ones, and make it normal for the team to say "the data didn't show what we expected and that's fine".
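A quick way to put a number on "smaller than the MDE" after the fact, under the usual normal approximation. The baseline rate and sample size below are assumptions for illustration; plug in what the test actually ran.

```python
# Retrospective MDE sketch: given the sample a null test actually used, what
# relative lift was it powered to detect? Inputs are illustrative assumptions.
from math import sqrt
from scipy import stats

def detectable_lift(baseline, n_per_arm, alpha=0.05, power=0.80):
    """Approximate minimum detectable relative lift, normal approximation."""
    z = stats.norm.ppf(1 - alpha / 2) + stats.norm.ppf(power)
    abs_diff = z * sqrt(2 * baseline * (1 - baseline) / n_per_arm)
    return abs_diff / baseline

mde = detectable_lift(baseline=0.032, n_per_arm=15_000)
print(f"this null result only rules out lifts above roughly {mde:.0%}")
# if realistic effects for the test type are 2-8%, the test couldn't have seen them
```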
One specific edge case worth naming: Apple MPP. It doesn't directly inflate false positives, but it adds so much noise to open-rate measurement that open-rate tests are unreliable by construction. Real effects get harder to detect; spurious effects from machine-open variation can look like real wins. For any test that matters, use click-through rate as the primary metric and dodge the MPP noise entirely.
The Orbit skill behind this guide builds pre-registration, sequential-testing practice, and replication into the default workflow. The goal isn't to run fewer tests. It's to extract more real signal from the tests you run — which, at scale, is the difference between a program that compounds learning and one that compounds fiction.
Frequently asked questions
- What is a false positive in A/B testing?
- A false positive is an A/B test declared significant when no real effect exists — the apparent difference was noise. At 95% confidence, 5% of tests of truly identical variants will still produce significant results by chance. Every falsely declared winner rolls out a change that doesn't actually work, and over time those phantom improvements compound into a playbook that no longer reflects reality.
- How do I prevent false positives in A/B tests?
- Four disciplines. (1) Pre-declare a single primary metric before running — no post-hoc "which metric shows significance?" (2) Pre-compute required sample size from baseline rate and MDE — don't eyeball it. (3) Don't peek and stop early; run to the pre-computed sample size regardless of what intermediate results show. (4) For multiple variants against one control, apply Bonferroni correction to the significance threshold (divide alpha by number of comparisons).
- What's the peeking problem in A/B testing?
- Peeking is checking test results mid-run and stopping as soon as significance first appears. Because significance is probabilistic — a truly null test has random fluctuations that occasionally cross p<0.05 — repeated peeking inflates the false-positive rate. A test that would normally have a 5% false-positive rate can exceed 20% with daily peeks over 14 days. Commit to the sample size; look once at the end.
- Why does testing multiple metrics increase false positives?
- Because each independent comparison has its own 5% false-positive rate. Running 5 comparisons at p<0.05 produces an effective family-wise false-positive rate of ~23% (1 − 0.95^5). Testing 10 metrics produces ~40%. The fix: Bonferroni correction (divide alpha by number of comparisons), or Benjamini-Hochberg FDR for less conservative multi-testing, or the discipline to pre-declare one primary metric and ignore the rest.
- Is a 95% confidence level enough for email marketing decisions?
- Usually yes. 95% confidence (p<0.05) is the operator standard; it corresponds to a 5% false-positive rate, which over many decisions averages out to one false win per 20 tests. For high-stakes launches (a new pricing change, a major template redesign), bumping to 99% confidence (p<0.01) is warranted because the cost of rolling out a false winner is larger. For routine subject-line and copy tests, 95% is fine.
This guide is backed by an Orbit skill
Related guides
Sample size: the calculation everyone gets wrong in email A/B tests
Most email A/B tests are powered to detect effects far larger than the test could actually produce. The result: false positives and false nulls, with confident conclusions in both directions. Sample size calculation fixes this before you send. Here's the 5-minute version.
Price-testing through email: what's testable, what isn't
Email is the fastest place to try a new price, and the easiest place to learn the wrong lesson. What you can test cleanly, what you can't, and the measurement traps that quietly turn price tests into expensive false positives.
A/B testing in email: sample size, novelty, and what to report
Most email A/B tests produce winners that don't reproduce. Three reasons keep showing up: under-powered samples, the novelty effect, and weak readout discipline. This guide is about designing tests that actually drive decisions instead of theatre.
Send-time optimisation: what it really moves, and what it doesn't
Every ESP markets an STO feature and every vendor deck shows lift. The honest version: STO moves open rate 3–8%, rarely revenue, and only for certain program types. Here's when it's worth turning on.
Incrementality testing: the measurement that tells you if a program actually works
Last-click attribution makes lifecycle look bigger than it is. Incrementality testing strips out users who would have converted anyway and surfaces the real number. This is how to design a test that produces a figure you can defend in front of a CFO.
Segment-based testing: when your average lift is hiding opposing effects
A winning A/B test with 4% aggregate lift might be a 20% win in one segment and a 10% loss in another. The aggregate is an average of opposing effects. Segment analysis catches it — and lets you ship the win to the segments that benefit while not shipping the loss to the ones that don't.
Use this in Claude
Run this methodology inside your Claude sessions.
Orbit turns every guide on this site into an executable Claude skill — 63 lifecycle methodologies, 91 MCP tools, native Braze integration. Free for everyone.