A/B testing in email: sample size, novelty, and what to report
A winner at 95% confidence doesn't mean a real lift. A losing variant in one test doesn't mean a broken idea. Most email A/B tests produce results that look decisive and don't survive replication — and the gap between statistical significance and operational significance is where most lifecycle experimentation effort gets wasted. This guide is about designing tests that produce decisions, not noise.

By Justin Williames
Founder, Orbit · 10+ years in lifecycle marketing
Why the same test, run twice, often gives you opposite answers
Picture a clinical trial for a new drug. Researchers split patients into two groups: one gets the drug (the variant — the new thing being tested), the other gets a placebo (the control — the existing baseline). Then they wait, count outcomes, and ask: is the difference between the two groups bigger than what you'd expect from random chance alone? An email A/B test is exactly that, with lower stakes and lazier maths. Half your list gets the new subject line, half gets the existing one, and at the end somebody reports a winner.
The problem is that most lifecycle teams report winners the trial scientists wouldn't. A drug company that ran an under-powered trial — too few patients to reliably tell drug from coincidence — wouldn't ship the drug. Marketing teams ship the drug.
The first concept that decides whether your test is real or theatre is sample size — the number of users in each group, which sets how small a difference the test can reliably see. Tied to it is the minimum detectable effect (MDE — the smallest lift the test can spot above random noise). MDE is a function of sample size, your baseline conversion rate (the rate the control already converts at), and two thresholds you pick up front: confidence (typically 95%, the chance you're not calling random fluctuation a winner) and power (typically 80%, the chance you'll actually catch a real effect if one exists). Nothing exotic. All of it knowable before the test starts.
- 400 conversions per variant to detect a 30% lift at 95% confidence, 80% power.
- 3,800 conversions per variant to detect a 10% lift at the same thresholds.
- 400,000 conversions per variant to detect a 1% lift. Most programs will never see this sample.
Translate that into practical numbers, assuming a 20% baseline conversion rate. To detect a 30% relative lift, you need roughly 400 conversions per variant. Want to catch a 15% lift? About 1,700. Drop to 10% and the cost climbs to ~3,800. A 5% lift needs around 15,000 per variant; a 1% lift, somewhere near 400,000. Pick the lift you're actually prepared to act on before you pick the test, not after.
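If you want to sanity-check thresholds like these for your own baseline, the standard two-proportion power calculation fits in a few lines. A minimal sketch, assuming 95% confidence, 80% power, and the normal approximation; it returns users per arm, so the stat-card figures above (quoted as conversions) won't match it one-for-one, and the Orbit calculators may use different conventions.

```python
import math
from statistics import NormalDist

def users_per_variant(baseline_rate: float, relative_mde: float,
                      confidence: float = 0.95, power: float = 0.80) -> int:
    """Users needed in each arm to detect a relative lift of `relative_mde`
    over `baseline_rate`, via the normal approximation for two proportions."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# 3% baseline, 10% relative lift (3% -> 3.3%): roughly 53,000 users per arm,
# in the same range as the ~55,000 quoted in the FAQ below.
print(users_per_variant(0.03, 0.10))
```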
Most lifecycle programs run tests with 500–2,000 conversions per variant and claim to detect 5% lifts. Mathematically, they can't. What's happening is one of three things: a genuine large lift (which is detectable), random noise misread as a lift (a false positive — the test calls a winner that isn't one), or a real small lift that wasn't reliably detected (a false negative — a real effect lost in noise). Without a pre-registered MDE, you can't tell which — and claiming the number as a win is a vibes-based decision dressed up as a quantitative one.
Is 95% confidence enough? For low-stakes decisions (subject lines, minor copy) 90% is fine. For high-stakes decisions that lock in strategy for a year (channel strategy, onboarding overhauls), require 99% and validate with a follow-up test. 95% is the convention, not a law. Confidence notation versus p-values is mostly a stakeholder choice. They're the same idea inverted: the p-value is the chance of seeing a difference this large when there's no real effect, so p=0.03 reads as 97% confidence. Confidence reads more intuitively in a deck. P-values are tighter in analytical contexts. Many readouts include both. The notation matters less than showing the absolute volume alongside it: a 97%-confidence 12% lift adding 340 conversions is a different animal from one adding 34.
The Orbit Significance Calculator shows the MDE for your current sample size inline — you'll see whether the test is actually powered to detect a meaningful lift before you run it, rather than finding out after.
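The inverse question, what is the smallest lift a given sample can credibly detect, is the one worth answering before launch. A rough sketch under the simplifying assumption that both arms share the baseline variance; the function name and example inputs are illustrative, not the calculator's implementation.

```python
from statistics import NormalDist

def minimum_detectable_lift(baseline_rate: float, users_per_arm: int,
                            confidence: float = 0.95, power: float = 0.80) -> float:
    """Approximate smallest relative lift detectable with `users_per_arm` per arm.
    Uses the baseline variance for both arms, which is reasonable for small lifts."""
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    absolute_mde = (z_alpha + z_beta) * (
        2 * baseline_rate * (1 - baseline_rate) / users_per_arm) ** 0.5
    return absolute_mde / baseline_rate  # expressed as a relative lift

# 20% baseline, 5,000 users per arm: lifts below roughly 11% are invisible to this test.
print(round(minimum_detectable_lift(0.20, 5_000), 3))  # ~0.112
```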
When the new thing wins for the first three days and loses by week two
Switch the lens from sample size to time. Even a properly-powered test can mislead you if you read it too early — and the reason has nothing to do with maths.
The pattern is the novelty effect — when a variant outperforms control purely because it's new, not because it's better. Users notice the difference, engage with it for the first 3–7 days, then habituate and drift back to baseline behaviour. The thing you're measuring isn't the variant's long-run quality. It's how surprising the variant is the first time the user sees it. Two different questions, dressed in the same numbers.
Watching only cumulative metrics — the running total since the test began — obscures this completely. A variant 8% ahead on day 3 and 1% behind on day 10 shows up as "+3% cumulative" in a seven-day readout. That number reads as a marginal win and is actually the signature of fading novelty. Ship it and the lift evaporates within a cycle.
The defence has two parts. First, run every A/B test for at least two full cycles of your natural sending rhythm — for daily sends, two weeks; for weekly sends, a month. Second, report weekly incremental lift (week-by-week, not cumulative) so you can see the trajectory. If the variant wins week one and loses week two, that's a novelty effect, not a winner. The discipline is boring and the number of teams who skip it is not.
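The weekly readout is easy to script. A minimal sketch with invented counts, chosen only to show how a fading-novelty test looks when week-by-week lift sits next to the cumulative number:

```python
# Invented weekly (conversions, recipients) per arm, purely to illustrate the readout.
control = [(800, 10_000), (820, 10_000)]
variant = [(864, 10_000), (812, 10_000)]

for week, ((c_conv, c_n), (v_conv, v_n)) in enumerate(zip(control, variant), start=1):
    lift = (v_conv / v_n) / (c_conv / c_n) - 1
    print(f"week {week}: {lift:+.1%} incremental lift")

cum_control = sum(c for c, _ in control) / sum(n for _, n in control)
cum_variant = sum(c for c, _ in variant) / sum(n for _, n in variant)
print(f"cumulative: {cum_variant / cum_control - 1:+.1%}")

# week 1: +8.0%, week 2: -1.0%, cumulative: +3.5%. The cumulative number alone reads
# as a modest win; the weekly trajectory says the novelty is wearing off.
```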
The Orbit Experiment Design skill handles all of this pre-test planning — sample size calculation, duration setting, and the readout structure that separates real winners from novelty ones.
Why testing nine variants at once is mostly testing your luck
Intuition trap: more variants means a faster path to a winner. Maths says the opposite — more variants means a higher chance you call something a winner that isn't. This is the multiple-comparisons problem: every additional variant you bolt on inflates the odds that at least one of them clears the significance bar by chance alone.
Numbers are sharper than people expect. One variant versus control at 95% confidence: 5% false-positive rate — exactly what 95% confidence promises. Nine variants versus control, each tested at 95% confidence: the chance that at least one hits significance purely by chance is roughly 37% (1 - 0.95^9), not 5%. When none of the nine variants is actually better, nearly four in ten such tests will still hand you a "winner" with a p-value stapled to it.
Textbook fix: Bonferroni correction — divide your chosen alpha (the false-positive rate you're willing to tolerate, usually 5%) by the number of comparisons. With 9 variants, each individual comparison has to clear 99.4% confidence (an alpha of 0.05 / 9 ≈ 0.0056) to keep the false-positive rate across the whole family of tests at 5%. Conservative, but honest.
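Both numbers take one line each to verify. A short sketch of the family-wise false-positive rate and the Bonferroni threshold it implies:

```python
alpha = 0.05      # per-comparison false-positive rate
variants = 9      # variants tested against control

# Chance that at least one of nine null comparisons clears the 95% bar by luck alone.
family_wise_fp = 1 - (1 - alpha) ** variants
print(f"family-wise false-positive rate: {family_wise_fp:.1%}")  # ~37%

# Bonferroni: shrink the per-comparison alpha so the family stays at 5% overall.
bonferroni_alpha = alpha / variants
print(f"per-comparison confidence required: {1 - bonferroni_alpha:.1%}")  # 99.4%
```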
For lifecycle programs the more practical move is structural: run fewer variants per test (2–4 tops), and run more tests in sequence. Sequential small tests beat parallel large ones for most lifecycle work — you learn faster, your MDE is lower per test because the per-variant sample is bigger, and the multiple-comparisons problem goes away for free. Pricing and discount tests have their own failure modes that general rules don't catch; the price-testing guide covers the specifics.
What a stakeholder-ready readout looks like (and why one number isn't one)
The classic readout sin is reporting a single number — "Variant B won by 12%" — and treating it as a decision. It isn't. It's a fragment. A real readout carries three numbers together for every test result: observed relative lift (the percentage difference between variant and control), confidence level, and absolute conversion volume (how many extra conversions the lift actually represents). Written out: "Variant B had a 12% relative lift in open rate, 97% confidence, adding 340 extra opens across 50K recipients."
Each number changes the decision. Twelve percent lift at 60% confidence? Noise — the test never cleared the bar. One percent lift at 99% confidence across 5M recipients? Possibly a major win — small percentages on big bases are real money. Thirty percent lift at 95% confidence that's only 40 extra conversions in absolute terms? Probably not worth the operational change to ship it. All three numbers together force a complete read; any one of them alone invites wishful thinking.
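A readout helper that refuses to emit fewer than all three numbers is one way to make the habit stick. A minimal sketch using the standard two-proportion z-test, with made-up counts; the wording of the summary line is illustrative, not a house format:

```python
from statistics import NormalDist

def readout(name, control_conv, control_n, variant_conv, variant_n):
    """Relative lift, confidence from a two-sided two-proportion z-test, absolute volume."""
    p_c = control_conv / control_n
    p_v = variant_conv / variant_n
    lift = p_v / p_c - 1
    pooled = (control_conv + variant_conv) / (control_n + variant_n)
    se = (pooled * (1 - pooled) * (1 / control_n + 1 / variant_n)) ** 0.5
    p_value = 2 * (1 - NormalDist().cdf(abs(p_v - p_c) / se))
    extra = round((p_v - p_c) * variant_n)  # extra conversions at the variant's volume
    return (f"{name}: {lift:+.0%} relative lift, {1 - p_value:.0%} confidence, "
            f"{extra:+d} extra conversions on {variant_n:,} recipients")

# Made-up counts for illustration.
print(readout("Variant B", control_conv=565, control_n=5_000,
              variant_conv=633, variant_n=5_000))
# -> Variant B: +12% relative lift, 96% confidence, +68 extra conversions on 5,000 recipients
```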
Also report the losing variants. Every losing variant contains information. A specific variant losing by a large margin tells you something about that specific direction — don't pretend it didn't happen because it wasn't your winner. The negative result is half the value of running the test in the first place.
What's worth the calendar time, and what's performance art
Not every test is worth running, even when the maths is clean. The question is whether the effect you're likely to see is large enough that your sample size can detect it, and meaningful enough that you'd act on it.
Reliably produce learnings: subject line structure variants, CTA copy and placement, hero-image versus text-first layouts, send-time optimisation, sender-name variations, personalisation depth (shallow versus deep). High inherent variance — meaning the variants tend to produce genuinely different outcomes — paired with clear hypotheses. Worth the calendar time.
Usually produce noise: small colour changes, minor layout tweaks, single-word copy edits, audiences under 10K per variant. Not because these things don't matter — they might — but because the size of effect they produce is too small for normal lifecycle volumes to reliably detect. Running the test anyway is performance art, not experimentation.
One decision rule, plainly stated: if you're not prepared to act on a 5% lift, don't run a test that can only detect a 20% lift. It won't answer a question you'd act on, and the time spent running it is time not spent on tests that would. Experimentation portfolios get the same discipline as campaign portfolios; ruthless prioritisation compounds.
Frequently asked questions
- How many emails do I need for an A/B test?
- Depends on baseline conversion rate and minimum detectable effect. Rough thresholds: at a 3% baseline conversion rate, detecting a 10% relative lift (i.e., 3% → 3.3%) needs ~55,000 users per variant at 95% confidence / 80% power. Detecting a 20% lift needs ~14,000. Detecting a 5% lift needs ~215,000. The Orbit Sample Size Calculator at /apps/sample-size computes exact numbers for your specific baseline and MDE.
- What's a good A/B test for email?
- Tests with high inherent variance and clear hypotheses: subject line structure, CTA copy and placement, hero-image vs text-first layouts, send-time optimisation, sender-name variations, personalisation depth. Tests that usually produce noise: small colour changes, minor layout tweaks, single-word edits, tests on audiences under 10K per variant. If you can't detect a meaningful effect, running the test is performance art.
- How long should an A/B test run?
- Until it reaches the pre-calculated sample size at the declared significance and power levels. Not 7 days because that's the calendar week, not until a winner 'looks clear' after 48 hours. The time required is a function of daily volume and required sample size. A test that needs 100,000 users per arm at 10,000 daily volume takes 20 days — not 7, not 3. Stopping early inflates false positives dramatically.
- What's statistical significance in email A/B testing?
- Statistical significance measures how likely a difference as large as the one you observed would be if there were no real effect. Conventional threshold: p-value below 0.05 = 95% confidence the effect is real. But significance depends on sample size — a test with 100 users per arm will almost never reach significance at realistic effect sizes, even when a real effect exists. Pre-compute the required sample size before starting; run to that size; declare the winner on the primary metric only. The Orbit Significance Calculator at /apps/significance runs the two-proportion z-test.
- Can I test subject lines after Apple MPP?
- Only on the click-through or conversion rate, not the open rate. Open rate from Apple Mail clients is inflated to near 100% by MPP's pre-fetching, so subject-line tests measured on opens are contaminated by Apple's share of the audience. Use click-to-open rate, click-through rate, or downstream conversion as the primary metric. Opens can still be a secondary diagnostic, not the KPI.
Related guides
Sample size: the calculation everyone gets wrong in email A/B tests
Most email A/B tests are powered to detect effects far larger than the test could actually produce. The result: false positives and false nulls, with confident conclusions in both directions. Sample size calculation fixes this before you send. Here's the 5-minute version.
Price-testing through email: what's testable, what isn't
Email is the fastest place to try a new price, and the easiest place to learn the wrong lesson. What you can test cleanly, what you can't, and the measurement traps that quietly turn price tests into expensive false positives.
Send-time optimisation: what it really moves, and what it doesn't
Every ESP markets an STO feature and every vendor deck shows lift. The honest version: STO moves open rate 3–8%, rarely revenue, and only for certain program types. Here's when it's worth turning on.
False positives in email A/B tests: why half of winning tests don't actually win
Run enough A/B tests and some will show 'significant' lift from pure noise. Programs that ship every significant winner end up with a collection of imaginary improvements they can't tell apart from real ones. Here's how to spot the fakes and avoid the trap.
Holdout group design: the incrementality tool most lifecycle programs skip
Without a holdout, lifecycle ROI is attribution-model guesswork with a spreadsheet. With one, you get a defensible number you can actually put in front of finance. Here's how to size, run, and read a holdout — and the three mistakes that quietly invalidate the result.
Incrementality testing: the measurement that tells you if a program actually works
Last-click attribution makes lifecycle look bigger than it is. Incrementality testing strips out users who would have converted anyway and surfaces the real number. This is how to design a test that produces a figure you can defend in front of a CFO.