Updated · 8 min read
Sample size: the calculation everyone gets wrong in email A/B tests
Picture the standard email A/B test most teams run on a Tuesday: split the list 50/50, send variant A to one half, variant B to the other, wait a day, declare a winner from whatever p-value the platform spits out (roughly, the chance of seeing a difference this large if there were no real effect at all). The trouble is, the test was almost certainly too small to spot the kind of lift that actually happens. So the “winner” is either noise dressed up as signal, or a real result the test was blind to. Five minutes of arithmetic before you press send fixes the lot. Here’s the short version.

By Justin Williames
Founder, Orbit · 10+ years in lifecycle marketing
Every A/B test has a vision range. Most teams never check theirs
Start with the picture, not the formula. Every A/B test has a kind of vision range — the smallest difference between two variants it's capable of seeing. Send to too few people and the range is wide: the test can spot a 10% lift but is blind to anything smaller. Send to many more people and the range narrows: now it can spot a 2% lift cleanly. Sample size — how many users you put into each variant — is the dial that controls it.
That vision range has a name in stats land. It's called the minimum detectable effect, or MDE — the smallest lift the test can reliably tell apart from random noise. So if the real effect of your treatment is 3% and your test is sized to detect 10%, the test physically cannot see what's there. It returns "no significant difference," the team calls it a null, and a genuine win gets quietly thrown away.
Running an underpowered test isn't a neutral act. You've spent the send, burnt the audience, and learned nothing. Worse, you might conclude there's no effect when there is.
Four numbers determine how many users you need. None of them are negotiable, but all of them are knobs you can turn:
1. Baseline conversion rate. What the control arm — the unchanged version, your existing email — is expected to convert at. Open rate, click rate, purchase rate, whichever metric the test is judged on.
2. Minimum detectable effect (MDE). The smallest lift you actually care about catching. Usually expressed as a relative percentage: a 5% relative MDE on a 25% baseline means detecting a move from 25% → 26.25%.
3. Significance level (α). The false-positive rate you'll accept — how often you're willing to declare a winner that isn't one. Conventionally 0.05, i.e. a 5% chance of crying wolf.
4. Statistical power (1-β). Probability the test catches a real effect when one is genuinely there. Conventionally 0.80 — an 80% chance of seeing a true win, a 20% chance of missing it.
The arithmetic — and the calculator that does it for you
Most email tests compare two proportions: the conversion rate of variant A versus the conversion rate of variant B. The full formula uses standard-error calculations (the maths of how spread out a result is around its average), but this shortcut gets you within a few percent of the right answer:
n per variant ≈ 16 × p(1-p) / (MDE × p)²
Where p is the baseline rate (as a decimal, so 0.25 for 25%) and MDE is the relative detectable effect (0.05 for a 5% relative lift).
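For reference (and to show where the 16 comes from), the shortcut approximates the standard two-proportion sizing formula, written here in the same plain notation:
n per variant = (zα + zβ)² × (p1(1-p1) + p2(1-p2)) / (p2 - p1)²
where p1 is the baseline, p2 = p1 × (1 + MDE), and zα and zβ are the standard-normal values for your two-sided significance level and power (1.96 and 0.84 at the conventional 0.05 / 0.80). Plugging in: (1.96 + 0.84)² ≈ 7.85, and with p1 and p2 close together the variance term is roughly 2 × p(1-p), so the whole thing collapses to about 16 × p(1-p) divided by the squared absolute lift, which is the shortcut above.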
Worked example. A welcome email currently opens at 25%, and you want to detect a 5% relative lift — i.e. spot a move from 25% to 26.25%. Plugging in: n = 16 × 0.25 × 0.75 / (0.05 × 0.25)² = 19,200 per variant. Total test: 38,400 users.
Don't fancy the algebra by hand? Free calculators (Evan Miller's, Optimizely's, AB Tasty's) all do exactly this. Plug in baseline, MDE, α=0.05, power=0.80. Read the users-per-variant out; double for the total. The result is the floor — anything below that and the test is, by definition, too blind to call.
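If you'd rather script it than bookmark a calculator, here's a minimal sketch of the shortcut in Python (plain arithmetic, no dependencies; the numbers reproduce the worked example above):

```python
def sample_size_per_variant(baseline, relative_mde):
    """Approximate users needed per variant at alpha=0.05, power=0.80.

    Implements the shortcut n ~= 16 * p(1-p) / (MDE * p)^2, where
    MDE * p is the absolute lift you want to be able to detect.
    """
    absolute_lift = relative_mde * baseline
    return 16 * baseline * (1 - baseline) / absolute_lift ** 2

# Worked example: 25% baseline open rate, 5% relative MDE
n = sample_size_per_variant(0.25, 0.05)
print(round(n))      # 19200 per variant
print(round(n) * 2)  # 38400 total
```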
When your list is too small for the test you want to run
Lifecycle teams hit this wall in the first month, almost without exception. The fix is to change what you're testing, not bend the maths.
Test bigger swings. If your list only supports detecting a 15% lift, single-word subject-line tweaks are a waste of sends — the effect is below your vision range. Test full subject-line rewrites, full creative changes, full template overhauls. Make the change big enough to detect, or don't spend the send.
Pool similar sends. Instead of one test on one send, run the same treatment across five comparable sends and combine the data. Five sends of 10,000 each gives you a 50,000-user effective sample — same statistical power, just spread across a campaign rather than a single moment.
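What pooling looks like in code, for the avoidance of doubt: sum recipients and conversions per arm across the sends, then run your usual comparison on the pooled counts. The send-level numbers below are made up for illustration, and the approach assumes the treatment ran unchanged in every send.

```python
# Hypothetical counts from five comparable sends of the same A/B treatment:
# (control recipients, control conversions, variant recipients, variant conversions)
sends = [
    (5000, 1260, 5000, 1330),
    (5000, 1245, 5000, 1310),
    (5000, 1230, 5000, 1295),
    (5000, 1270, 5000, 1340),
    (5000, 1250, 5000, 1325),
]

n_control = sum(s[0] for s in sends)
x_control = sum(s[1] for s in sends)
n_variant = sum(s[2] for s in sends)
x_variant = sum(s[3] for s in sends)

print(f"Control: {x_control / n_control:.4f} on {n_control} users")
print(f"Variant: {x_variant / n_variant:.4f} on {n_variant} users")
# Feed the pooled counts into your usual two-proportion test. The 25,000 users
# per arm is what sets the detectable effect, not any single 5,000-user send.
```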
Use pre/post comparisons for the big stuff. Not technically an A/B test, but fine for surfacing directional lift from the kind of changes that don't split cleanly — full template rebuild, new program launch, fundamental flow change. Compare 4 weeks before to 4 weeks after, controlling for volume and seasonality.
Test at the program level, not the send level. Instead of subject-line-of-the-week, test entire flow variants — three-email sequence A versus three-email sequence B, with users randomised to one arm for the whole journey. The effect compounds across the touchpoints, which surfaces program-level lifts your list could never reveal at a single-send level.
Where teams quietly cheat the inputs (and don't realise it)
Sample size is only as honest as the four numbers you feed it. The same four mistakes show up everywhere:
Baseline rate too optimistic. Borrowing an industry benchmark ("email open rates average 30%") when your actual rate is 18%. The smaller the real baseline, the larger the sample needed — so flattering the input quietly under-sizes the test. Use your own number, always. Pull it from the last four sends.
MDE too optimistic. Setting MDE at 10% because last quarter's blog post said that's "what's possible." Real effects on subject-line tests cluster around 1–5%. Content tests, 3–8%. Sizing for a 10% MDE means your test is calibrated to find a unicorn and miss everything else.
Power too low. 0.80 is the conventional floor. Drop to 0.70 or lower and you're inflating the false-negative rate — tests that say "no effect" when there absolutely was one. Don't cut power just to make the required sample fit your list. Either accept the bigger sample or accept that you can't detect small effects with the audience you've got.
The subtler trap: the list growing during the test. Lock the eligible audience the moment the test starts. Users who subscribe mid-test belong to the next test, not this one. Keep that boundary clean and you avoid the "variant B got more engaged users by accident of timing" artefact that quietly poisons quarterly readouts.
The A/B testing playbook covers the broader test-design questions; this guide is the maths underneath it.
Two situations where the standard recipe needs adjusting
Two common variants need their own handling — both are places where teams genuinely don't know they're bending the rules.
Multiple metrics in one test. Testing a new subject line and reading open rate, click rate, click-to-open rate, and revenue per recipient all at once. The trouble: the more metrics you check, the higher the chance one of them looks "significant" by random chance — like rolling enough dice that one of them eventually comes up six. Two ways out. Either pre-register one primary metric (usually click-through or revenue per recipient) and treat the rest as exploratory, or apply a multiple-comparisons correction. Bonferroni — divide your significance threshold (α) by the number of metrics you're checking — is the simplest. Most email programs get away without correction because the funnel metrics are tightly correlated, but the moment you're testing genuinely unrelated metrics simultaneously, correction matters.
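A minimal sketch of the Bonferroni version, with made-up p-values for four metrics (the metric names and numbers are illustrative only):

```python
alpha = 0.05
p_values = {
    "open_rate": 0.012,
    "click_rate": 0.048,
    "click_to_open": 0.031,
    "revenue_per_recipient": 0.21,
}

# Bonferroni: divide the significance threshold by the number of metrics checked
corrected_alpha = alpha / len(p_values)  # 0.0125 for four metrics

for metric, p in p_values.items():
    verdict = "significant" if p < corrected_alpha else "not significant"
    print(f"{metric}: p = {p} -> {verdict} at corrected alpha = {corrected_alpha}")
```

Note how click_rate at p = 0.048 would have passed the uncorrected 0.05 threshold but fails the corrected one; that is exactly the kind of borderline "winner" the correction exists to filter.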
Peeking at the data and stopping when it hits significance. Tempting because the platforms encourage it. Statistically catastrophic. Every additional peek inflates the false-positive rate substantially — by the fifth peek you might be running an effective α of 14% instead of the 5% you signed up for. Two clean options. Pre-commit to a sample size and check only at the end, or use platforms with sequential analysis (the maths designed for ongoing peeks) built in — Optimizely, VWO SmartStats. Rolling your own corrections is how teams end up with confidently-stated nonsense.
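If the inflation figure sounds abstract, a quick A/A simulation makes it concrete. The sketch below assumes no real effect at all, five equally spaced peeks, and a plain two-proportion z-test at each peek; stop the moment anything crosses p < 0.05 and count how often that happens. The exact rate depends on the setup, but it lands well above the 5% you thought you were running.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

def peeking_false_positive_rate(baseline=0.25, n_per_peek=4000, peeks=5, sims=2000):
    """Fraction of A/A tests (no real effect) ever declared significant across repeated peeks."""
    false_positives = 0
    for _ in range(sims):
        conv_a = conv_b = n_a = n_b = 0
        for _ in range(peeks):
            # Another batch of sends arrives; both arms draw from the same true rate
            conv_a += rng.binomial(n_per_peek, baseline)
            conv_b += rng.binomial(n_per_peek, baseline)
            n_a += n_per_peek
            n_b += n_per_peek
            # Two-proportion z-test on the data accumulated so far
            pooled = (conv_a + conv_b) / (n_a + n_b)
            se = np.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
            z = (conv_b / n_b - conv_a / n_a) / se
            p_value = 2 * norm.sf(abs(z))
            if p_value < 0.05:
                false_positives += 1  # "winner" declared from pure noise
                break
    return false_positives / sims

print(peeking_false_positive_rate())  # typically around 0.13-0.14, not the 0.05 you set
```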
After the test ends: the calculation that keeps you honest
One more habit, and it's the one most programs skip. After the test ends, run the calculation in reverse: given the actual baseline rate the test produced, and the actual sample size you ran, what's the smallest effect this test could have detected? That's the retrospective MDE — the test's actual vision range, measured in hindsight rather than predicted up front.
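The same shortcut runs in reverse. A minimal sketch, using the observed baseline and the sample you actually ran per variant (it inverts the approximation from earlier, so treat the output as a ballpark):

```python
def retrospective_mde(observed_baseline, n_per_variant):
    """Smallest relative lift the completed test could have detected at
    alpha=0.05 / power=0.80, by inverting n ~= 16 * p(1-p) / (MDE * p)^2."""
    absolute_mde = (16 * observed_baseline * (1 - observed_baseline) / n_per_variant) ** 0.5
    return absolute_mde / observed_baseline

# Example: the test ended up with 8,000 users per variant at a 22% observed baseline
print(f"{retrospective_mde(0.22, 8000):.1%}")  # ~8.4% relative lift; anything smaller was invisible
```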
If the test returned "no significant difference" and the retrospective MDE comes back at 10%, the only honest conclusion is: "no effect of 10% or larger." A 4% lift could absolutely still be there, hiding inside the test's blind spot. The test didn't prove there's no effect — it proved you weren't sized to see one.
This is the discipline most teams skip. "No significant difference" is not "no effect" — it's "no effect large enough for this particular test to detect." Treat null results as absence of evidence, not evidence of absence. That single phrasing change kills about half the bad conclusions a marketing org draws from underpowered tests.
How do you explain this to a non-technical stakeholder when they ask why you can't just "run a quick test"? One sentence usually does it: "This test needs X users per variant to detect a Y% lift. If we want to find smaller effects, we need more users. If we accept we can only detect bigger effects, we can run with fewer. Either's fine — the only wrong answer is running a small test and reporting the result as if it told us anything."
Orbit bakes pre-test sample size calculation and post-test retrospective MDE into the standard output of every experiment. Tests without either tend to become theatre — confident conclusions from inconclusive data, presented convincingly enough that nobody in the room pushes back.
Frequently asked questions
- How do I calculate sample size for an email A/B test?
- Three inputs: baseline conversion rate (current rate you're comparing against), minimum detectable effect (smallest relative lift you want to detect — 10% means "detect a lift from 3% to 3.3%"), and confidence + power settings (95% confidence / 80% power is standard). Formula uses a two-proportion z-test. The Orbit Sample Size Calculator at /apps/sample-size computes it for any baseline and MDE and also gives expected test duration if you provide daily volume.
- What's the minimum audience size for an email A/B test?
- Depends entirely on baseline rate and MDE. Rough thresholds at 95%/80%: a 3% baseline with 10% MDE needs ~55,000 per variant; loosen the MDE to 20% and that drops to ~14,000; tighten it to 5% and it explodes to ~215,000. At a 10% baseline with 10% MDE you need around 16,000 per variant. The tighter the effect you want to detect, the larger the sample required — roughly quadrupling as you halve the MDE.
- What's the difference between MDE and statistical significance?
- MDE (minimum detectable effect) is set BEFORE the test and determines required sample size. It answers: what's the smallest lift this test can reliably detect? Statistical significance is measured AFTER the test and answers: given the results, how likely is the observed difference to be real? An undersampled test will usually miss a real effect even when it's there; a properly sized test detects a real effect at the chosen power while keeping false positives to the chosen significance level.
- Why do I need 80% power?
- Power is the probability that the test correctly detects an effect when there is one. 80% power means 80% of the time, a test with a real underlying effect will produce a significant result. Setting power lower (70% or 60%) means more real effects go undetected — you shrug off real wins as "inconclusive." Setting power higher (90% or 95%) requires larger samples but catches more real effects. 80% is the standard trade-off point for most marketing decisions; 90% is used when the cost of missing a real effect is high.
- Can I run multiple A/B tests on the same audience?
- Yes, but with discipline. Testing two changes in the same email (different subject line AND different CTA) is a factorial test — analyse as a 2×2 rather than separate A/B tests, or run one change at a time. Testing different changes on different segments of the same list is fine as long as the segments don't overlap in a way that would contaminate results. What's NOT okay is comparing variant A to control, variant B to control, and variant C to control all at once without adjusting for multiple comparisons — that inflates false-positive rate.
Related guides
A/B testing in email: sample size, novelty, and what to report
Most email A/B tests produce winners that don't reproduce. Three reasons keep showing up: under-powered samples, the novelty effect, and weak readout discipline. This guide is about designing tests that actually drive decisions instead of theatre.
Price-testing through email: what's testable, what isn't
Email is the fastest place to try a new price, and the easiest place to learn the wrong lesson. What you can test cleanly, what you can't, and the measurement traps that quietly turn price tests into expensive false positives.
Send-time optimisation: what it really moves, and what it doesn't
Every ESP markets an STO feature and every vendor deck shows lift. The honest version: STO moves open rate 3–8%, rarely revenue, and only for certain program types. Here's when it's worth turning on.
False positives in email A/B tests: why half of winning tests don't actually win
Run enough A/B tests and some will show 'significant' lift from pure noise. Programs that ship every significant winner end up with a collection of imaginary improvements they can't tell apart from real ones. Here's how to spot the fakes and avoid the trap.
Holdout group design: the incrementality tool most lifecycle programs skip
Without a holdout, lifecycle ROI is attribution-model guesswork with a spreadsheet. With one, you get a defensible number you can actually put in front of finance. Here's how to size, run, and read a holdout — and the three mistakes that quietly invalidate the result.
Incrementality testing: the measurement that tells you if a program actually works
Last-click attribution makes lifecycle look bigger than it is. Incrementality testing strips out users who would have converted anyway and surfaces the real number. This is how to design a test that produces a figure you can defend in front of a CFO.