Guides
Experimentation
Measurement is where most lifecycle programs fool themselves. Running tests without sample-size math. Declaring winners from noise. Confusing last-click revenue with incremental revenue. These guides cover the discipline that separates real learning from confirmation theatre.
A lifecycle team that runs 20 A/B tests a year at a 0.05 significance threshold should expect roughly one false-positive winner from pure noise alone. Most teams don't track how many tests they've run, so the false winners become 'learnings', propagate through the playbook, and quietly underperform. The gap between the claimed lifts and the aggregate program improvement is the tax of undisciplined experimentation.
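That arithmetic is worth making explicit. A minimal sketch, assuming every one of the 20 tests is evaluating a change with no real effect:

```python
# Expected false-positive "winners" from running n null tests at a given
# significance threshold. Numbers match the scenario above.
alpha = 0.05    # per-test significance threshold
n_tests = 20    # tests run in a year

expected_false_winners = n_tests * alpha             # = 1.0
prob_at_least_one = 1 - (1 - alpha) ** n_tests       # ~ 0.64

print(f"Expected false winners: {expected_false_winners:.1f}")
print(f"Chance of at least one: {prob_at_least_one:.0%}")
```

Even with nothing but null tests, the chance of at least one 'winner' is close to two in three.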
The guides in this category cover the full testing stack. Sample size calculation — the 5-minute math that tells you whether a test can detect the effect you're looking for before you run it. The holdout group pattern — randomly suppressing a small population from a program so you can see its real incremental lift, not just its last-click attributed revenue. A/B testing structure — one primary metric, pre-registered, sized for a realistic effect, read at the end, not during.
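To make the sample-size step concrete, here is a minimal sketch using statsmodels; the baseline and target rates are placeholder assumptions, not recommendations:

```python
# Pre-test sample size for a two-arm email test on a conversion rate.
# baseline_rate and target_rate are illustrative assumptions.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.030   # current conversion rate of the email
target_rate = 0.033     # smallest lift worth acting on (+10% relative)

effect = proportion_effectsize(target_rate, baseline_rate)  # Cohen's h
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, alternative="two-sided"
)
print(f"Recipients needed per arm: {n_per_arm:,.0f}")
```

For these example rates the answer lands in the tens of thousands of recipients per arm, which is exactly the kind of number worth knowing before the send rather than after.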
Then the measurement stack. Cohort retention analysis — the one chart that tells you if retention is actually improving, stratified by cohort week or signup channel. Attribution models and which one to use for which question (first-touch for acquisition, last-click for transactional, multi-touch for anything in between, holdout for the honest incrementality answer). Send-time optimisation and the gap between vendor-claimed and measured lift. False-positive prevention and how to spot a 'winning' test that will not replicate.
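To make the cohort chart concrete: a minimal sketch with pandas, assuming a signups table and an activity-events table keyed by user_id (the frame and column names are assumptions):

```python
# Weekly cohort retention: one row per signup cohort, one column per
# week since signup, values = share of the cohort active that week.
import pandas as pd

def cohort_retention(signups: pd.DataFrame, events: pd.DataFrame) -> pd.DataFrame:
    # signups: user_id, signup_date; events: user_id, event_date (both datetime)
    df = events.merge(signups, on="user_id")
    df["cohort_week"] = df["signup_date"].dt.to_period("W")
    df["weeks_since_signup"] = (df["event_date"] - df["signup_date"]).dt.days // 7

    active = (
        df.groupby(["cohort_week", "weeks_since_signup"])["user_id"]
        .nunique()
        .unstack(fill_value=0)
    )
    cohort_sizes = (
        signups.assign(cohort_week=signups["signup_date"].dt.to_period("W"))
        .groupby("cohort_week")["user_id"]
        .nunique()
    )
    return active.div(cohort_sizes, axis=0)  # each row is one cohort's curve
```

Swap the cohort key for signup channel and the same shape answers the stratification question.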
Read these before you run the next test. Running an underpowered test isn't neutral — it spends the audience and produces conclusions that range from useless to actively wrong.
Most email A/B tests produce winners that don't reproduce. Three reasons keep showing up: underpowered samples, the novelty effect, and weak readout discipline. This guide is about designing tests that actually drive decisions instead of theatre.
10 min read
Advanced
Email is the fastest place to try a new price, and the easiest place to learn the wrong lesson. What you can test cleanly, what you can't, and the measurement traps that quietly turn price tests into expensive false positives.
9 min read
Intermediate
Without a holdout, lifecycle ROI is attribution-model guesswork with a spreadsheet. With one, you get a defensible number you can actually put in front of finance. Here's how to size, run, and read a holdout — and the three mistakes that quietly invalidate the result. The core readout math is sketched below.
9 min read
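For orientation, the readout from a holdout reduces to a few lines. A minimal sketch, assuming you logged converters and population sizes for both groups; the counts are placeholders and the significance check is one reasonable choice, not the only one:

```python
# Incremental lift read from a randomly suppressed holdout.
# All counts are placeholder assumptions.
from statsmodels.stats.proportion import proportions_ztest

treated_converters, treated_size = 4_200, 95_000   # received the program
holdout_converters, holdout_size = 380, 10_000     # randomly suppressed

treated_rate = treated_converters / treated_size
holdout_rate = holdout_converters / holdout_size
incremental_lift = (treated_rate - holdout_rate) / holdout_rate

_, p_value = proportions_ztest(
    [treated_converters, holdout_converters], [treated_size, holdout_size]
)
print(f"Incremental lift: {incremental_lift:.1%} (p = {p_value:.3f})")
```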
Advanced
Attribution debates are half epistemology, half politics. Last-touch is wrong but defensible. Multi-touch is more accurate but less defensible. Incrementality is the only one that answers the causal question — and it's the slowest. Here's which model to use for which question, and why.
10 min read
Advanced
A cohort retention curve is the single most useful analytical artefact in lifecycle marketing. It isolates real program impact from the compounding noise that every other metric hides, and it's the one view that survives every limitation of the simpler numbers. Here's how to build one and how to read it without kidding yourself.
9 min read
Intermediate
Most email A/B tests are powered only to detect effects far larger than the change being tested could plausibly produce. The result: false positives and false nulls, with confident conclusions in both directions. Sample size calculation fixes this before you send. Here's the 5-minute version.
8 min read
Intermediate
Every ESP markets an STO feature and every vendor deck shows lift. The honest version: STO moves open rate 3–8%, rarely revenue, and only for certain program types. Here's when it's worth turning on.
7 min read
Advanced
Run enough A/B tests and some will show 'significant' lift from pure noise. Programs that ship every significant winner end up with a collection of imaginary improvements they can't tell apart from real ones. Here's how to spot the fakes and avoid the trap.
8 min read
Advanced
Last-click attribution makes lifecycle look bigger than it is. Incrementality testing strips out users who would have converted anyway and surfaces the real number. This is how to design a test that produces a figure you can defend in front of a CFO.
9 min read
Intermediate
A winning A/B test with 4% aggregate lift might be a 20% win in one segment and a 10% loss in another. The aggregate is an average of opposing effects. Segment analysis catches it — and lets you ship the win to the segments that benefit while not shipping the loss to the ones that don't. A per-segment readout is sketched below.
8 min read
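The check is mechanical once results are logged per recipient. A minimal sketch, assuming a results frame with variant, segment, and converted columns (names are assumptions):

```python
# Per-segment lift for a two-variant test; a positive aggregate can hide
# segments with negative relative_lift in this table.
import pandas as pd

def segment_lift(results: pd.DataFrame) -> pd.DataFrame:
    # results: one row per recipient with columns variant ("control" /
    # "treatment"), segment, converted (0/1)
    rates = results.pivot_table(
        index="segment", columns="variant", values="converted", aggfunc="mean"
    )
    rates["relative_lift"] = rates["treatment"] / rates["control"] - 1
    return rates.sort_values("relative_lift")
```

Segment cells are smaller than the aggregate, so each per-segment read still needs its own sample-size sanity check before anything ships or gets pulled.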
Advanced
Every vendor case study shows AI personalisation moving the numbers. Most internal post-mortems show the lift evaporating once a proper holdout is in place. The gap between the two is the measurement methodology. Here's the framework for proving — to yourself, your CFO, and the auditor — whether AI personalisation is actually earning its place.
8 min read