Updated · 9 min read
Incrementality testing: the measurement that tells you if a program actually works
Your win-back flow reports $2M in attributed revenue. Your CFO asks whether those users would have come back regardless. If you can't answer with evidence, the $2M evaporates on contact. Incrementality testing — comparing what happens with your program against what happens without it on otherwise-identical users — is how you answer. It produces the most defensible revenue number lifecycle can put on a slide. This is the design.

By Justin Williames
Founder, Orbit · 10+ years in lifecycle marketing
Your dashboard is taking credit for things it didn't do
Picture how most lifecycle programs report success. A dormant user gets a win-back email on Tuesday. They click. They buy on Thursday. A reporting model called last-click attribution — assigning revenue to whatever the user touched most recently before purchasing — credits the email. Tidy story. Easy to put in a slide.
Trouble is the implicit claim underneath the story: without that email, the user wouldn't have bought. Wrong often enough to matter. A meaningful share of those dormant users were already drifting back — checking the app, opening a competitor's tab, half-thinking about a purchase. Your email caught a wave that was already cresting. Your program did something. Just nothing like what your dashboard says.
Attributed revenue and incremental revenue — the bit your program actually caused — usually differ by 2–4x. A program attributed $2M typically produces $500K–$1M in real lift. The rest is re-credit of behaviour that would have happened on its own.
Incrementality testing fixes this. Cleanest way to picture it: a clinical drug trial. Half the patients get the drug, half get a sugar pill — the placebo arm — and whatever difference shows up between the two groups is what the drug actually did. Lifecycle works the same way. A random fraction of eligible users — the holdout, your placebo arm — gets no program. Everyone else gets it as normal. Whatever difference shows up in their downstream behaviour is the lift — the slice of revenue your program genuinely caused rather than merely sat near.
Attribution still earns its keep for daily operations. Incrementality calibrates it. Run attribution as the daily pulse, run incrementality as the annual or quarterly recalibration. Different instruments, different questions.
The five-step design — and why this isn't an A/B test
One distinction worth nailing first. An A/B test compares one variant of a message against another — subject line A versus subject line B, both groups receive the program. It tells you which version wins. An incrementality test compares with program against no program. It tells you whether the program is worth running at all. Both are useful. Each answers a different question, and people confuse them constantly.
Core design follows the clinical-trial scaffolding directly. Five steps, in order:
1. Define the program. The specific sequence, trigger, or treatment being tested. "The 3-message win-back flow triggered at 60 days of dormancy." Not "winback" as a vague concept — name the exact artefact going live.
2. Define the eligible population. Users who would normally qualify. "Users dormant exactly 60 days and not unsubscribed." Write it as if describing a SQL filter — the database query that pulls these users out — because at some point you will be.
3. Random assignment to treatment versus control. Usually 80/20 or 90/10 treatment-to-control. Treatment gets the program, control gets nothing. Random is load-bearing — both groups need to be statistically interchangeable on everything except the program itself, the way a clinical trial randomises patients to drug versus placebo. Everything else stays identical for both. A minimal assignment sketch follows the list.
4. Pre-register the outcome metric and measurement window. "Purchase rate within 30 days of trigger." Decide before the test starts. Write it down. Reasons matter and we'll get to them.
5. Measure. Treatment minus control. That difference is the incremental lift. A proportion test or t-test — standard statistical checks for whether two groups genuinely differ — tells you if it's significant or noise.
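To make step 3 concrete, here's a minimal assignment sketch in Python. The 90/10 split, the test name, and the user IDs are illustrative assumptions, not anything your platform requires; the load-bearing idea is that assignment is deterministic, so a user stays in the same arm for the life of the test.

```python
import hashlib

HOLDOUT_PCT = 10  # control share; a 90/10 split, assumed for illustration

def assign_arm(user_id: str, test_name: str = "winback_60d") -> str:
    """Deterministically assign a user to treatment or control.

    Hashing user_id together with the test name gives a stable
    pseudo-random bucket, so re-running the job can't flip anyone's arm.
    """
    digest = hashlib.sha256(f"{test_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "control" if bucket < HOLDOUT_PCT else "treatment"

# Assign the eligible population pulled by the step-2 filter (placeholder IDs).
eligible_user_ids = ["u_1001", "u_1002", "u_1003"]
assignments = {uid: assign_arm(uid) for uid in eligible_user_ids}
```

Because the hash is keyed on the test name, a new test gets a fresh randomisation rather than inheriting the previous holdout.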
One technical principle that matters once you're running this in the real world: intent-to-treat. Anyone assigned to treatment counts as treatment, even if they never opened the email. Anyone in the control counts as control. You don't exclude people who didn't engage — that's how teams accidentally over-credit their program by quietly removing the users it failed to reach. Randomisation is what makes the comparison fair. Don't break it after the fact.
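Here's what that intent-to-treat readout looks like as a sketch, assuming statsmodels is available; the counts are invented and sized to a 90/10 split, not benchmarks.

```python
from statsmodels.stats.proportion import proportions_ztest

# Intent-to-treat counts: everyone *assigned* to each arm, whether or not
# they opened, clicked, or even received the email. Illustrative numbers.
treatment_buyers, treatment_assigned = 4_320, 90_000
control_buyers, control_assigned = 440, 10_000

treatment_rate = treatment_buyers / treatment_assigned   # 4.8%
control_rate = control_buyers / control_assigned         # 4.4%
absolute_lift = treatment_rate - control_rate             # 0.4 points

z_stat, p_value = proportions_ztest(
    count=[treatment_buyers, control_buyers],
    nobs=[treatment_assigned, control_assigned],
)
print(f"lift={absolute_lift:.2%}, z={z_stat:.2f}, p={p_value:.3f}")
```

With these made-up numbers the 0.4-point lift comes back directionally positive but not significant at the usual 0.05 threshold, which is exactly the read an undersized control tends to produce. That's the sizing problem the next section deals with.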
Sizing the holdout — and selling it to the people who hate it
Your holdout — the placebo arm, users who get nothing — needs to be large enough to detect the lift you expect to see. Usually 5–20% of eligible users, sized to the effect.
Expected lift 5% or more (strong programs): 5% holdout is usually enough.
Expected lift 2–5% (typical): 10–15% holdout.
Expected lift under 2% (marginal): 20%+ holdout. While you're at it, consider whether the program is earning its place at all.
Maths is in the sample size guide; same formulas apply here. One thing to hold in mind: your control group is the bottleneck. Treatment runs at full volume; control size determines statistical power — the ability to detect a real effect rather than miss it as noise.
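If you want the back-of-envelope version rather than the full guide, here's a sketch using the standard normal approximation for comparing two proportions. The 4% baseline purchase rate, 5% expected relative lift, and 90/10 split are assumptions for illustration.

```python
import math
from scipy.stats import norm

def control_size_needed(baseline_rate, relative_lift, ratio=9.0,
                        alpha=0.05, power=0.80):
    """Approximate control-arm size for a two-proportion test.

    baseline_rate: conversion rate expected in the control arm
    relative_lift: relative lift the program is expected to add (0.05 = 5%)
    ratio: treatment users per control user (9.0 ≈ a 90/10 split)
    Unpooled normal approximation; treat the result as a floor, not a target.
    """
    p_control = baseline_rate
    p_treatment = baseline_rate * (1 + relative_lift)
    delta = p_treatment - p_control
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    variance = (p_treatment * (1 - p_treatment) / ratio
                + p_control * (1 - p_control))
    return math.ceil((z_alpha + z_power) ** 2 * variance / delta ** 2)

# 4% baseline purchase rate, hoping for a 5% relative lift, 90/10 split:
print(control_size_needed(0.04, 0.05))  # roughly 84,000 control users
```

That order of magnitude is why marginal expected lifts push the holdout toward 20% or beyond: the control arm has to get there somehow.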
The harder problem is human, not statistical. Someone in finance or growth will look at the holdout and see lost revenue: 10% of eligible users not getting the program means 10% of the program's sends not happening. The framing that works is risk management, not measurement cost. A program with no incrementality number has unknown value — nobody can tell whether trimming 10% of the budget hurts anything or saves money. A program with a measured lift produces a defensible figure that survives skeptical questioning. Those 10–15% "lost" sends are the price of measurement, and the figure they buy you is usually worth far more in budget defensibility than the sends themselves would have generated.
How long to run it — and the number your CFO actually wants
Run the test long enough that your measured effect represents real lift, not a short-term shift in timing. Right duration is set by the metric, not the calendar:
Short effects (purchase within 7 days): 2–4 weeks of assignment, then analyse.
Medium effects (purchase within 30 days): 4–8 weeks.
Long effects (retention, subscription renewal): 3–6 months.
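One piece of arithmetic those durations hide: the last cohort you assign still has to age through the full measurement window before you read the test, so total calendar time is assignment period plus window. A trivial sketch:

```python
def minimum_test_length_days(assignment_days: int, window_days: int) -> int:
    """Calendar time before the final read: keep assigning for
    `assignment_days`, then let the last-assigned cohort age through
    the full measurement window."""
    return assignment_days + window_days

# Four weeks of assignment on a 30-day purchase window:
print(minimum_test_length_days(28, 30))  # 58 days before the read
```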
Now the numbers themselves. Headline figure is revenue per user: total revenue in the window divided by users in each group, treatment versus control. Your CFO wants a per-head impact they can multiply by your eligible population to size the program's genuine contribution. Secondary is conversion rate, the percentage making any purchase, which tells you whether lift is driven by more buyers or bigger baskets. Collect program-level engagement (opens, clicks, attributed conversions) for context, but don't let attribution numbers confuse the incrementality read — they measure different things on purpose.
Formula: (revenue/user in treatment − revenue/user in control) / revenue/user in control. A 5% incremental lift means the program genuinely added 5% to per-user revenue over what would have happened without it.
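A sketch of that readout, assuming per-user revenue can be pulled per arm from the warehouse; the simulated arrays, revenue values, and the 500,000-user eligible population are all illustrative.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)

# Per-user revenue over the measurement window: mostly zeros, a tail of buyers.
# Simulated here; in practice this comes from your warehouse, one row per user.
treatment_revenue = rng.choice([0.0, 40.0, 120.0], size=90_000,
                               p=[0.952, 0.040, 0.008])
control_revenue = rng.choice([0.0, 40.0, 120.0], size=10_000,
                             p=[0.956, 0.037, 0.007])

rpu_treatment = treatment_revenue.mean()
rpu_control = control_revenue.mean()
relative_lift = (rpu_treatment - rpu_control) / rpu_control

# Welch t-test on per-user revenue (unequal variances, unequal group sizes).
t_stat, p_value = ttest_ind(treatment_revenue, control_revenue, equal_var=False)

eligible_population = 500_000  # annual eligible users, illustrative
incremental_revenue = (rpu_treatment - rpu_control) * eligible_population
print(f"lift={relative_lift:.1%}  p={p_value:.3f}  "
      f"incremental≈${incremental_revenue:,.0f}/yr")
```

The last line is the CFO number: per-head lift multiplied across everyone the program would touch in a year.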
Where this falls apart — and the benchmarks worth comparing against
Control contamination. Holdout users still receiving the program through another channel — a push notification, an in-app banner, a Meta ad. Audit every trigger and suppression rule. Audit whether other programs overlap with the one under test. If your placebo arm is secretly getting the drug through a side door, the test reads zero lift even when the program works.
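One way to catch contamination is to join the control assignment list against every send and delivery log, across every channel, for the test window. A minimal sketch; the record shape, field names, and campaign prefix are made-up assumptions, not any platform's export format.

```python
# Every program send during the test window, across email, push, in-app.
send_log = [
    {"user_id": "u_1002", "channel": "push", "campaign": "winback_60d_push"},
    {"user_id": "u_2044", "channel": "email", "campaign": "newsletter_weekly"},
]
assignments = {"u_1001": "treatment", "u_1002": "control", "u_2044": "control"}

control_ids = {uid for uid, arm in assignments.items() if arm == "control"}
contaminated = {
    row["user_id"]
    for row in send_log
    if row["user_id"] in control_ids and row["campaign"].startswith("winback")
}
print(f"{len(contaminated)} contaminated control users "
      f"({len(contaminated) / len(control_ids):.1%} of the holdout)")
```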
External events. A product launch, sale, or news cycle during the test affects both groups — but unevenly if the groups are already behaviourally different. Balance on pre-test characteristics where possible (purchase history, tenure, segment) so both arms are genuinely matched at the start.
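A simple way to check that balance is the standardized mean difference on each pre-test covariate; absolute values above roughly 0.1 are the usual flag. A sketch, assuming you can pull a pre-test spend figure per user (the gamma draws below are stand-ins for real warehouse data):

```python
import numpy as np

def standardized_mean_difference(treatment_values, control_values):
    """SMD on one pre-test covariate, e.g. trailing-90-day spend."""
    t = np.asarray(treatment_values, dtype=float)
    c = np.asarray(control_values, dtype=float)
    pooled_sd = np.sqrt((t.var(ddof=1) + c.var(ddof=1)) / 2)
    return (t.mean() - c.mean()) / pooled_sd

rng = np.random.default_rng(7)
smd = standardized_mean_difference(rng.gamma(2.0, 30.0, 90_000),
                                   rng.gamma(2.0, 30.0, 10_000))
print(f"SMD on pre-test spend: {smd:+.3f}")  # near zero when arms are balanced
```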
Measurement window too short. Measuring 7-day purchase rate on a program that drives long-term retention misses the effect entirely. Match your window to the mechanism — what is the program actually trying to do, and how long does that effect take to show up?
Post-hoc metric shopping. A win-back flow might not lift 30-day purchase rate but might cut unsubscribe rate. Pre-register what you'll measure. Picking the winning metric after you've seen results is how teams convince themselves of things that aren't true. Same discipline pre-registration enforces in clinical trials, and for the same reason.
Rough benchmarks for a healthy incrementality number, by program type, so you have something to compare against. Welcome flow: 5–15%. Win-back: 3–10%. Cart abandonment: 8–20%. Browse abandonment: 5–15%. Newsletters: 1–3% directly plus indirect effects. Your program against those bands tells you where it stands. Absolute numbers drift with audience and category, so treat the bands as orientation, not target.
One more thing worth saying plainly. Skip incrementality testing on transactional mail — password resets, order confirmations, shipping notifications. Nobody is holding back a password reset for measurement, and an order confirmation isn't earning its keep on incremental revenue anyway. Transactional value is functional, not promotional. Incrementality testing belongs in lifecycle programs where there's a genuine choice about whether to send.
If your measured lift comes back smaller than expected, treat that as information, not failure. Either the program is less impactful than attribution claimed (common; trim budget or rebuild), or the test was underpowered or contaminated (audit methodology). Small or null incrementality on programs "everyone knows work" is the single most common finding in this entire discipline — and, unsurprisingly, the most valuable one. Whatever number survives the holdout is the number you can defend.
Every major lifecycle program deserves at least an annual incrementality read to calibrate the attribution number it's been quietly generating all year.
Frequently asked questions
- What is incrementality testing?
- A measurement approach where a random fraction of eligible users are held out of a lifecycle program entirely — the placebo arm. The difference in outcome metric (revenue per user, retention, LTV) between exposed and holdout users is the incremental impact: what the program actually caused, vs what would have happened anyway. Without incrementality testing, attributed revenue mixes program contribution with organic behaviour, typically overstating program impact by 2–4x.
- How is incrementality testing different from A/B testing?
- An A/B test compares within the exposed population (variant A vs variant B, both receive the program). An incrementality test compares across exposure (program vs no program). A/B answers "is this variant better?"; incrementality answers "is this program creating value vs the baseline?". Every major lifecycle program should have both layered: incrementality to prove the program is worth running at all, A/B to tune it internally.
- How long should an incrementality test run?
- Long enough for the outcome metric to accumulate across the expected decision cycle. For a winback program measuring 90-day revenue per user: run for 90 days at minimum. For a welcome program measuring day-30 activation: run for 30 days. Cutting short because "it looks significant early" is the peeking mistake that invalidates results. Pre-compute the measurement window based on the metric's observation cycle and commit to it.
- What metric should incrementality testing measure?
- The business outcome the program claims to move — usually revenue per user, retention rate, or LTV. Measuring engagement (opens, clicks) isn't incrementality; exposed users engage more trivially because they receive mail, but that engagement may not translate to revenue. Always measure the downstream business metric, not the program's direct output. "Our winback drives 3% more opens" is trivial. "Our winback drives $X more revenue per dormant user" is actionable.
- When should I run incrementality testing?
- Once per major program at launch (prove it works vs nothing), then annually to verify the program is still generating value (programs degrade as audiences saturate or competitive dynamics shift). Running incrementality continuously on every program is too expensive — it permanently holds back revenue. Running it never means you're flying blind on whether the programs you fund actually create value. Annual cadence, on the programs representing the largest share of CRM spend, is the operator standard.
Related guides
Price-testing through email: what's testable, what isn't
Email is the fastest place to try a new price, and the easiest place to learn the wrong lesson. What you can test cleanly, what you can't, and the measurement traps that quietly turn price tests into expensive false positives.
Holdout group design: the incrementality tool most lifecycle programs skip
Without a holdout, lifecycle ROI is attribution-model guesswork with a spreadsheet. With one, you get a defensible number you can actually put in front of finance. Here's how to size, run, and read a holdout — and the three mistakes that quietly invalidate the result.
Sample size: the calculation everyone gets wrong in email A/B tests
Most email A/B tests are powered to detect effects far larger than the test could actually produce. The result: false positives and false nulls, with confident conclusions in both directions. Sample size calculation fixes this before you send. Takes 5 minutes. Here's the 5-minute version.
False positives in email A/B tests: why half of winning tests don't actually win
Run enough A/B tests and some will show 'significant' lift from pure noise. Programs that ship every significant winner end up with a collection of imaginary improvements they can't tell apart from real ones. Here's how to spot the fakes and avoid the trap.
Segment-based testing: when your average lift is hiding opposing effects
A winning A/B test with 4% aggregate lift might be a 20% win in one segment and a 10% loss in another. The aggregate is an average of opposing effects. Segment analysis catches it — and lets you ship the win to the segments that benefit while not shipping the loss to the ones that don't.
A/B testing in email: sample size, novelty, and what to report
Most email A/B tests produce winners that don't reproduce. Three reasons keep showing up: under-powered samples, the novelty effect, and weak readout discipline. This guide is about designing tests that actually drive decisions instead of theatre.