Updated · 9 min read
Holdout group design: the incrementality tool most lifecycle programs skip
A holdout is the lifecycle equivalent of a clinical trial's placebo arm: a random slice of your audience that gets no marketing messages for a measurement period, sitting there as the baseline you can compare your messaged audience against. The revenue gap between the two groups is incremental lift — the bit your program actually produced versus the bit that was coming anyway. It's the most defensible measurement in lifecycle marketing. It's also the most skipped. This guide walks through how to size one, run one, and read one — the way it'll hold up when finance starts pulling threads.

By Justin Williames
Founder, Orbit · 10+ years in lifecycle marketing
Picture the problem first: who would have bought anyway?
Attribution tells you which touchpoint got credit. Incrementality tells you whether the touchpoint produced revenue. Two different questions. Two different answers.
Imagine you send a winback email — a message designed to bring back lapsed users — to 100,000 dormant customers, and 3,000 of them come back and buy. Lovely number, ships well in a deck. But the question finance will eventually ask is the awkward one: how many of those 3,000 were going to come back anyway, with or without the email? If the answer is 2,800, your program produced 200 incremental customers, not 3,000. The deck stops shipping quite so well.
This is the gap a holdout closes. Attribution models — first-touch, last-touch, multi-touch (different rules for which marketing touchpoint gets credit when a sale happens) — divide credit across touchpoints that occurred before a conversion. They do not answer the causal question: would the conversion have happened without any of those touchpoints? For lifecycle, the honest answer is "yes, for a meaningful share of revenue." Users who would have returned, renewed, or bought anyway show up in attribution reports as lifecycle wins. Nobody's fault, exactly. Nobody's telling you either.
A holdout strips the confusion at the root. Random assignment — splitting users by something like a coin flip rather than by who they are or how they behave — means the holdout group is, on average, identical to the messaged group on every dimension except message exposure. The revenue gap is incremental by construction. No attribution model required. No debate with finance about which model is "right."
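To make that concrete, here is a minimal sketch in Python with made-up numbers echoing the winback example above. The split sizes, buyer counts, and conversion rates are assumptions, not data from a real program.

```python
# Illustrative only: reading incremental lift from a clean holdout split.
# All figures below are hypothetical.

treated_users = 95_000    # users eligible and messaged
treated_buyers = 3_000    # buyers observed in the messaged group
holdout_users = 5_000     # users held out from all marketing sends
holdout_buyers = 147      # buyers observed in the holdout

treated_rate = treated_buyers / treated_users    # conversion with messaging
baseline_rate = holdout_buyers / holdout_users   # conversion with no messaging at all

# Incremental conversion rate is simply the gap between the two groups.
incremental_rate = treated_rate - baseline_rate

# Scale the gap back up to the messaged audience to get incremental buyers.
incremental_buyers = incremental_rate * treated_users

print(f"Baseline rate:      {baseline_rate:.2%}")
print(f"Messaged rate:      {treated_rate:.2%}")
print(f"Incremental buyers: {incremental_buyers:,.0f} of {treated_buyers:,} observed")
```

With these placeholder numbers, roughly 200 of the 3,000 observed buyers are incremental, which is exactly the gap the winback example describes.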
How big should the holdout be?
5–10% of the eligible audience is the operator default, and for good reasons. 10% reaches statistical power — the test's ability to detect a real effect rather than missing it as noise — faster. 5% forgoes less revenue while the test runs, which matters if the program turns out to be a winner. For a program with tight monthly revenue targets, start at 5%. For a program where you're genuinely testing whether to keep running it, 10% gets you to the answer sooner.
The holdout has to be large enough to detect the effect size — how big the difference between the two groups needs to be before you trust it isn't random — that you actually care about. Ballpark: to detect a 10% incremental lift at 95% confidence (the standard threshold for "this almost certainly isn't a fluke") over a quarter, you typically need around 5,000 users in the holdout. The exact number moves with your baseline conversion rate. Below 5,000 the test is underpowered, and you'll land on a number that isn't reliably different from zero — which is worse than running no test at all, because people will still quote it.
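If you want to sanity-check the ballpark yourself, the standard two-proportion sample-size formula is enough. The sketch below assumes a 20% baseline conversion rate and 80% power; both are placeholders to swap for your own figures, and the answer moves considerably as the baseline moves.

```python
# Back-of-the-envelope holdout sizing using the standard two-proportion formula.
# Baseline rate, lift, and power below are assumptions, not recommendations.
from statistics import NormalDist
from math import sqrt, ceil

def holdout_size(baseline_rate: float, relative_lift: float,
                 alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed per group to detect `relative_lift` over `baseline_rate`."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 at 95% confidence
    z_power = NormalDist().inv_cdf(power)           # ~0.84 at 80% power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# Example: a 20% baseline quarterly purchase rate, hunting for a 10% relative lift.
print(holdout_size(baseline_rate=0.20, relative_lift=0.10))  # roughly 6,500 per group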
The Orbit Experiment Design skill handles the power calculation against your specific baseline and expected lift. Skip that step and you're running a test that can't answer the question it was built to ask.
Three rules for assignment that keep the result honest
How you put users into the holdout matters more than most teams realise. All three of the rules below can break silently — the test will keep running, the dashboard will keep producing numbers — and you won't find out until someone asks a hard question in a quarterly business review.
Stable assignment. A user assigned to the holdout today stays in the holdout for the full measurement period. Users flickering in and out of the holdout contaminate the read — you're no longer comparing two clean groups, you're comparing two blurry ones. Use a persistent random integer — a number attached to each user once and never changed. Braze's Random Bucket Number (assigned once per user, from 0–9999, and never reshuffled) is built for exactly this. Don't use a random value that recalculates on every audience refresh.
Random assignment. "Users who haven't engaged in the last 30 days" is not a random cut. It's selection bias — picking users by a trait that already correlates with the outcome you're measuring — with a severity rating. The holdout has to be statistically equivalent to the treatment group on every dimension except message exposure. Random Bucket Number or equivalent. No shortcuts, no "but we only want to hold out the engaged users."
Global assignment. Exempting specific programs from the holdout — "we won't hold out onboarding because it feels cruel" — compromises the measurement. Either you hold out, or you don't. A global holdout means every marketing send respects the same exclusion list. Transactional sends — order receipts, password resets, the messages you'd send regardless of marketing — are exempt, obviously. Marketing isn't.
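For teams outside Braze, or anyone who wants to see why a persistent bucket satisfies all three rules at once, here is a rough sketch. Hashing the user ID produces a bucket that is random across users, stable for any one user, and checkable by every program against the same threshold. The function names and the 5% cut are illustrative assumptions, not a prescribed implementation.

```python
# Sketch of stable, random, global assignment for platforms without a built-in
# Random Bucket Number. Hashing the user ID gives every user a bucket that is
# random across users but never changes for the same user.
import hashlib

HOLDOUT_BUCKETS = 500  # buckets 0-499 out of 0-9999 gives a 5% global holdout

def bucket(user_id: str) -> int:
    """Deterministic bucket in 0-9999: the same user always lands in the same bucket."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 10_000

def in_holdout(user_id: str) -> bool:
    """Every marketing send checks this; transactional sends skip it."""
    return bucket(user_id) < HOLDOUT_BUCKETS

if __name__ == "__main__":
    for uid in ("user-1001", "user-1002", "user-1003"):
        print(uid, "holdout" if in_holdout(uid) else "messaged")
```

Because the bucket is derived from the user ID rather than drawn fresh on each audience refresh, assignment stays stable for the whole measurement period without storing any extra state.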
The three mistakes that invalidate the result
Mistake 1 — holdout leakage. Users in the holdout occasionally get mail because a broadcast ignored the flag — a one-off newsletter that bypassed the audience filter, a re-engagement campaign that pulled from a different list. Two percent leakage is enough to invalidate the measurement. You're no longer comparing messaged versus unmessaged — you're comparing heavily-messaged versus lightly-messaged, and the gap shrinks accordingly. Audit broadcasts monthly for holdout compliance. Every month. No exceptions.
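The audit doesn't need tooling beyond the send log. A rough sketch of the monthly check, with hypothetical file names and column names:

```python
# Monthly leakage audit, sketched: cross-reference the month's marketing send log
# against the holdout assignment and flag any holdout user who received a
# marketing message. File names and columns here are hypothetical.
import csv

def audit_leakage(send_log_path: str, holdout_path: str) -> None:
    with open(holdout_path, newline="") as f:
        holdout_ids = {row["user_id"] for row in csv.DictReader(f)}

    leaked_ids = set()
    with open(send_log_path, newline="") as f:
        for row in csv.DictReader(f):
            if row["message_type"] == "marketing" and row["user_id"] in holdout_ids:
                leaked_ids.add(row["user_id"])

    leakage_rate = len(leaked_ids) / len(holdout_ids) if holdout_ids else 0.0
    print(f"Holdout users messaged this month: {len(leaked_ids):,} "
          f"({leakage_rate:.2%} of the holdout)")

# Example call with hypothetical file names:
audit_leakage("sends_this_month.csv", "holdout_assignment.csv")
```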
Mistake 2 — seasonal confounds. A confound is something other than the message that changes the outcome — and Black Friday is the obvious one. A holdout that runs only in November will show enormous incremental lift, because the volume spike does the heavy lifting, not your emails. The number won't generalise to the rest of the year, but it'll get quoted as if it does. Run holdouts across full quarters or full years so you average seasonal effects out.
Mistake 3 — reading before statistical power. A two-week holdout result is almost never significant — there isn't enough volume yet for the gap between groups to be distinguishable from noise. Leadership asks for an update, the team produces a number anyway, and that number gets cited as the incrementality forever. The fix is simple and politically hard: don't publish interim reads. Publish once at the end of the measurement period, with the full analysis attached.
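If you need to show, rather than argue, why the early number isn't trustworthy, a quick two-proportion z-test does it. The counts below are invented to illustrate a typical two-week read:

```python
# Why an early read is usually noise: a two-proportion z-test on two weeks of
# hypothetical data. The counts are made up purely to illustrate the point.
from statistics import NormalDist
from math import sqrt

def two_proportion_p_value(x1: int, n1: int, x2: int, n2: int) -> float:
    """Two-sided p-value for the difference between two conversion rates."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Two weeks in: 95,000 messaged users, 5,000 held out, small early conversion counts.
print(two_proportion_p_value(310, 95_000, 14, 5_000))  # well above 0.05, so not significant
```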
Reading the number — and what to do when it's zero
A holdout produces a single most-important number: incremental revenue per user, the average extra dollars each messaged user generated above and beyond the holdout baseline. Multiply that by audience size for total program contribution. Divide by program cost for true ROI. This is the number that goes in front of finance and replaces the attribution-model figures that always, eventually, get questioned.
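The arithmetic is short enough to sketch end to end. Every figure below is a placeholder; the structure is what matters:

```python
# The readout in arithmetic form: per-user lift, then total contribution, then true ROI.
# All numbers are hypothetical placeholders.

messaged_revenue_per_user = 4.80   # average revenue per user in the messaged group
holdout_revenue_per_user = 4.10    # average revenue per user in the holdout

incremental_revenue_per_user = messaged_revenue_per_user - holdout_revenue_per_user

eligible_audience = 400_000        # users the program reaches at full rollout
program_cost = 60_000              # quarterly cost: tooling, content, send volume

total_contribution = incremental_revenue_per_user * eligible_audience
true_roi = total_contribution / program_cost

print(f"Incremental revenue per user: ${incremental_revenue_per_user:.2f}")
print(f"Total program contribution:   ${total_contribution:,.0f}")
print(f"True ROI:                     {true_roi:.1f}x")
```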
The retention economics guide covers how to frame this number in a CFO conversation. A defensible quarterly holdout study is usually more persuasive than six quarters of attribution spreadsheets, because it answers the causal question rather than the correlational one. Run one annually at minimum. Programs that run holdouts annually have budget conversations that go differently from programs that don't.
On the mechanics question that comes up constantly — in Braze, use Random Bucket Number filters. A fixed slice (RBN < 500 for a 5% holdout) is stable, random, and respected across every program in the instance. The same attribute that underpins IP warm-up is what you want here. It was designed for this job.
What if the holdout shows zero incremental lift? Don't panic. Don't bury it either. Zero is honest information, and it's worth investigating before it turns into a narrative. Is the program targeting users who were going to convert anyway? Is the offer too weak to change behaviour? Is the timing wrong? Zero is rarely the final answer on a well-designed program — but when it is, the program needs rethinking, not another quarter of the same cadence with a redesigned header image.
Frequently asked questions
- What is a holdout group in email marketing?
- A randomly selected share of the eligible audience explicitly excluded from a lifecycle program — they never receive the communications. Comparing their outcomes to exposed users reveals the program's true causal impact. Typical holdout size: 5–10% of eligible users for mature programs, 15–20% for programs still being validated. Holdouts are rotated quarterly so individual users don't sit in holdout forever.
- How big should a holdout group be?
- Big enough for statistical power on the primary metric, small enough not to waste revenue. For most lifecycle programs: 10% holdout gives adequate power within 4-8 weeks for primary metrics like revenue per user or retention rate. Smaller holdouts (5%) need longer measurement windows. Larger holdouts (20%+) measure faster but forgo more revenue during the holdout window. Pick based on volume: high-traffic programs can afford 5%; low-volume programs need 15-20%.
- How is a holdout group different from an A/B test control?
- Both randomly assign users. The difference: an A/B control receives the current treatment (not nothing); a holdout receives nothing. A/B measures "variant vs current state"; holdout measures "program vs no program." A/B tells you if a change is better than today; holdout tells you if the program itself is creating value vs the organic baseline. Both are essential at different decision points — A/B for tuning programs, holdout for justifying program existence.
- Should holdouts be permanent or rotated?
- Rotated, usually quarterly. Permanent holdouts create two problems: the held-out cohort diverges from the exposed cohort over time in ways that contaminate comparison, and the operator accumulates guilt about the "forever-missing-out" users. Quarterly rotation gives each user a fair chance at exposure and keeps the comparison clean by ensuring holdout and exposed cohorts stay demographically similar over time.
- How do I communicate holdout results to leadership?
- Report the incremental metric directly: "Our winback program drives $X incremental revenue per dormant user per quarter versus the no-send baseline." That single number is the business case. Avoid reporting "lift vs control" (which can be inflated by differences that existed before the program started) or cumulative revenue from program-exposed users (which mixes program contribution with organic behaviour). Incrementality vs holdout is the honest number.
This guide is backed by an Orbit skill
Related guides
Price-testing through email: what's testable, what isn't
Email is the fastest place to try a new price, and the easiest place to learn the wrong lesson. What you can test cleanly, what you can't, and the measurement traps that quietly turn price tests into expensive false positives.
Measuring AI personalisation lift honestly
Every vendor case study shows AI personalisation moving the numbers. Most internal post-mortems show the lift evaporating once a proper holdout is in place. The gap between the two is the measurement methodology. Here's the framework for proving — to yourself, your CFO, and the auditor — whether AI personalisation is actually earning its place.
Incrementality testing: the measurement that tells you if a program actually works
Last-click attribution makes lifecycle look bigger than it is. Incrementality testing strips out users who would have converted anyway and surfaces the real number. This is how to design a test that produces a figure you can defend in front of a CFO.
A/B testing in email: sample size, novelty, and what to report
Most email A/B tests produce winners that don't reproduce. Three reasons keep showing up: under-powered samples, the novelty effect, and weak readout discipline. This guide is about designing tests that actually drive decisions instead of theatre.
Attribution models for lifecycle: which one to defend in which room
Attribution debates are half epistemology, half politics. Last-touch is wrong but defensible. Multi-touch is more accurate but less defensible. Incrementality is the only one that answers the causal question — and it's the slowest. Here's which model to use for which question, and why.
Sample size: the calculation everyone gets wrong in email A/B tests
Most email A/B tests are powered to detect effects far larger than the test could actually produce. The result: false positives and false nulls, with confident conclusions in both directions. Sample size calculation fixes this before you send. Takes 5 minutes. Here's the 5-minute version.