Price-testing through email: what's testable, what isn't
A merchandiser, a CFO, or a head of growth has an idea for a new price, discount, or offer structure. Email is the quickest way to put it in front of users — fastest to build, fastest to read. The test ships. A winner is declared. Three months later the winning offer has been folded into the base program and nobody can reproduce the lift. This pattern is common enough to be the default expectation. Email price tests are uniquely prone to false positives — short reads, biased audiences, costs that hide outside the test window — and uniquely able to damage the program when the wrong lesson gets locked in.

By Justin Williames
Founder, Orbit · 10+ years in lifecycle marketing
The pattern: a winner that nobody can reproduce
Before getting into what email can and can't test, it helps to name the failure mode. A team runs a price test — say, 20% off vs the standard 10% off — through their lifecycle program (the automated, behaviour-triggered emails that follow signup, browsing, abandonment, and so on). The 20% variant wins on conversion. The team rolls it into the always-on cart sequence. Three months later, revenue is flat or down, and nobody can point to where the lift went.
The lift was real inside the test. It just wasn't durable, and it wasn't the same shape once it left the test. Most of this guide is about why that gap opens, and how to close it before you ship.
Email is a credible test environment for a narrow set of pricing questions. Whether offer X (20% off, one product) converts better than offer Y at the same message moment. Whether free-shipping framing beats dollar-off framing at the same effective discount. Whether a deadline, a free-trial extension, or a price-lock message outperforms the control version of the same campaign.
The common thread: each test is a single message moment, a narrow audience, and a conversion window that closes quickly. Inside those limits, email tests can be rigorous. Power them correctly. Read them cleanly. Ship the winner without risking the rest of the program.
What email cannot test: whether the underlying price should change. Whether the product is worth more or less than its current tag. Whether a subscription tier is correctly positioned. Those are pricing questions, not messaging questions. They need audiences, durations, and measurement machinery that an email test cannot give you. Treat email as one data point on those decisions, never the final answer.
Why the same three traps catch nearly every price test
Three mechanical problems push email price tests toward false positives — wins on the dashboard that don't survive contact with the rest of the program — at a rate much higher than other email tests. Each one is fixable once you know to look for it. Most teams don't look.
Novelty effect, amplified. Novelty effect is the temporary engagement bump anything new gets when it lands in someone's inbox, before regression to the mean kicks in. Price-related copy is unusual in a lifecycle program — most emails don't lead with a number — so when a variant does, engagement spikes on novelty alone for a few days. On a standard two-week test, that novelty window can occupy enough of the measurement period that the variant still looks like a winner after the bump fades.
Audience selection bias. Selection bias here means your test is reading a different population than your rollout will hit. Price tests in email tend to reach only the opened-the-email cohort, which is dramatically more engaged than the full audience. A discount that converts well for an engaged cohort will over-predict performance when rolled out to the full base. Measure conversion per sent (everyone who got the message), not per opened (the engaged subset), and confirm the test population matches the rollout population. Both steps are essential. Both are frequently skipped.
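A minimal sketch of the gap, using made-up counts; the per-opened rate flatters the variant because its denominator is the engaged subset, while per-sent is closer to what a full-base rollout will actually see:

```python
# Illustrative counts only; substitute your own send, open, and conversion numbers.
sent = 100_000       # everyone who received the variant
opened = 22_000      # the engaged subset most dashboards report against
converted = 1_100    # conversions attributed to the variant

per_opened = converted / opened   # the flattering number
per_sent = converted / sent       # the number the rollout will resemble

print(f"conversion per opened: {per_opened:.2%}")   # 5.00%
print(f"conversion per sent:   {per_sent:.2%}")     # 1.10%
```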
Cannibalisation. Cannibalisation is when the variant's lift comes from pulling forward conversions that would have happened anyway, rather than creating new ones. The lift looks real inside the test window — those users did convert on the variant — but it evaporates at program level, because the converting users would have converted at the control offer a week later. A 20% conversion lift that consists entirely of pulled-forward conversions is a timing change, not a revenue gain. Short-window tests rarely catch this; the post-test slump shows up after the readout has already been written.
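One way to check for pull-forward is to keep reading past the test window and compare cumulative conversions. The sketch below uses hypothetical weekly counts; if the variant's early lead shrinks toward zero in the weeks after the test, the lift was mostly a timing change.

```python
# Hypothetical conversions per 10,000 sent, by week. The test window is
# weeks 1-2; weeks 3-6 are the post-test period most readouts never see.
control = [60, 55, 58, 57, 56, 58]
variant = [85, 70, 40, 38, 52, 55]   # early spike, then the slump

def cumulative(series):
    total, out = 0, []
    for weekly in series:
        total += weekly
        out.append(total)
    return out

for week, (c, v) in enumerate(zip(cumulative(control), cumulative(variant)), start=1):
    print(f"week {week}: control {c:>3}  variant {v:>3}  gap {(v - c) / c:+.1%}")
# Week 2 reads as a ~35% winner; by week 6 the cumulative gap is roughly zero,
# which is the signature of pulled-forward conversions rather than new ones.
```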
The Orbit Experiment Design skill builds cannibalisation checks into the readout — holdout comparisons (a slice of the audience that gets nothing, used as the true baseline), payback-window modelling (how long until the discount earns back its margin) — so lifts that wash out at program level are flagged instead of shipped. The underlying experimentation discipline sits in the A/B testing guide — sample size, statistical power, novelty.
The number on the dashboard isn't the question
The conversion-lift number you read off the dashboard at the end of a test isn't the answer to the question that actually matters. It's the easiest thing to compute, which is why it gets shipped to leadership, which is why teams optimise for it. Three pieces, all of them usually missing, separate the dashboard number from the real one.
The question that matters is not "did this variant convert better". It's "did this variant produce more revenue than the control, net of the discount, measured over a period that captures what happened to the users who converted".
Net of the discount. A 20% discount that lifts conversion by 15% usually loses money at a unit level — you handed over more margin than you bought in incremental volume. Almost every conversion-lift number reads differently once you subtract the margin handed over to produce it. The dashboard rarely does that subtraction for you.
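The arithmetic, sketched with assumed numbers: a $100 product at 40% gross margin, a 5% baseline conversion rate, and the 20%-off variant from the earlier example beating the standard 10%-off control by 15% on conversion.

```python
# Assumed unit economics; replace with your own price, cost, and baseline rate.
price = 100.0
cogs = 60.0                   # cost of goods, so full-price margin is $40/unit
baseline_conversion = 0.05    # conversion at the standard 10% offer

control_margin_per_sent = baseline_conversion * (price * 0.90 - cogs)           # 0.05   * $30 = $1.50
variant_margin_per_sent = baseline_conversion * 1.15 * (price * 0.80 - cogs)    # 0.0575 * $20 = $1.15

print(f"control (10% off): ${control_margin_per_sent:.2f} margin per sent")
print(f"variant (20% off): ${variant_margin_per_sent:.2f} margin per sent")
# The 15% conversion lift does not buy back the extra ten points of discount.
```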
Over the right period. A 7-day conversion window — the standard end-of-test readout — on a price test misses retention and repurchase effects. Users who converted on a steep discount often retain worse than users who paid full price. They bought the price, not the product, and they leave when the price returns. Measure a 30 to 90-day window, or accept you're optimising for an intermediate metric rather than revenue. Most teams read the 7-day number and miss the 90-day reversal entirely.
Against the right counterfactual. A counterfactual is what would have happened without the test — the baseline you're comparing against. A proper price-test measurement needs a holdout: a random slice of the eligible audience that gets no offer at all, so you can see what they would have done. Without one, the test answers "variant vs control offer", which is a weaker question than "variant vs no offer". Plenty of discount campaigns beat their control variant while losing to the holdout, which is the moment you realise the whole thing was self-cannibalisation with extra steps.
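Putting the three pieces together, a sketch with hypothetical 90-day results for three random slices of the eligible audience. Revenue is already net of whatever discount each group redeemed, and the holdout sets the counterfactual; every number below is invented for illustration.

```python
# Hypothetical group-level results over a 90-day window, net of discounts redeemed.
groups = {
    "holdout (no offer)": {"users": 10_000, "conversions": 480, "net_revenue_90d": 41_000},
    "control (10% off)":  {"users": 10_000, "conversions": 540, "net_revenue_90d": 44_500},
    "variant (20% off)":  {"users": 10_000, "conversions": 620, "net_revenue_90d": 43_200},
}

holdout = groups["holdout (no offer)"]
holdout_per_user = holdout["net_revenue_90d"] / holdout["users"]

for name, g in groups.items():
    per_user = g["net_revenue_90d"] / g["users"]
    conv_rate = g["conversions"] / g["users"]
    print(f"{name:20s} conv {conv_rate:.1%}  net ${per_user:.2f}/user  "
          f"incremental {per_user - holdout_per_user:+.2f}/user vs holdout")
# With these made-up numbers the variant wins the conversion readout
# (6.2% vs 5.4%) but trails the control on 90-day net revenue, and both
# offers are worth far less against the holdout than against each other.
```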
The safer half: testing how the offer is framed, not what it costs
Not every test that touches price is dangerous. There's a specific subset that's actually well-suited to email: copy-level framing around a fixed underlying offer.
"Save $20" vs "20% off" at the same effective discount. "Limited time — ends Sunday" vs no deadline. "Your exclusive offer" vs generic framing. These are legitimate email tests because the underlying economics are identical — only the framing moves. Which means most of the traps above don't apply: there's no margin difference to net out, no unusual audience selection, no cannibalisation pulling forward different volumes of conversion. Use the significance calculator as normal and ship the winner.
Offer-level tests — changing the discount, the product mix, or the price tiers — can still run through email, but treat the result as first evidence, not verdict. Pair it with holdout data, a 30-plus-day retention window, and explicit margin accounting before anything gets declared a winner.
When the right answer is to not run the test
Two common situations where the honest answer is: don't run it through email at all. Knowing when not to test is part of the discipline.
The test can't answer the question you're asking. Statistical power — the test's ability to detect an effect of a given size — depends on audience volume per variant and how big a difference you're trying to spot. If your audience per variant is 5,000 and the effect that matters is a 3% lift, the test mathematically cannot answer the question. You'll get a number. It will not be signal. Any decision based on it is a coin flip wearing statistical vocabulary.
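The arithmetic behind that claim, as a sketch of the usual normal-approximation sample-size formula. The 5% baseline conversion rate, two-sided 5% significance level, and 80% power are assumptions, since they depend on your own program.

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate sample size per arm for a two-proportion test."""
    p1, p2 = baseline, baseline * (1 + relative_lift)
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_a * sqrt(2 * p_bar * (1 - p_bar))
                 + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

print(n_per_variant(baseline=0.05, relative_lift=0.03))   # ~336,000 per arm for a 3% lift
print(n_per_variant(baseline=0.05, relative_lift=0.25))   # ~5,300 per arm
```

At a 5% baseline, 5,000 users per arm can only resolve lifts of roughly 25% or more, several times the effect being asked about.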
The winning condition would damage the program. An aggressive discount variant that wins in email trains your audience to wait for discounts — they learn the brand always caves — and that training cost shows up months later as suppressed full-price conversion. It's hard to attribute back to the test that caused it, which is exactly why it keeps happening. Run these only if you're ready to either ship the winner everywhere or accept the training effect. If neither is acceptable, don't run the test.
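A back-of-envelope sketch of that training cost, with assumed volumes and margins: even a small, persistent shift of full-price buyers into discount-waiting erodes the one-off win quickly.

```python
# Assumed program-level numbers; swap in your own volumes and margins.
monthly_full_price_buyers = 4_000
margin_full_price = 40.0      # $ margin per full-price order
margin_discounted = 20.0      # $ margin per order at the aggressive discount

# One-off gain from the winning variant in the month it runs.
incremental_discounted_orders = 600
one_off_gain = incremental_discounted_orders * margin_discounted          # $12,000

# Training effect: 2% of full-price buyers now wait for the discount, every month.
trained_share = 0.02
monthly_training_cost = monthly_full_price_buyers * trained_share * (margin_full_price - margin_discounted)

print(f"one-off gain:          ${one_off_gain:,.0f}")
print(f"monthly training cost: ${monthly_training_cost:,.0f}")
print(f"gain erased after {one_off_gain / monthly_training_cost:.1f} months")   # 7.5 months, and the cost keeps running
```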
The Retention Economics skill models the downstream cost of discount-trained behaviour so it enters the decision explicitly instead of ambushing you in six months.
Related guides
A/B testing in email: sample size, novelty, and what to report
Most email A/B tests produce winners that don't reproduce. Three reasons keep showing up: under-powered samples, the novelty effect, and weak readout discipline. This guide is about designing tests that actually drive decisions instead of theatre.
Holdout group design: the incrementality tool most lifecycle programs skip
Without a holdout, lifecycle ROI is attribution-model guesswork with a spreadsheet. With one, you get a defensible number you can actually put in front of finance. Here's how to size, run, and read a holdout — and the three mistakes that quietly invalidate the result.
Sample size: the calculation everyone gets wrong in email A/B tests
Most email A/B tests are powered to detect effects far larger than the test could actually produce. The result: false positives and false nulls, with confident conclusions in both directions. Sample size calculation fixes this before you send. Here's the five-minute version.
False positives in email A/B tests: why half of winning tests don't actually win
Run enough A/B tests and some will show 'significant' lift from pure noise. Programs that ship every significant winner end up with a collection of imaginary improvements they can't tell apart from real ones. Here's how to spot the fakes and avoid the trap.
Send-time optimisation: what it really moves, and what it doesn't
Every ESP markets an STO feature and every vendor deck shows lift. The honest version: STO moves open rate 3–8%, rarely revenue, and only for certain program types. Here's when it's worth turning on.
Incrementality testing: the measurement that tells you if a program actually works
Last-click attribution makes lifecycle look bigger than it is. Incrementality testing strips out users who would have converted anyway and surfaces the real number. This is how to design a test that produces a figure you can defend in front of a CFO.