Segment-based testing: when your average lift is hiding opposing effects
Picture the morning after a clean A/B test — variant A versus variant B, randomly split between two halves of your list, the standard way of comparing two versions of an email. Variant B won by 4%. p=0.03 (the p-value; anything under 0.05 is the usual bar for calling a result real rather than noise). Ship it, right? Then you slice the data: variant B was a 20% win among new users and a 10% loss among long-tenured ones. The 4% headline was a weighted average of opposite effects. This guide is about catching that before you ship the loss along with the win.

By Justin Williames
Founder, Orbit · 10+ years in lifecycle marketing
The 4% lift that's actually two opposite results stitched together
Picture the standard test readout — the post-test summary your tool spits out the morning after. One number on top: variant B lifted opens 4%. Significance bar green. Done. Ship it. The thing that single number quietly buries is that your list isn't one audience. It's several — new versus returning, high-value versus low-value, engaged versus dormant — and a single change often hits each of those groups very differently.
Take a "limited time" subject line. New users — people who joined in the last 30 days — might respond well; they don't know your cadence yet, so the urgency reads as real. Long-tenured users — twelve months in or longer — might respond worse; they've seen "limited time" enough times that it reads as manipulative. The aggregate lift is those two effects partly cancelling, with a sliver left over.
Or a longer email versus a shorter one. High-intent users — the ones already shopping — engage more with the longer version because they want detail. Casual browsers engage less because they want to scan. Aggregate is the net. Segment effects are pointing in opposite directions. Ship the aggregate winner to everyone and you ship the loss alongside the win.
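To see how that cancellation produces a 4% headline, here's a minimal sketch with invented segment shares and open rates (purely illustrative numbers, not a recommendation):

```python
# Invented numbers, chosen only to show opposing segment effects
# netting out to a small aggregate lift.
segments = {
    #           (share of list, control open rate, variant B open rate)
    "new":      (0.47, 0.200, 0.240),   # +20% relative lift
    "tenured":  (0.53, 0.200, 0.180),   # -10% relative loss
}

control   = sum(share * c for share, c, _ in segments.values())
variant_b = sum(share * b for share, _, b in segments.values())

for name, (share, c, b) in segments.items():
    print(f"{name:8s} lift: {b / c - 1:+.1%}")
print(f"aggregate lift: {variant_b / control - 1:+.1%}")   # roughly +4%
```

Shift the segment shares and the same two effects produce a bigger or smaller headline; the aggregate number is as much a fact about your list composition as about the variant.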
The "winner" of an A/B test is usually a winner for some users and a loser for others. Treating it as a universal best practice ships the loss along with the win. Segment analysis lets you keep the win and skip the loss.
Five slices of your list worth looking at every time
Run the main test across all users — full population, randomly split, the way every test is normally designed. Once it reaches significance, slice the result by these five dimensions. Three to five cuts is the right number; go further and you're fishing — running enough comparisons that one of them is bound to look interesting by chance.
Tenure. New (under 30 days on the list) versus established (over 90 days) versus tenured (over 12 months). The most common, most informative split. Different tenure cohorts behave differently almost every time — fresh subscribers haven't built tolerance to your patterns yet, and that gap is where the action sits.
Engagement level. Highly engaged (opened 50%+ of the last 10 emails) versus low engaged. Engaged users respond to different signals than occasional ones — often the longer, denser message wins on the engaged cut and loses on the casual one.
Purchase history. First-time versus repeat buyers. Different offers and tones work for different phases of the customer relationship. A discount that excites a new buyer can feel insulting to a loyal one.
Device or email client. Mobile-dominant versus desktop. Often reveals that "winning" variants only actually win on one of the two — and the other is carrying the loss quietly.
Acquisition channel. Paid social versus organic versus referral. Channel correlates with user profile, which correlates with response patterns. A subject line that resonates with TikTok signups can fall flat on people who came through SEO.
How many cuts is the right number? Three to five meaningful ones, run after the test concludes — what statisticians call post-hoc analysis (analysis you do after the fact rather than committing to in advance). The test itself needs to be powered for the aggregate; segment cuts are exploration, not part of the original design. If a specific segment hypothesis is genuinely worth acting on, run a follow-up test pre-registered — declared in advance — with adequate sample size for that segment alone.
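Sizing that follow-up is a five-minute calculation. A minimal sketch using the standard normal-approximation formula for two proportions (the rates below are invented, and the function name is just a placeholder):

```python
from math import ceil, sqrt
from scipy.stats import norm

def n_per_arm(p_control, p_variant, alpha=0.05, power=0.80):
    """Approximate sample size per arm for a two-proportion test."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    p_bar = (p_control + p_variant) / 2
    top = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
           + z_power * sqrt(p_control * (1 - p_control)
                            + p_variant * (1 - p_variant))) ** 2
    return ceil(top / (p_control - p_variant) ** 2)

# Hypothetical: tenured users open at 20%, and the exploratory cut suggested
# a 10% relative drop (to 18%). Detecting that reliably needs roughly
# 6,000 users per arm, within that segment alone.
print(n_per_arm(0.20, 0.18))
```

If the segment can't supply that many users in a reasonable window, the honest answer is that the segment finding stays a hypothesis.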
Walking the segment cut, step by step
Once your test has hit its pre-committed sample size — the number of users you decided up front would be enough to call the result — work through this in order:
1. Check the aggregate. Winner, lift, significance. Standard step. Don't skip it just because you know about segments now.
2. For each meaningful segment, re-run the same comparison restricted to just that group. What's the lift? Is it significant at a tighter threshold (more on that below)? Is the effect size — the magnitude of the lift — similar to the aggregate, or different? A sketch of this loop follows the list.
3. Flag any segment where the effect size differs meaningfully from the aggregate. A segment with +20% lift when the aggregate is +5% is interesting. A segment with -10% when the aggregate is +5% is extremely interesting — and is the entire reason segment analysis exists as a discipline.
4. For cross-cutting opposing effects, ask the harder question: is the aggregate winner really the right answer for the whole list, or should different segments get different treatments?
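Here's a minimal sketch of steps 1 through 3, assuming a per-recipient export with a variant column, an opened flag, and one column per segment dimension (the file name and column names are placeholders, not a required schema):

```python
from math import sqrt
from scipy.stats import norm
import pandas as pd

def lift_and_p(opens_a, n_a, opens_b, n_b):
    """Two-proportion z-test: relative lift of B over A, plus a two-sided p-value."""
    p_a, p_b = opens_a / n_a, opens_b / n_b
    pooled = (opens_a + opens_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return p_b / p_a - 1, 2 * norm.sf(abs((p_b - p_a) / se))

def compare(g):
    a, b = g[g["variant"] == "A"], g[g["variant"] == "B"]
    return lift_and_p(a["opened"].sum(), len(a), b["opened"].sum(), len(b))

df = pd.read_csv("test_results.csv")   # placeholder: one row per recipient

# Step 1: the aggregate readout.
lift, p = compare(df)
print(f"aggregate: lift {lift:+.1%}, p={p:.3f}, n={len(df):,}")

# Steps 2-3: the same comparison, restricted to each segment in turn.
for col in ["tenure", "engagement", "device"]:   # three to five cuts, no more
    for name, g in df.groupby(col):
        lift, p = compare(g)
        print(f"{col}={name}: lift {lift:+.1%}, p={p:.3f}, n={len(g):,}")
```

Reading the output is mostly a matter of scanning for segment rows whose sign or magnitude disagrees with the aggregate line, which is exactly what the remaining steps act on.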
How to defend against segment false positives in practice: use tighter significance thresholds (p=0.01, or Bonferroni-corrected); require the segment effect to be both significant and materially larger than the aggregate; and validate any unexpected segment finding with a follow-up test specifically designed for that segment before acting on it.
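In code, those guardrails are only a few lines. A sketch, assuming the per-segment lifts and p-values from the slice loop above; the 10-point gap against the aggregate is an illustrative cutoff, not a standard:

```python
N_CUTS = 4                          # how many segment comparisons you actually ran
bonferroni_alpha = 0.05 / N_CUTS    # 0.0125 here; stricter than the aggregate bar

def worth_acting_on(segment_lift, segment_p, aggregate_lift, min_gap=0.10):
    """Flag a segment only if it clears the corrected threshold AND sits a
    material distance from the aggregate (10 points, by assumption)."""
    return segment_p < bonferroni_alpha and abs(segment_lift - aggregate_lift) >= min_gap

# A -10% segment loss against a +4% aggregate, at p=0.004: worth a follow-up test.
print(worth_acting_on(segment_lift=-0.10, segment_p=0.004, aggregate_lift=0.04))  # True
```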
What to do when the segments tell you different stories
When the segment effect agrees with the aggregate: the winner is a universal winner. Ship it to everyone. Move on.
When segments differ meaningfully: targeted ship. Winner to the segments that benefit, control (or further testing) for the segments that lost. Different treatments for different audiences based on what actually works for each — which is the whole point of having segments in the first place.
When the losing segment is the most valuable one: reconsider the aggregate "winner" entirely. A 4% headline lift that comes with a 10% loss among your high-LTV users — high lifetime value, the customers who'll spend the most over their relationship with you — might be net-negative on revenue once you weight by what each user is worth. Re-run the analysis weighted by user value, not user volume. This is where "we shipped the winner" and "we shipped the wrong thing" become the same decision.
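A sketch of that value-weighted re-run, with invented per-segment numbers (segment sizes, rates, and per-conversion values are all placeholders). The point it illustrates: the same set of lifts can look net-positive counted by users and net-negative counted by revenue.

```python
import pandas as pd

# Invented per-segment readout: conversion lift plus what a conversion is worth.
seg = pd.DataFrame({
    "segment":        ["new", "established", "high_ltv"],
    "users":          [40_000, 45_000, 15_000],
    "baseline_rate":  [0.020, 0.025, 0.060],
    "relative_lift":  [0.20, 0.02, -0.10],
    "value_per_conv": [45.0, 55.0, 220.0],
})

# Volume-weighted: the view behind the headline number.
seg["extra_conversions"] = seg.users * seg.baseline_rate * seg.relative_lift
# Value-weighted: what the change is actually worth.
seg["revenue_impact"] = seg.extra_conversions * seg.value_per_conv

print(seg[["segment", "extra_conversions", "revenue_impact"]])
print(f"net conversions: {seg.extra_conversions.sum():+.0f}")   # slightly positive
print(f"net revenue:     {seg.revenue_impact.sum():+,.0f}")     # clearly negative
```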
Should you ship a winner when one segment lost? Depends on the segment. Small, low-value segment with a modest loss? Ship the aggregate winner and move on. Large or high-value segment with a substantial loss? Targeted ship: winner to the benefiting segments, control for the losing one. The worst option is to ship the aggregate "winner" universally and pretend the segment-level cost didn't exist. The VIP lifecycle guide covers why high-value segments often need different treatment than programs optimised for the average user.
The same patterns show up across almost every program
Some segment patterns repeat so often they're worth pre-empting. Watch for these and you'll spot them faster.
New users love urgency; tenured users are tired of it. Urgency tactics that lift new-user conversion often reduce tenured-user engagement. Targeted ship: urgency for new, restraint for tenured.
Engaged users tolerate more frequency; dormant users don't. More email lifts revenue from engaged segments and accelerates unsubscribes from dormant ones. Higher cadence for engaged, lower or paused for dormant — which is also the cadence guide's core point.
Mobile wants short; desktop tolerates long. Long-form emails test well on desktop and worse on mobile. Rarely worth a device-specific ship — design for mobile-first (see the mobile design guide) and make sure desktop still works.
Repeat buyers respond to personalisation; first-time buyers respond to social proof. A repeat buyer wants to feel known; a first-timer wants reassurance that others bought successfully. Same campaign, different hero copy by segment.
A practical rule: for any test that reaches aggregate significance, run the segment slice. It takes minutes and often surfaces the real story. For null-result tests — ones that didn't produce a clear winner — segment analysis is less urgent. A null aggregate usually means null segments too, with the occasional exception where a real segment effect was hidden by an offsetting effect somewhere else. Small-list programs (under, say, a few thousand subscribers per cell — "cell" meaning each variant's allocated audience)? Segment analysis becomes largely descriptive — directional signal, not statistically conclusive. Use it to generate hypotheses for future tests rather than to make ship/no-ship calls on underpowered slices.
A healthy testing program treats segment analysis as a standard post-test step, not an optional extra. Every significant test gets at least one segment-level slice reviewed before shipping. The cases where segments behave differently from the aggregate are some of the highest-value findings a program produces — and they're sitting one query away from the test you already ran.
Related guides
Sample size: the calculation everyone gets wrong in email A/B tests
Most email A/B tests are powered to detect effects far larger than the test could actually produce. The result: false positives and false nulls, with confident conclusions in both directions. Sample size calculation fixes this before you send. Here's the five-minute version.
Price-testing through email: what's testable, what isn't
Email is the fastest place to try a new price, and the easiest place to learn the wrong lesson. What you can test cleanly, what you can't, and the measurement traps that quietly turn price tests into expensive false positives.
False positives in email A/B tests: why half of winning tests don't actually win
Run enough A/B tests and some will show 'significant' lift from pure noise. Programs that ship every significant winner end up with a collection of imaginary improvements they can't tell apart from real ones. Here's how to spot the fakes and avoid the trap.
Incrementality testing: the measurement that tells you if a program actually works
Last-click attribution makes lifecycle look bigger than it is. Incrementality testing strips out users who would have converted anyway and surfaces the real number. This is how to design a test that produces a figure you can defend in front of a CFO.
A/B testing in email: sample size, novelty, and what to report
Most email A/B tests produce winners that don't reproduce. Three reasons keep showing up: under-powered samples, the novelty effect, and weak readout discipline. This guide is about designing tests that actually drive decisions instead of theatre.
Holdout group design: the incrementality tool most lifecycle programs skip
Without a holdout, lifecycle ROI is attribution-model guesswork with a spreadsheet. With one, you get a defensible number you can actually put in front of finance. Here's how to size, run, and read a holdout — and the three mistakes that quietly invalidate the result.