Updated · 8 min read
Measuring AI personalisation lift honestly
Picture the demo. A vendor walks you through their AI personalisation case study — software that picks who gets which message, when, with what content, based on a model rather than a hand-built rule. Charts go up and to the right, lift figures (the percentage improvement over a comparison group) in the double digits, the logo of a brand you recognise. Six months later you're the one running it and your readout is flat. That's the gap this guide is about: the difference between vendor numbers and the numbers you get when you measure your own programme honestly.

The defaults don't help you. Open rates have been broken since Apple Mail started pre-fetching emails in 2021. Self-reported attribution flatters the model. Vendor case studies only feature the customers who saw lift. This guide is the inversion of all that: the measurement framework that survives an honest audit and tells you whether to keep paying for the feature.

By Justin Williames
Founder, Orbit · 10+ years in lifecycle marketing
Why your last lift report was probably fiction
If your AI personalisation lift report can't survive the question "against what control group, over what period, on what metric?" — it isn't a lift report. It's vendor marketing with your logo on it.
Three failure modes account for almost every flattering AI personalisation report that doesn't hold up under inspection. They sound technical the first time you meet them. They're actually quite simple — each one is a way of comparing the wrong things and calling the difference progress.
Apple MPP open inflation. Apple Mail Privacy Protection — the iPhone Mail feature that pre-fetches images on Apple's servers before the user opens the email — registers an open whether or not a human ever looked. The user could be asleep. The user could be in a tunnel. Still counts as an open. So when an AI personalisation feature reports "open rate lift", what it's mostly measuring on Apple addresses is Apple's servers doing their job. The Apple MPP guide covers the mechanics. Practical implication: opens are not a valid metric for AI personalisation lift. Full stop.
Selection bias in vendor case studies. The customers who appear in the vendor deck are the ones who saw lift. Customers who didn't are not in the deck. So the base rate of "programmes where this feature produced no lift" is invisible to the buyer — the deck is selection-biased to success stories by construction. Practical implication: vendor benchmarks are upper bounds, not expected values. Plan for a fraction of the headline number and let your own data correct you.
Confounded comparisons. The AI version of a programme is compared to a previous version sent at a different time, to a different audience, with a different offer. Any one of those differences — a confound, anything that varies alongside the thing you're trying to measure — produces "lift" that has nothing to do with the AI. The summer numbers always beat the spring numbers in retail. That doesn't mean your model worked. Practical implication: lift requires a proper randomised holdout (a control group of users, picked at random, who receive the non-AI version) or it isn't lift, it's seasonality wearing a hat.
The four rules that turn a guess into evidence
Real measurement isn't complicated. It's four rules, each one closing off one of the ways the previous section's mistakes creep in. Run all four and the result is something you can defend in a board meeting. Skip one and you're back in vendor-deck territory.
Rule 1 — Hold out a random sample. A holdout is the slice of the eligible audience that does not receive the AI version of the programme — they get the non-AI experience instead, so you have something honest to compare against. Think of it like the placebo arm in a clinical trial. 10–20% is the usual range. The split must be random (not "everyone in California"), persistent (the same users stay in the holdout for the whole test, not reshuffled weekly), and large enough to detect a realistic lift size. The holdout group guide covers the design end to end.
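Here's a minimal sketch of that assignment logic in Python: hashing the user ID against a per-experiment salt gives a split that is random across users but stable per user. The salt, fraction, and function name are illustrative, not any specific vendor's API.

```python
import hashlib

HOLDOUT_FRACTION = 0.10  # within the usual 10-20% range
EXPERIMENT_SALT = "ai-personalisation-2025"  # fixed per test, so assignment never reshuffles

def in_holdout(user_id: str) -> bool:
    """Deterministic, persistent random assignment to the holdout.

    The same user ID always hashes to the same bucket, which gives the
    'persistent' property Rule 1 requires; across users the buckets are
    effectively uniform, which gives the 'random' property.
    """
    digest = hashlib.sha256(f"{EXPERIMENT_SALT}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform float in [0, 1]
    return bucket < HOLDOUT_FRACTION
```

One advantage of hashing over storing a random flag per user: the assignment can be recomputed anywhere, with no lookup table to drift out of sync between systems.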
Rule 2 — Measure on outcomes, not on signals. Conversion. Revenue per recipient. Retention. The actual things the programme exists to drive. Not opens (corrupted by Apple MPP), not clicks (a reasonable secondary metric, but a tell-tale rather than a verdict). If AI personalisation moves clicks but not conversion, that's the finding — the model is generating activity, not value. Worth knowing, not worth celebrating.
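As a sketch of what the per-arm readout computes under Rule 2 (the record shape and field names are illustrative, not a specific platform's export format):

```python
def outcome_metrics(arm: list[dict]) -> dict:
    """Rule 2 metrics for one arm: outcomes, not signals.

    Each record is assumed to look like {"converted": bool, "revenue": float}.
    """
    n = len(arm)
    return {
        "recipients": n,
        "conversion_rate": sum(u["converted"] for u in arm) / n,
        "revenue_per_recipient": sum(u["revenue"] for u in arm) / n,
        # Deliberately absent: open rate. Post-MPP it measures Apple's
        # servers, not your model.
    }
```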
Rule 3 — Run the test long enough. Most AI personalisation effects are smaller than the natural week-to-week variance of conversion data, which means short tests pick up noise and call it a result. 30 days is the floor. 60 days is the realistic minimum for stable readouts on conversion or revenue. Programmes that declare lift after 7 days are reading weather, not climate. The sample size guide covers the calculation.
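To make the duration question concrete, here's a hedged sketch of the sample-size calculation using statsmodels; the baseline rate and target lift are illustrative placeholders, not benchmarks.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.04            # illustrative: 4% conversion in the control
target = baseline * 1.08   # illustrative: the 8% relative lift you hope to detect

effect = proportion_effectsize(target, baseline)  # Cohen's h for two proportions

# Equal 50/50 split: recipients needed per arm at alpha=0.05, 80% power.
n_equal = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.8)

# 10% holdout vs 90% treatment: ratio=9 makes nobs1 the (smaller) holdout arm.
n_holdout = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, ratio=9.0
)
print(f"~{n_equal:,.0f} per arm at 50/50; ~{n_holdout:,.0f} in a 10% holdout")
```

Divide the required holdout size by your monthly eligible audience and the 30–60 day floor stops looking like caution and starts looking like arithmetic.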
Rule 4 — Pre-register the readout. Pre-registration means writing down the metric, the success threshold, the test duration and the decision criteria before the test starts — and committing to honour them. It's the discipline that prevents the post-hoc "maybe if we look at clicks instead of conversion" reframe that turns null results into apparent lift. Treat AI personalisation tests like real experiments, because they are.
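Pre-registration doesn't need tooling; a dated file in version control works. A sketch, with all values illustrative, committed before the first send:

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: the plan can't be quietly edited mid-test
class PreRegistration:
    hypothesis: str
    primary_metric: str
    success_threshold: str
    duration_days: int
    decision_rule: str

PLAN = PreRegistration(
    hypothesis="AI recommendations lift conversion vs the non-AI control",
    primary_metric="conversion rate per recipient",
    success_threshold=">=3% relative lift, 95% CI excluding zero",
    duration_days=60,
    decision_rule="keep if threshold met; kill if CI upper bound < 3%; else iterate",
)
```

The point is the timestamp, not the format. A dated document works just as well, as long as it exists before the first send.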
What honest lift actually looks like
Set the holdout up properly, measure on conversion, run for 60 days, and you'll see numbers that look quiet next to the vendor deck. That's the point — the vendor deck wasn't the truth. Here's the realistic range, by feature type, drawn from independent measurement and post-MPP studies.
Predictive churn save flows. A model identifies users likely to churn; a save flow tries to keep them. 5–15% reduction in churn rate within the targeted cohort, when the save flow itself is well-designed. The model picks the right people; the save flow does the actual saving. A great model with a weak save flow produces nothing.
Product or content recommendations. 3–8% lift in click-to-conversion in the targeted message slot. Bigger when the catalogue is large (1,000+ items) and the user has rich engagement history; smaller or zero when either is missing. AI recommendations have nothing to recommend on a thin catalogue.
Send-time optimisation. 3–8% open lift, 1–4% click lift, typically no significant revenue lift against a proper randomised holdout. The STO guide covers the honest version in detail.
AI subject-line generation. Variable, and the comparison matters. Compared to a human writing one subject line, generating five and A/B testing usually wins by 2–5% on click-through. Compared to a team that already runs disciplined subject-line testing, the model adds little — you were already doing the work it does.
Vendor case studies typically show 2–5x these numbers. The gap is methodology, not magic. Plan around the realistic range, treat the vendor figures as best case, and let the holdout tell you which it is.
The four mistakes that wreck almost every readout
If a readout looks too good, one of these four is usually why.
Comparing post-launch to pre-launch. "Conversion is up 12% since we turned on AI recommendations." This is a temporal comparison — before vs after — not a causal one. Conversion is up because of seasonality, a marketing campaign, a product launch, a competitor going down, the weather. The AI may have contributed; you cannot tell from this comparison.
Comparing AI segment vs non-AI segment. "Users in the AI cohort convert at 8%; users not in the AI cohort convert at 4%." Selection bias again — the AI cohort is the high-engagement users who entered the flow, the non-AI cohort is everyone else. Two different populations, not the same population randomly split. The AI didn't cause the lift; the segmentation did.
Reading individual metrics in isolation. Opens up, clicks flat, conversion down, unsubscribes up. The framing "opens are up 12%" is technically true and entirely wrong as a summary. Always read the metric stack together; AI personalisation often moves upstream metrics in directions that don't translate downstream.
Stopping tests when the chart looks good. Test runs 14 days, lift is 8%, "ship it." The 8% might be 3% by day 30 and 1% by day 60 as novelty effects (users engaging because the programme changed, regardless of whether the change was actually good) wear off. Pre-register the duration and honour it. The Experiment Design skill covers the discipline of finishing tests rather than stopping them when the chart looks pretty.
What you hand to the CFO when they ask
The point of the readout isn't to make the model look good. It's to make the decision easy. Six months later, when someone asks why you're still paying for the feature, you want to hand them a document, not summon a vibe. Here's the minimum.
The hypothesis. What was being tested, on what audience, against what metric, with what expected magnitude. If you couldn't falsify the hypothesis, it isn't one. It's a wish.
The holdout design. Sample size, randomisation method, test duration. If anything was non-standard — cluster randomisation (splitting by group rather than individual), post-stratification (rebalancing the cohorts after the fact) — explain why. Surprise statistical tricks in a readout are a red flag.
The result against the primary metric. Lift, confidence interval (the range the true number probably sits inside, given the test data), significance. The honest version, not the most flattering frame. A worked sketch of the computation follows this list.
Secondary metrics and guardrails. Unsubscribes, spam complaints, any embarrassing model outputs that required intervention — the things that go wrong quietly while the primary metric goes right. An AI personalisation feature that lifts the primary metric but raises unsubscribe rate isn't a winner. It's a fast path to deliverability problems your CRM lead will be unpicking next quarter.
The recommendation. Keep, kill, iterate — with reasoning. Not "the model worked". "The model produced X% lift on the primary metric, no movement on guardrails, recommend expanding to programmes Y and Z under the same measurement protocol." A real call, defended by the data above it.
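For the primary-metric line, the computation is small enough to sketch in full. A hedged version using the normal approximation for the difference in conversion rates; every count below is illustrative:

```python
from math import sqrt

def lift_readout(conv_t: int, n_t: int, conv_c: int, n_c: int, z: float = 1.96):
    """Relative lift plus a 95% CI on the absolute conversion difference.

    Normal approximation; reasonable at holdout-scale sample sizes.
    """
    p_t, p_c = conv_t / n_t, conv_c / n_c
    diff = p_t - p_c
    se = sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    lo, hi = diff - z * se, diff + z * se
    return {
        "relative_lift": diff / p_c,      # e.g. 0.075 = 7.5% lift
        "abs_diff_95ci": (lo, hi),        # CI on the raw rate difference
        "significant": lo > 0 or hi < 0,  # CI excludes zero
    }

# Illustrative: 4.3% vs 4.0% conversion on a 90/10 split of 200k recipients.
print(lift_readout(conv_t=7_740, n_t=180_000, conv_c=800, n_c=20_000))
```

If the CI straddles zero, that's the number you report: "lift of X%, not distinguishable from zero" is a legitimate, decision-grade result.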
This kind of readout is durable. Skip it and AI personalisation features tend to accumulate as load-bearing infrastructure nobody can justify and nobody wants to be the one to turn off. The year-end audit is unpleasant for everyone.
The teams that get the most from AI personalisation aren't the most enthusiastic about the technology. They're the most disciplined about measuring it. The two often look identical in a deck and very different in a P&L. Pick the second one.
Frequently asked questions
- What metric should I use to measure AI personalisation lift?
- The downstream business metric the programme exists for: conversion, revenue per recipient, retention, expansion. Avoid opens (corrupted by Apple MPP) and treat clicks as secondary. If the AI moves upstream metrics but not downstream ones, that's a finding — the model is generating activity, not value. Programmes measured on opens will keep buying AI features that don't earn their place.
- How big does the holdout need to be?
- Large enough to detect the realistic lift size with statistical power — meaning a high enough chance of spotting the effect if it's actually there. For typical AI personalisation lift (3–10% on conversion), a holdout of 10–20% of the eligible audience usually works for programmes above 50K monthly recipients. Below 50K, the holdout has to be larger relative to the audience to maintain power, which means slower readouts. The sample size calculator linked from the guide covers the maths.
- How long should an AI personalisation test run?
- 30 days minimum for upstream metrics; 60+ days for conversion or revenue. Shorter tests pick up novelty effects (users engaging because the programme changed, regardless of AI quality) that fade. The discipline is to declare the duration before the test starts and honour it, even if the early data looks great.
- Are vendor case studies useful at all?
- Yes — as upper bounds on what's possible, and as a guide to which use cases the model has been tuned for. Useless as expected values for your specific programme. The customers in the case study are not representative; they're the ones who saw lift. Plan around 1/3 to 1/2 of the case-study lift as a realistic range and let your holdout tell you the actual figure.
- What if my CFO wants the lift number?
- Give the holdout-validated lift on the primary metric, with the confidence interval, and explicitly note what's not included (no Apple MPP-affected metrics, no vendor self-reporting). A CFO trusts a smaller, defensible number more than a larger, fragile one — especially when next year's budget conversation requires defending the renewal of the AI personalisation contract. The honest readout protects the programme.
Related guides
Sample size: the calculation everyone gets wrong in email A/B tests
Most email A/B tests are powered to detect effects far larger than the test could actually produce. The result: false positives and false nulls, with confident conclusions in both directions. Sample size calculation fixes this before you send. Here's the 5-minute version.
Holdout group design: the incrementality tool most lifecycle programmes skip
Without a holdout, lifecycle ROI is attribution-model guesswork with a spreadsheet. With one, you get a defensible number you can actually put in front of finance. Here's how to size, run, and read a holdout — and the three mistakes that quietly invalidate the result.
Send-time optimisation: what it really moves, and what it doesn't
Every ESP markets an STO feature and every vendor deck shows lift. The honest version: STO moves open rate 3–8%, rarely revenue, and only for certain programme types. Here's when it's worth turning on.
A/B testing in email: sample size, novelty, and what to report
Most email A/B tests produce winners that don't reproduce. Three reasons keep showing up: under-powered samples, the novelty effect, and weak readout discipline. This guide is about designing tests that actually drive decisions instead of theatre.
Price-testing through email: what's testable, what isn't
Email is the fastest place to try a new price, and the easiest place to learn the wrong lesson. What you can test cleanly, what you can't, and the measurement traps that quietly turn price tests into expensive false positives.
False positives in email A/B tests: why half of winning tests don't actually win
Run enough A/B tests and some will show 'significant' lift from pure noise. Programmes that ship every significant winner end up with a collection of imaginary improvements they can't tell apart from real ones. Here's how to spot the fakes and avoid the trap.