Orbit web apps
Z-test for A/B/n experiment results. Compare a control against up to 9 variants.
A 'significant' result is not the same as a meaningful result. Most lifecycle A/B tests that declare a winner at p < 0.05 are either under-powered (false positives) or measuring a one-week novelty effect (not a real lift). Here's how to avoid both traps.
Significance answers one question: could the difference you're seeing between variant and control plausibly be random noise? A two-proportion z-test (the math behind the calculator above) gives you a p-value. p < 0.05 means that if the variant truly performed no differently from control, you'd see a gap this large less than 5% of the time.
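If you want to sanity-check the calculator's math yourself, here's a minimal sketch of a two-proportion z-test in Python (using SciPy for the normal distribution). The conversion counts are placeholders, not real campaign data.

```python
# Minimal sketch of a two-proportion z-test. Counts are invented for illustration.
from math import sqrt
from scipy.stats import norm

def two_proportion_z_test(control_conv, control_n, variant_conv, variant_n):
    """Return the z statistic and two-sided p-value for variant vs control."""
    p_c = control_conv / control_n                      # control conversion rate
    p_v = variant_conv / variant_n                      # variant conversion rate
    p_pool = (control_conv + variant_conv) / (control_n + variant_n)
    se = sqrt(p_pool * (1 - p_pool) * (1 / control_n + 1 / variant_n))
    z = (p_v - p_c) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))                # two-sided
    return z, p_value

# Control: 1,000 conversions out of 25,000; variant: 1,120 out of 25,000
z, p = two_proportion_z_test(1000, 25_000, 1120, 25_000)
print(f"z = {z:.2f}, p = {p:.4f}, confidence = {1 - p:.1%}")
```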
95% confidence is the standard threshold, but it's a convention, not a law. For a decision that's cheap to reverse later (a subject line test), 90% confidence is fine. For a decision that locks in a strategy for a year (an onboarding flow overhaul), demand 99% confidence, and verify it in a follow-up test before committing.
Under-powered tests produce unreliable winners. Sample size matters more than duration — an A/B test with 500 conversions per variant can detect a 30% relative lift; one with 5,000 per variant can detect a 5% lift. Most lifecycle teams run with the former and claim to measure the latter.
Before running a test, calculate the minimum sample size needed to detect the lift you actually care about. A relative lift below 5% needs tens of thousands of conversions per variant to detect reliably, which most lifecycle programs will never accumulate on a single campaign. Design your tests to detect lifts worth detecting.
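For a rough pre-test number, here's a sketch of the standard two-proportion sample size calculation at 95% confidence and 80% power. The baseline rate and target lift are assumptions; swap in your own.

```python
# Rough minimum sample size per variant for a two-proportion test.
# Defaults: two-sided test at 95% confidence, 80% power.
from math import sqrt, ceil
from scipy.stats import norm

def min_sample_per_variant(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)            # rate you hope the variant hits
    p_bar = (p1 + p2) / 2
    z_alpha = norm.ppf(1 - alpha / 2)                   # ~1.96 for 95% confidence
    z_beta = norm.ppf(power)                            # ~0.84 for 80% power
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# Detecting a 5% relative lift on a 4% baseline conversion rate:
print(min_sample_per_variant(0.04, 0.05))   # ~154,000 recipients per variant
```

At a 4% baseline that works out to roughly 6,000 conversions per variant just to catch a 5% lift, and halving the lift you want to detect roughly quadruples the required sample.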
New variants often beat control for the first 3–7 days purely because they're new. Users notice the difference, engage with it, and drive early metrics up. Then the novelty fades and the variant returns to baseline.
Defence: run every A/B test for at least two full cycles of your natural sending rhythm. If you send daily, that's two weeks. If you send weekly, that's a month. And report weekly lift numbers, not cumulative — a variant that's ahead on day 1 and behind on day 14 is not a winner.
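A weekly readout can be as simple as this sketch: bucket conversions by week and compute lift per bucket instead of pooling everything. The weekly counts here are invented to show the novelty pattern.

```python
# Weekly (not cumulative) lift readout. Counts are invented to show the pattern.
weeks = [
    # (control_conversions, control_sends, variant_conversions, variant_sends)
    (400, 10_000, 480, 10_000),   # week 1: variant up ~20%, likely novelty
    (410, 10_000, 395, 10_000),   # week 2: variant drifts back below control
]

for i, (c_conv, c_n, v_conv, v_n) in enumerate(weeks, start=1):
    c_rate, v_rate = c_conv / c_n, v_conv / v_n
    lift = (v_rate - c_rate) / c_rate
    print(f"week {i}: control {c_rate:.2%}, variant {v_rate:.2%}, lift {lift:+.1%}")
```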
Report three numbers together: observed lift (%), confidence (%), and absolute conversion volume. 'Variant B had a 12% relative lift in click rate, 97% confidence, adding 340 extra clicks across 50K recipients.' Every one of those numbers changes the decision: a 12% lift at 60% confidence is noise; a 1% lift at 99% confidence on 5M recipients is a career-defining win.
The calculator above produces all three. Paste its output into your post-test report, and report the numbers for EVERY variant against control, not just the winner. Losing variants contain information.
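If you'd rather script the readout, here's one way to generate that three-number line for every variant; the counts are placeholders, and the stats mirror the z-test sketch above.

```python
# Three-number readout for every variant against control, losers included.
# Counts are placeholders; replace with your own campaign numbers.
from math import sqrt
from scipy.stats import norm

control_conv, control_n = 2_000, 50_000       # control: conversions, recipients
variants = [
    ("variant_b", 2_240, 50_000),              # (name, conversions, recipients)
    ("variant_c", 1_950, 50_000),              # a losing variant still gets reported
]

c_rate = control_conv / control_n

for name, v_conv, v_n in variants:
    v_rate = v_conv / v_n
    lift = (v_rate - c_rate) / c_rate                     # observed relative lift
    pooled = (control_conv + v_conv) / (control_n + v_n)
    se = sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / v_n))
    z = (v_rate - c_rate) / se
    confidence = 1 - 2 * (1 - norm.cdf(abs(z)))           # 1 minus the two-sided p-value
    extra = round((v_rate - c_rate) * v_n)                 # absolute extra conversions
    print(f"{name}: lift {lift:+.1%}, confidence {confidence:.1%}, {extra:+d} conversions")
```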
Built into Orbit
Orbit's Experiment Design skill calculates the minimum viable sample size before you run the test, sets up the Braze variants with proper random assignment, builds the readout template, and auto-flags novelty-effect patterns in the weekly analysis.
Go deeper
The long-form guides that explain the thinking behind the tool. Written for operators who want to know not just what to do, but why.
experimentation · 10 min read
A/B testing in email: sample size, novelty, and what to report
Most email A/B tests produce winners that don't reproduce. Three reasons keep showing up: under-powered samples, the novelty effect, and weak readout discipline. This guide is about designing tests that actually drive decisions instead of theatre.
experimentation · 8 min read
Sample size: the calculation everyone gets wrong in email A/B tests
Most email A/B tests are powered to detect effects far larger than the variant could plausibly produce. The result: false positives and false nulls, with confident conclusions in both directions. Sample size calculation fixes this before you send. Here's the 5-minute version.
experimentation · 8 min read
False positives in email A/B tests: why half of winning tests don't actually win
Run enough A/B tests and some will show 'significant' lift from pure noise. Programs that ship every significant winner end up with a collection of imaginary improvements they can't tell apart from real ones. Here's how to spot the fakes and avoid the trap.
experimentation · 8 min read
Segment-based testing: when your average lift is hiding opposing effects
A winning A/B test with 4% aggregate lift might be a 20% win in one segment and a 10% loss in another. The aggregate is an average of opposing effects. Segment analysis catches it — and lets you ship the win to the segments that benefit while not shipping the loss to the ones that don't.
Free A/B test statistical significance calculator using a two-proportion z-test. Supports up to 9 variants against 1 control, with confidence, lift, and p-value for each. Built for marketers running email, push, in-app, and landing page experiments.
Lifecycle and growth marketers who run experiments and need a quick significance check before calling a test.
Using Claude?
Inside Orbit for Claude, the Experiment Design skill runs significance testing natively. It pulls sample sizes and conversions straight from your Braze workspace, applies the right test, and flags when a result is or isn't ready to call. No spreadsheets, no copy-paste. Free for everyone — the Claude extension is the power-user upgrade, not a gated feature.