Orbit web apps
Z-test for A/B/n experiment results. Compare a control against up to 9 variants.
A 'significant' result is not the same as a meaningful result. Most lifecycle A/B tests that declare a winner at p < 0.05 are either under-powered (false positives) or measuring a one-week novelty effect (not a real lift). Here's how to avoid both traps.
Significance answers one question: could the difference you're seeing between variant and control plausibly be random noise? A two-proportion z-test (the math behind the calculator above) gives you a p-value. p < 0.05 means that if the variant truly performed no differently from control, you'd see a gap this large less than 5% of the time.
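If you want to sanity-check the calculator's math yourself, here's a minimal sketch of a two-proportion z-test in Python (using SciPy for the normal distribution). The conversion counts are placeholders, not real campaign data.

```python
# Minimal sketch of a two-proportion z-test. Counts are invented for illustration.
from math import sqrt
from scipy.stats import norm

def two_proportion_z_test(control_conv, control_n, variant_conv, variant_n):
    """Return the z statistic and two-sided p-value for variant vs control."""
    p_c = control_conv / control_n                      # control conversion rate
    p_v = variant_conv / variant_n                      # variant conversion rate
    p_pool = (control_conv + variant_conv) / (control_n + variant_n)
    se = sqrt(p_pool * (1 - p_pool) * (1 / control_n + 1 / variant_n))
    z = (p_v - p_c) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))                # two-sided
    return z, p_value

# Control: 1,000 conversions out of 25,000; variant: 1,120 out of 25,000
z, p = two_proportion_z_test(1000, 25_000, 1120, 25_000)
print(f"z = {z:.2f}, p = {p:.4f}, confidence = {1 - p:.1%}")
```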
95% confidence is the standard threshold, but it's a convention, not a law. For a decision that's cheap to reverse later (a subject line test), 90% confidence is fine. For a decision that locks in a strategy for a year (an onboarding flow overhaul), demand 99% confidence, and verify it in a follow-up test before committing.
Under-powered tests produce unreliable winners. Sample size matters more than duration — an A/B test with 500 conversions per variant can detect a 30% relative lift; one with 5,000 per variant can detect a 5% lift. Most lifecycle teams run with the former and claim to measure the latter.
Before running a test, calculate the minimum sample size needed to detect the lift you actually care about. A relative lift below 5% needs tens of thousands of conversions per variant to detect reliably, which most lifecycle programs will never accumulate on a single campaign. Design your tests to detect lifts worth detecting.
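For a rough pre-test number, here's a sketch of the standard two-proportion sample size calculation at 95% confidence and 80% power. The baseline rate and target lift are assumptions; swap in your own.

```python
# Rough minimum sample size per variant for a two-proportion test.
# Defaults: two-sided test at 95% confidence, 80% power.
from math import sqrt, ceil
from scipy.stats import norm

def min_sample_per_variant(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)            # rate you hope the variant hits
    p_bar = (p1 + p2) / 2
    z_alpha = norm.ppf(1 - alpha / 2)                   # ~1.96 for 95% confidence
    z_beta = norm.ppf(power)                            # ~0.84 for 80% power
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# Detecting a 5% relative lift on a 4% baseline conversion rate:
print(min_sample_per_variant(0.04, 0.05))   # ~154,000 recipients per variant
```

At a 4% baseline that works out to roughly 6,000 conversions per variant just to catch a 5% lift, and halving the lift you want to detect roughly quadruples the required sample.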
New variants often beat control for the first 3–7 days purely because they're new. Users notice the difference, engage with it, and drive early metrics up. Then the novelty fades and the variant returns to baseline.
Defence: run every A/B test for at least two full cycles of your natural sending rhythm. If you send daily, that's two weeks. If you send weekly, that's a month. And report weekly lift numbers, not cumulative — a variant that's ahead on day 1 and behind on day 14 is not a winner.
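A weekly readout can be as simple as this sketch: bucket conversions by week and compute lift per bucket instead of pooling everything. The weekly counts here are invented to show the novelty pattern.

```python
# Weekly (not cumulative) lift readout. Counts are invented to show the pattern.
weeks = [
    # (control_conversions, control_sends, variant_conversions, variant_sends)
    (400, 10_000, 480, 10_000),   # week 1: variant up ~20%, likely novelty
    (410, 10_000, 395, 10_000),   # week 2: variant drifts back below control
]

for i, (c_conv, c_n, v_conv, v_n) in enumerate(weeks, start=1):
    c_rate, v_rate = c_conv / c_n, v_conv / v_n
    lift = (v_rate - c_rate) / c_rate
    print(f"week {i}: control {c_rate:.2%}, variant {v_rate:.2%}, lift {lift:+.1%}")
```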
Report three numbers together: observed lift (%), confidence (%), and absolute conversion volume. 'Variant B had a 12% relative lift in click rate, 97% confidence, adding 340 extra clicks across 50K recipients.' Every one of those numbers changes the decision: a 12% lift at 60% confidence is noise; a 1% lift at 99% confidence on 5M recipients is a career-defining win.
The calculator above produces all three. Paste its output into your post-test report, and report the numbers for EVERY variant against control, not just the winner. Losing variants contain information.
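If you'd rather script the readout, here's one way to generate that three-number line for every variant; the counts are placeholders, and the stats mirror the z-test sketch above.

```python
# Three-number readout for every variant against control, losers included.
# Counts are placeholders; replace with your own campaign numbers.
from math import sqrt
from scipy.stats import norm

control_conv, control_n = 2_000, 50_000       # control: conversions, recipients
variants = [
    ("variant_b", 2_240, 50_000),              # (name, conversions, recipients)
    ("variant_c", 1_950, 50_000),              # a losing variant still gets reported
]

c_rate = control_conv / control_n

for name, v_conv, v_n in variants:
    v_rate = v_conv / v_n
    lift = (v_rate - c_rate) / c_rate                     # observed relative lift
    pooled = (control_conv + v_conv) / (control_n + v_n)
    se = sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / v_n))
    z = (v_rate - c_rate) / se
    confidence = 1 - 2 * (1 - norm.cdf(abs(z)))           # 1 minus the two-sided p-value
    extra = round((v_rate - c_rate) * v_n)                 # absolute extra conversions
    print(f"{name}: lift {lift:+.1%}, confidence {confidence:.1%}, {extra:+d} conversions")
```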
Built into Orbit
Orbit's Experiment Design skill calculates the minimum viable sample size before you run the test, sets up the Braze variants with proper random assignment, builds the readout template, and auto-flags novelty-effect patterns in the weekly analysis.
Go deeper
The long-form guides that explain the thinking behind the tool. Written for operators who want to know not just what to do, but why.
experimentation · 10 min read
A/B testing in email: sample size, novelty, and what to report
Most email A/B tests produce winners that don't reproduce. Three reasons keep showing up: under-powered samples, the novelty effect, and weak readout discipline. This guide is about designing tests that actually drive decisions instead of theatre.
experimentation · 8 min read
Sample size: the calculation everyone gets wrong in email A/B tests
Most email A/B tests are powered to detect effects far larger than the variant could plausibly produce. The result: false positives and false nulls, with confident conclusions in both directions. Sample size calculation fixes this before you send. Here's the 5-minute version.
experimentation · 8 min read
False positives in email A/B tests: why half of winning tests don't actually win
Run enough A/B tests and some will show 'significant' lift from pure noise. Programs that ship every significant winner end up with a collection of imaginary improvements they can't tell apart from real ones. Here's how to spot the fakes and avoid the trap.
experimentation · 8 min read
Segment-based testing: when your average lift is hiding opposing effects
A winning A/B test with 4% aggregate lift might be a 20% win in one segment and a 10% loss in another. The aggregate is an average of opposing effects. Segment analysis catches it — and lets you ship the win to the segments that benefit while not shipping the loss to the ones that don't.
Free A/B test statistical significance calculator using a two-proportion z-test. Supports up to 9 variants against 1 control, with confidence, lift, and p-value for each. Built for marketers running email, push, in-app, and landing page experiments.
Lifecycle and growth marketers who run experiments and need a quick significance check before calling a test.
Using Claude?
Inside Orbit for Claude, the Experiment Design skill runs significance testing natively. It pulls sample sizes and conversions straight from your Braze workspace, applies the right test, and flags when a result is or isn't ready to call. No spreadsheets, no copy-paste. Free for everyone — the Claude extension is the power-user upgrade, not a gated feature.