Most A/B tests SaaS teams run are underpowered, which is why the wins never seem to compound. Here is how to calculate sample size correctly, pick a sensible MDE, and run tests that actually change the business.

A/B Test Sample Size in 2026: How to Calculate It and Stop Shipping Fake Wins

Most growth teams I see in 2026 run far more A/B tests than they can statistically support. They run 20 tests a quarter, declare 12 winners, ship them, and nothing moves on the dashboard. The reason is almost always the same: the tests were underpowered, the "winners" were noise, and the team has been congratulating itself on a year of fake wins.

The single most important number in any A/B test is the sample size you need per variant before the math can support a verdict. Get that number right up front and most of the bad habits (peeking early, stopping on a Tuesday spike, calling a 7 percent lift on 500 visitors) disappear.

I built a free A/B test sample size calculator that runs the standard two-proportion power calculation so you can see exactly how many visitors per variant you need before you start a test.

What sample size actually controls

Sample size is the lever that trades off three things, none of which you can tighten for free: how small a lift you want to detect, how confident you want to be that a lift you call is real, and how often you are willing to miss a real lift entirely. Tighten any of them (a smaller detectable lift, a lower false positive rate, fewer missed effects) and the math demands more users in each variant. There is no shortcut and there is no clever trick. Every "we ran a quick test on 300 visitors and saw a 12 percent lift" is a coin flip dressed up as evidence.

The standard inputs:

The baseline conversion rate is the conversion rate of your current control. For the same relative lift, a 2 percent checkout conversion needs dramatically more sample than a 20 percent landing-page conversion because rare events are inherently noisier.

The minimum detectable effect (MDE) is the smallest relative lift you would actually act on. At alpha 0.05 (two-sided) and 80 percent power, a 5 percent MDE on a 2 percent baseline needs roughly 315,000 users per variant. A 20 percent MDE on the same baseline needs roughly 21,000 per variant. The relationship is brutal: halving the MDE roughly quadruples the sample.

The significance level (alpha) is your false positive tolerance. Industry standard is 0.05, meaning you accept a 5 percent chance of calling a tie a winner. Lower alpha to 0.01 for revenue-touching changes and, at 80 percent power, the required sample grows by roughly 50 percent.

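Power is the probability of detecting a real effect at least as large as your MDE; one minus power is your false negative rate. Standard is 0.80, meaning you catch a real lift 80 percent of the time. Below that, a flat result tells you very little, because the test was too likely to miss a lift that was really there.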
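To make the four inputs concrete, here is a minimal sketch of the standard two-proportion power calculation in Python, using only the standard library. It assumes a two-sided test and the unpooled-variance normal approximation; the function name and parameters are illustrative, and a given calculator or platform may use a slightly different variant of the formula, so treat the outputs as ballpark figures.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Visitors needed in each variant for a two-sided two-proportion z-test.

    baseline      -- control conversion rate, e.g. 0.02 for 2 percent
    relative_mde  -- smallest relative lift worth detecting, e.g. 0.20
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)              # rate if the lift is real
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha 0.05
    z_power = NormalDist().inv_cdf(power)           # about 0.84 for power 0.80
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


# 2 percent baseline at the default alpha and power
print(sample_size_per_variant(0.02, 0.05))              # on the order of 315,000
print(sample_size_per_variant(0.02, 0.20))              # on the order of 21,000
print(sample_size_per_variant(0.02, 0.20, alpha=0.01))  # stricter alpha, larger n
```

The denominator is the squared absolute difference between the two rates, which is why a smaller MDE blows up the requirement so quickly.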

Picking a realistic MDE

The MDE is where most teams sabotage themselves. They set it too low because they want to feel like a precise team, then write a test that would need three years of traffic to conclude.

The honest way to pick MDE is to ask: what lift would actually change what we do? A 2 percent lift on a free trial signup page is not changing anything; you would not even notice it in your monthly cohort review. A 15 percent lift on the same page would obviously change next quarter's growth model. Set the MDE at the size of decision-relevant lift, not the size of effect you secretly hope exists.

A practical anchor: most product changes that ship and matter deliver 5 to 30 percent relative lift on a single funnel step. Below 5 percent is the noise floor for almost every product. Above 30 percent usually requires a rewrite of the page, not a tweak.
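One way to sanity-check a candidate MDE is to invert the sample size calculation: given the traffic you can realistically put through each variant, what is the smallest relative lift the test could detect? The sketch below does that with a simple bisection search. It assumes a two-sided two-proportion z-test with unpooled variance and a baseline below 50 percent; both function names are hypothetical helpers, not part of any particular tool.

```python
from statistics import NormalDist


def required_sample(baseline, relative_mde, alpha=0.05, power=0.80):
    """Per-variant sample for a two-sided two-proportion z-test (not rounded)."""
    p1, p2 = baseline, baseline * (1 + relative_mde)
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2


def smallest_detectable_lift(baseline, visitors_per_variant,
                             alpha=0.05, power=0.80):
    """Smallest relative lift detectable with the given traffic per variant."""
    low = 0.001                           # 0.1 percent relative lift
    high = (1 - baseline) / baseline      # lift that would push conversion to 100%
    for _ in range(60):                   # bisection; 60 halvings is plenty
        mid = (low + high) / 2
        if required_sample(baseline, mid, alpha, power) > visitors_per_variant:
            low = mid                     # too small to detect with this traffic
        else:
            high = mid
    return high


# 2 percent baseline, 20,000 visitors available per variant
lift = smallest_detectable_lift(0.02, 20_000)
print(f"Smallest detectable relative lift: {lift:.1%}")
```

If the answer is far above the lift you would realistically expect from the change, the test cannot conclude on that page in that window, whatever MDE you write in the ticket.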

Why peeking is the silent killer

The standard A/B test math assumes you look once, at the end, after the planned sample size is reached. Every additional look where you consider stopping early inflates the false positive rate: check ten times at evenly spaced points and stop at the first significant result, and the nominal 5 percent rate climbs to nearly 20 percent. Teams that peek every morning and call winners early ship a parade of fake wins, then wonder why the cumulative impact on revenue never materializes.
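The inflation is easy to see in a simulation. The sketch below runs A/A tests (no real difference between variants) and compares two rules: stop the first time any interim look crosses the significance threshold, versus look once at the planned end. It is a simplified model, assuming equally spaced looks and a normally distributed test statistic; the function name, the seed and the number of simulations are arbitrary choices for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rates(looks=10, simulations=4000, alpha=0.05, seed=7):
    """Return (rate when stopping at any significant look, rate with one look)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    peeking_hits = 0   # some look crossed the threshold, so a peeker would stop
    single_hits = 0    # only the final look crossed the threshold
    for _ in range(simulations):
        running_sum = 0.0
        crossed_at_any_look = False
        z = 0.0
        for look in range(1, looks + 1):
            # Under the null, each equal-sized batch of traffic adds one
            # standard-normal increment to the cumulative statistic.
            running_sum += rng.gauss(0, 1)
            z = running_sum / sqrt(look)
            if abs(z) > z_crit:
                crossed_at_any_look = True
        peeking_hits += crossed_at_any_look
        single_hits += abs(z) > z_crit
    return peeking_hits / simulations, single_hits / simulations


peeking, single = false_positive_rates()
print(f"False positives with 10 peeks: {peeking:.1%}")
print(f"False positives with 1 look:   {single:.1%}")
```

With no real effect anywhere, the single-look rate stays near the nominal 5 percent while the peeking rate lands several times higher.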

If you genuinely need to monitor a test in flight (and most teams do not, they just want to), use sequential testing or always-valid p-values, which are designed to handle continuous monitoring. The major experimentation platforms in 2026 support both. The bigger discipline issue is: most teams that peek are not catching catastrophic regressions, they are looking for an excuse to ship faster.

Run time, not just sample size

Even if your test hits its sample size on Tuesday, do not stop the test on Tuesday. Conversion behavior shifts with the day of the week, the mix of mobile and desktop traffic, and any active marketing campaigns. The minimum useful test duration is one full week, and the standard is two. The right rule: test duration equals the maximum of (time to hit sample size), (two complete weekly cycles), and (any full marketing cycle that touches the page).
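The duration rule above can be written as a small helper. This is a sketch under stated assumptions: traffic is split evenly across variants, daily traffic is roughly constant, and the result is rounded up to whole weeks so every weekday is represented equally (that rounding is an addition of this example, not part of the rule itself). The function and parameter names are hypothetical.

```python
from math import ceil


def planned_test_days(sample_per_variant, daily_visitors, variants=2,
                      marketing_cycle_days=0, min_weekly_cycles=2):
    """Planned run time in days for an evenly split test."""
    # Days needed for every variant to reach its planned sample size.
    days_for_sample = ceil(sample_per_variant * variants / daily_visitors)
    # Floor of complete weekly cycles (two by default).
    floor_days = 7 * min_weekly_cycles
    longest = max(days_for_sample, floor_days, marketing_cycle_days)
    # Assumption: round up to whole weeks so no weekday is over-represented.
    return ceil(longest / 7) * 7


print(planned_test_days(21_000, daily_visitors=4_000))   # two-week floor applies
print(planned_test_days(21_000, daily_visitors=1_000))   # sample size dominates
print(planned_test_days(1_000, daily_visitors=5_000,
                        marketing_cycle_days=30))        # campaign dominates
```

Running it before launch also tells you early when a test needs more calendar time than the team is willing to give it.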

Once you internalize this, you stop running 50 tiny tests a quarter and start running 8 tests that actually conclude.

When A/B testing does not earn its keep

Most early-stage MVPs (under 5,000 weekly conversions on the page you want to test) should not be running formal A/B tests at all. The sample math will not support detectable effects in reasonable time, and the cost of running underpowered tests is worse than not testing: it gives you false certainty about what works.

For low-traffic stages, the right moves are: ship and observe, run before-and-after analyses with clear windows, instrument event funnels so you catch regressions, do qualitative user interviews when something does not work, and save A/B testing for the day your page can support it. If you are not sure what stage you are in, use the product-market fit score to gut-check whether you should be optimizing or still searching.

A simple workflow that works

The teams I watch ship real wins follow roughly the same loop.

They start with a hypothesis specific enough that the test could fail. "Adding social proof above the CTA will lift trial signups by 15 percent because users currently bounce on the trust signal" is a hypothesis. "Let's test some social proof" is not.

They run the sample size calculator up front and screenshot the inputs into the experiment ticket. Future them, peeking, sees the planned sample size and remembers not to call it early.

They commit to a minimum two-week run and a single end-of-test analysis.

They report results as both a point estimate and a confidence interval. "We saw a 12 percent lift, 95 percent confidence interval -2 to +26 percent" is honest. "We saw a 12 percent lift" on its own is misleading, because it hides the fact that the interval includes zero.

They keep a quarterly log of tests with their planned MDE, observed lift, and shipped/not-shipped decision. After three quarters this log becomes the most valuable artifact the growth team owns. It tells you what actually moved the business, where you were running tests for theater, and which categories of change deserve more investment.
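For the reporting step in the loop above, here is one way to compute a relative lift with its confidence interval from raw counts. The sketch uses the standard log-ratio (Katz) interval for a ratio of two proportions, which is an assumption of this example; experimentation platforms may use other methods. The function name is illustrative and the counts are made up.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def relative_lift_interval(control_conv, control_n, variant_conv, variant_n,
                           confidence=0.95):
    """Return (lift, low, high) as relative changes versus control."""
    p_c = control_conv / control_n
    p_v = variant_conv / variant_n
    log_ratio = log(p_v / p_c)
    # Standard error of the log of a ratio of two independent proportions.
    se = sqrt((1 - p_c) / control_conv + (1 - p_v) / variant_conv)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    low = exp(log_ratio - z * se) - 1
    high = exp(log_ratio + z * se) - 1
    return p_v / p_c - 1, low, high


# Made-up counts: 400 of 20,000 convert in control, 448 of 20,000 in the variant.
lift, low, high = relative_lift_interval(400, 20_000, 448, 20_000)
print(f"Lift {lift:+.0%}, 95% CI {low:+.0%} to {high:+.0%}")
```

With these counts the point estimate is a 12 percent lift, but the interval runs from slightly below zero to well above it, so the honest report is "probably positive, not yet proven".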

The compounding effect

Done well, a year of properly sized A/B testing compounds. Done poorly, it accumulates into a year of cargo-culted optimization where the team feels busy and the metrics do not move. The difference between those two outcomes is almost entirely set by the sample size math you do (or skip) in the first 10 minutes of every test.

Run your next test through the A/B test sample size calculator before you start. If the number looks impossibly large, raise your MDE or pick a higher-traffic page. If the number is comfortable, lock it, and do not peek.

