Experiments & holdouts

Why this matters for your business

The biggest mistake merchants make in marketing isn't in the campaigns they run — it's in the conclusions they draw afterward. "We sent that email and revenue went up — email works." Maybe. Or maybe revenue would have gone up anyway because it was payday, or because Meta scaled, or because of the season. Without a control group, every "did it work?" question is a guess dressed up as a fact.

Experiments and holdouts are the antidote. An experiment runs two or more variants of the same change (subject line A vs B, full price vs discount, journey-on vs journey-off) on randomly assigned customers and reports which variant won, with statistical confidence. A holdout takes a slice of an audience and gives them nothing, letting you measure what would have happened without the campaign. Both produce causal answers instead of correlations.

The compounding effect is enormous. A merchant who runs 4 experiments per quarter and acts on the winners builds a compounding edge that's invisible to competitors. Each test teaches the system (and the team) something true. Six months in, your campaign engine isn't running on intuition — it's running on verified knowledge of what works for your audience.

What this typically unlocks

Outcome | Typical lift
Campaign-level lift confidence | ±5% instead of "I think it worked"
Margin saved by killing under-performing journeys | $15–80K/year depending on volume
Team disagreement on "did X work?" | resolved by data, not loudest voice
Discount margin protection | +12%: proves which discount levels actually drive lift
Velocity of learning | ~4× more validated insights/quarter
Confidence in budget reallocation decisions | from anecdote-grade to investor-grade

What you actually get

Three families of test, all with proper assignment + sample-size guardrails + automated readouts:

Type | What it tests | Output
A/B test | Two variants of one thing (subject line, copy, offer) | Which variant won + p-value + lift
Multivariate (Pro+) | Multiple factors at once (subject × send time × discount %) | Which combination won + interactions
Holdout | Treatment (received) vs. holdout (received nothing) | Incremental lift = causal impact

Plus three kinds of guardrails that protect you from drawing wrong conclusions:

  • Sample-size sanity check. Won't declare a winner until there's enough data. Underpowered tests get a "needs more recipients" tag.
  • Always-on holdouts. Globally hold back 1–5% of your audience from all marketing for shop-wide lift measurement.
  • Stable assignment. Within a test, a customer always stays in the same variant. For the next test they are randomly re-assigned, so there is no leakage between tests.

How it powers every part of your store

Decision validated by experiments | Mechanism
Which subject line wins | A/B on send
Whether discount X actually drives lift vs. no discount | Holdout vs. treatment
Which channel converts best for this audience | Multivariate by channel
Whether the new welcome series beats the old | Journey-level holdout, gradual rollout
Whether ads stop a customer from churning | Audience holdout on retargeting campaigns
Whether shipping threshold change earns more | Always-on holdout, days–weeks
Whether the new copy from the brand refresh worked | Multivariate copy test
Whether VIP outreach is worth the team time | Holdout: VIP cohort split treatment / control

How it works (without the technical bits)

Three things that make experiments trustworthy

  1. Random assignment. Every customer is randomly placed into one variant; the assignment is deterministic per customer so they always see the same variant within a single test.
  2. Mutually exclusive cohorts. Within a test, each customer belongs to exactly one variant. A customer in variant A for test X is assigned independently at random for test Y, so assignments don't correlate between tests.
  3. Sample-size guardrails. A test that's "winning" by 30% on 50 customers means almost nothing — the system shows the confidence interval and refuses to declare a winner until sample size justifies the call.
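One common way to get assignment that is both random-looking and deterministic is to hash the customer ID together with the test ID. The sketch below illustrates that idea only; it is not a description of how the product assigns customers, and the function names, ID formats, and bucket granularity are all hypothetical.

```python
import hashlib


def _bucket(key: str) -> int:
    """Turn an arbitrary string into a stable, evenly spread integer."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def assign_variant(customer_id: str, test_id: str, variants: list[str]) -> str:
    """Deterministic per (test, customer): the same inputs always give the same variant.

    Including test_id in the hash gives each test its own independent shuffle,
    so being in variant A for one test says nothing about the next test.
    """
    return variants[_bucket(f"{test_id}:{customer_id}") % len(variants)]


def in_holdout(customer_id: str, holdout_id: str, holdout_pct: float) -> bool:
    """True if the customer falls in the held-back slice (holdout_pct=5.0 means 5%)."""
    # 10,000 buckets gives 0.01% granularity (an assumption of this sketch).
    return _bucket(f"holdout:{holdout_id}:{customer_id}") % 10_000 < holdout_pct * 100
```

Because nothing is stored, any service that knows the customer ID and test ID can recompute the assignment and will always agree with every other service.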

Statistical confidence — what we report

Every test result includes:

  • Lift: the magnitude of the difference (e.g. variant B beat A by 14%)
  • Confidence interval: the range of likely true lifts (e.g. 6%–22% with 95% confidence)
  • p-value: the probability of seeing a difference at least this large if the variants truly performed the same (lower = stronger evidence of a real difference)
  • Recommended action: "ship variant B", "keep testing", "no clear winner"

You don't need to know the math. You need to know that the system won't say "you won" when you didn't.
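For readers who do want to see the math, a readout like this can be approximated with a standard two-proportion z-test. This is a minimal sketch, not the product's actual method (the page doesn't say which test it uses). The function name and returned keys are hypothetical, and the counts are the open numbers from Scenario A on this page.

```python
import math


def two_proportion_readout(conv_a: int, n_a: int, conv_b: int, n_b: int,
                           z_crit: float = 1.96) -> dict:
    """Lift, 95% confidence interval, and two-sided p-value for B vs. A."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    diff = rate_b - rate_a

    # Test statistic uses the pooled rate (the "no real difference" assumption).
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail

    # Interval uses the unpooled standard error; it is expressed as an
    # absolute difference in rates, not as a relative lift.
    se = math.sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    return {
        "rate_a": rate_a,
        "rate_b": rate_b,
        "relative_lift": diff / rate_a,
        "abs_diff_ci": (diff - z_crit * se, diff + z_crit * se),
        "p_value": p_value,
    }


readout = two_proportion_readout(conv_a=4800, n_a=30000, conv_b=5910, n_b=30000)
```

With those inputs the relative lift works out to about 23%, the interval on the absolute difference excludes zero, and the p-value is far below 0.001, which is consistent with the Scenario A readout.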

Holdouts — measuring incrementality vs. attribution

Multi-touch attribution answers "of all my marketing, how should I credit this order?" Holdouts answer the harder question: "would this order have happened without my marketing at all?"

Treatment cohort (e.g. 95% of audience):
receives campaign / journey / ad
→ revenue per customer = $14.20

Holdout cohort (e.g. 5%):
receives nothing
→ revenue per customer = $9.40

Incremental lift = $4.80/customer × treated count = causal revenue

This is the gold standard. Attribution credits past clicks; holdout proves what wouldn't have happened without you.
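The arithmetic above is small enough to sketch directly. The `Cohort` class and function name here are hypothetical, and the cohort sizes (95,000 treated, 5,000 held out) are made up for illustration; only the per-customer revenue figures come from the example above.

```python
from dataclasses import dataclass


@dataclass
class Cohort:
    customers: int
    revenue: float  # total revenue from this cohort over the test window

    @property
    def revenue_per_customer(self) -> float:
        return self.revenue / self.customers


def incremental_revenue(treatment: Cohort, holdout: Cohort) -> tuple[float, float]:
    """Return (lift per customer, total incremental revenue across treated customers)."""
    lift = treatment.revenue_per_customer - holdout.revenue_per_customer
    return lift, lift * treatment.customers


# Hypothetical sizes chosen so revenue per customer matches $14.20 and $9.40.
treatment = Cohort(customers=95_000, revenue=1_349_000.0)
holdout = Cohort(customers=5_000, revenue=47_000.0)
lift, total = incremental_revenue(treatment, holdout)
print(f"lift per customer: ${lift:.2f}, incremental revenue: ${total:,.0f}")
```

Comparing per-customer averages rather than cohort totals is what makes a 95/5 split usable: the two cohorts differ in size by design, so totals alone would say nothing.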

Always-on shop holdouts

The most powerful experiment you'll ever run. 1–5% of your entire customer base is held out from all marketing for a quarter. At the end, you compare them to the treated 95–99%:

Metric | Treatment | Always-on holdout
Avg order count (90d) | 1.4 | 0.9
Avg revenue per customer (90d) | $84 | $52
Lift | +$32/customer |

If your audience is 100K customers and the lift is $32 per customer per quarter, your marketing program is generating $3.2M/quarter in incremental revenue (100K × $32). That's the number you take to the finance team. That's the number that justifies the marketing budget for next year.

Real merchant scenarios

Scenario A — Subject line A/B uncovers a 23% lift

Setup. Mid-market brand, 60K-recipient newsletter. Standard practice: marketing manager picks the subject line they like.

Test. A/B test on next campaign:

  • A: "Your new arrivals are here" (manager's favorite)
  • B: "We thought you'd like these" (suggested by the AI)

Result after 24h.

Variant | Sent | Opened | Open rate | Clicked
A | 30K | 4,800 | 16.0% | 540
B | 30K | 5,910 | 19.7% | 720

Lift: +23% on opens, +33% on clicks. Statistical significance p < 0.001. Variant B shipped to all future newsletters.

Compounding. Over 12 months, this single test (and the operational change to "always test the subject line") was estimated to drive +14% in newsletter-attributed revenue.

Scenario B — Holdout proves a journey is worth keeping

Setup. $20M brand questions whether a 5-step welcome series is worth the dev time vs. a simpler 2-step version.

Test. 10% holdout on the new-customer journey for 60 days.

Cohort | New customers | 30d 1st-order rate | Avg 1st-order value
Treatment (5-step) | 8,400 | 28.4% | $96
Holdout (no journey) | 940 | 19.1% | $89
Lift | | +9.3pp | +$7

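Counting only the extra first orders, the journey adds about $9 per new customer (9.3pp × $96 ≈ $8.93). That's $9 × 8,400 ≈ $76K incremental revenue per 60-day cohort from the journey alone. Decision: keep the 5-step. The extra dev complexity pays for itself many times over.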

Scenario C — Discount-level optimization

Setup. Cart-recovery flow currently sends 20% off. Brand suspects 15% would work just as well at lower margin cost.

Test. Four-variant offer test, 6 weeks:

Variant | Recovered carts | Avg order value | Margin / recovered cart
20% off (control) | 1,420 | $73 | $14.60
15% off | 1,290 | $76 | $19.36
10% off | 1,108 | $78 | $23.40
Free shipping (no $) | 1,196 | $74 | $24.40

Reading the table. 20% off recovered the most carts but had the lowest margin per recovery. Free shipping had the highest margin per recovered cart, even though it recovered about 16% fewer carts than the 20% control (1,196 vs. 1,420).
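Margin per recovered cart is only half the picture; what matters is margin per cart times the number of carts recovered. A quick sketch of that multiplication, using the figures from the table above, follows. It assumes each variant was shown to an equally sized group of abandoned carts, which the scenario doesn't state, and the variable names are illustrative.

```python
# (recovered carts, margin per recovered cart in dollars), from the Scenario C table
results = {
    "20% off (control)": (1420, 14.60),
    "15% off": (1290, 19.36),
    "10% off": (1108, 23.40),
    "Free shipping": (1196, 24.40),
}


def total_margin(variant_results: dict[str, tuple[int, float]]) -> dict[str, float]:
    """Total margin per variant = recovered carts x margin per recovered cart."""
    return {name: carts * margin for name, (carts, margin) in variant_results.items()}


totals = total_margin(results)
best = max(totals, key=totals.get)

for name, margin in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<20} ${margin:>10,.2f}")
```

On these numbers free shipping also comes out ahead on total margin over the six weeks (roughly $29.2K against roughly $20.7K for the 20% control), so the fewer recoveries are more than offset.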

Decision. Switched cart recovery to free shipping, kept 20% off as a "step 4 last-chance" only.

Projected impact. ~$28K extra margin per quarter at constant cart-abandonment volume.

Scenario D — Always-on holdout reveals total ROI

Setup. $40M brand wants to justify the marketing program to the board. Marketing budget is $4M/year — what's the return?

Test. 2% always-on holdout for one quarter (Q1).

Cohort | Customers | Revenue per customer (Q1)
Treatment (98%) | 588K | $186
Holdout (2%) | 12K | $148

Lift = $38/customer × 588K = $22.3M incremental revenue in Q1. Annualized: ~$89M. Marketing budget $4M produced $89M in incremental revenue. ROI: 22×.

The CFO had been treating marketing as a $4M cost line. The board now sees it as a 22× ROI growth investment. Budget question resolved.

Scenario E — Channel matchup test

Setup. Brand uncertain whether to invest in WhatsApp or push notifications for transactional updates ("order shipped").

Test. Random 4-way assignment over 6 weeks (10K orders/cohort):

Channel | Open / read | Click on tracking link | Reply rate
Email | 64% | 28% | 0.4%
WhatsApp | 89% | 41% | 6.2%
SMS | 78% | 19% | 0.8%
Push | 41% | 24% | 0%

Decision. Doubled down on WhatsApp for transactional. Push deprioritized (low open rate, no reply channel). Email kept as backup. SMS de-emphasized for transactional but kept for flash sales.

The reply rate is the surprise. WhatsApp's 6.2% reply rate became a customer-service flywheel — questions surfaced earlier, returns dropped 14%.

Scenario F — Sequential testing for incremental wins

Setup. Beauty brand commits to one test per week for 12 months. Small individual lifts, intentional compounding.

Highlights.

Quarter | Test focus | Lift | $ impact (annualized)
Q1 | Subject line styles (5 tests) | +18% open avg | $94K
Q2 | Send time (3 tests) | +12% open at 11AM | $61K
Q3 | Welcome series step count | -1 step won | $0 (cost-saving)
Q4 | Free-shipping threshold | $50 won over $35 | +$132K margin

Total Year 1. $287K incremental revenue/margin from a disciplined test cadence. None of these were "big bets." The discipline of testing was the moat.

Best practices

Test one variable at a time in A/B (unless using multivariate explicitly). "Email A vs WhatsApp B with discount 20% A vs 10% B" tells you nothing — you don't know which factor caused the lift.

Plan sample size before launching. Use the built-in power calculator: "I want to detect a 5% lift with 95% confidence" → the calculator tells you that you need, for example, ~3,200 recipients per variant. The exact figure depends on your baseline rate.
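To get a feel for what drives that number, here is a sketch of the textbook sample-size formula for comparing two rates. It is not the built-in calculator: its method and defaults aren't documented on this page, so results from this sketch may differ from the figure quoted above. The function name, the 80% power default, and the example baseline are assumptions.

```python
import math
from statistics import NormalDist


def recipients_per_variant(baseline_rate: float, relative_lift: float,
                           alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate recipients needed per variant to detect a relative lift.

    baseline_rate: current conversion/open rate, e.g. 0.20 for 20%
    relative_lift: smallest lift worth detecting, e.g. 0.05 for +5%
    alpha:         two-sided false-positive rate (0.05 ~ "95% confidence")
    power:         chance of detecting the lift if it is real (assumed 80%)
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Example call with a hypothetical 20% baseline open rate.
needed = recipients_per_variant(baseline_rate=0.20, relative_lift=0.05)
```

The required size grows with the inverse square of the lift: halving the lift you want to detect roughly quadruples the recipients needed, which is why small audiences can only confirm large effects.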

Run always-on shop holdouts. Even 1% gives you the quarterly lift number. The most-asked question in board meetings ("is marketing working?") gets a real answer.

Pair attribution with holdouts. Attribution shows credit; holdouts show causality. Together: "this campaign drove $50K of incremental revenue with email getting 60% of the credit."

Document hypotheses before testing. "I think free shipping will beat 20% off because customers focus on out-of-pocket cost at checkout." Writing the prediction makes the result mean something.

Don't peek at results before sample size is sufficient. "Variant B is up 30% after 200 recipients!" can flip with the next 50 recipients. Wait for the system to declare power.
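Why peeking misleads can be shown with a small simulation: run many A/A tests where both variants are truly identical, and count how often a "winner" appears if you check at several points versus only at the end. This is an illustrative sketch with made-up parameters (20% base rate, five looks, 500 simulated tests), not a model of the product's guardrails.

```python
import math
import random


def z_stat(conv_a: int, conv_b: int, n: int) -> float:
    """Two-proportion z statistic for equal-sized arms of n recipients each."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return 0.0
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    return (conv_b / n - conv_a / n) / se


def false_positive_rates(rate=0.2, n_final=1000,
                         looks=(200, 400, 600, 800, 1000),
                         sims=500, seed=7):
    """Return (rate with peeking, rate with one final look) when A and B are identical."""
    rng = random.Random(seed)
    peeking_hits = fixed_hits = 0
    for _ in range(sims):
        a = b = 0
        crossed_any_look = False
        for i in range(1, n_final + 1):
            a += rng.random() < rate  # both arms share the same true rate
            b += rng.random() < rate
            if i in looks and abs(z_stat(a, b, i)) > 1.96:
                crossed_any_look = True  # a peeker would have stopped here
        peeking_hits += crossed_any_look
        fixed_hits += abs(z_stat(a, b, n_final)) > 1.96
    return peeking_hits / sims, fixed_hits / sims


if __name__ == "__main__":
    peeking, fixed = false_positive_rates()
    print(f"false winners with peeking: {peeking:.1%}, with one final look: {fixed:.1%}")
```

The single final look should land near the nominal 5% false-positive rate, while checking at every look typically comes out well above it, because each extra look is another chance for noise to cross the threshold.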

Don't run experiments that contradict each other simultaneously. A subject-line test on the welcome email during a journey-version test means you can't tell which variable mattered.

Don't use experiments to justify already-made decisions. If you'll ship variant B regardless of result, don't test — just ship. Tests are decisions.

Don't stop running holdouts once you're "sure" things work. The shop holdout is what catches degradation. Without it, you'll find out the program stopped working when revenue already dropped — too late.

Plan tiers

Capability | Free | Starter | Pro | Agency | Enterprise
Subject-line A/B (campaigns)
Send-time A/B
Multivariate (3+ factors)
Per-campaign holdouts
Per-journey holdouts
Always-on shop holdouts
Sample-size power calculator
Statistical-significance reporting
Pre-registered hypothesis log
Cross-shop experiment library
Sequential testing controls

Frequently asked

How long should I run an A/B test? Until the power calculator says it's done. For a 50K-recipient campaign with a target detection of 5% lift, you'll need most of the audience — usually 24–48h is enough. For smaller audiences, longer.

Can I peek at results mid-test? You'll see live numbers, but the system won't declare a winner until power is reached. Resist the urge to act on early signals — they reverse more often than you'd think.

What if I have to ship something fast and don't have time to test? Ship it. Then run a holdout against the next batch of similar sends — you'll get a retroactive read on whether the change mattered. Some test is better than none.

Can multiple tests run at once? Yes — orthogonal tests (different audiences, different variables) don't interfere. The system flags conflicts automatically.

What happens if a test wins but the lift is small? You'll see the magnitude in the readout. A 1% lift with high confidence is real but probably not worth shipping if it adds operational complexity. The decision is yours.

Can I export experiment results? Yes — the v1 admin API exposes /api/v1/experiments/:id with full per-variant stats, assignment counts, and outcome metrics.
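A request against that endpoint might look like the sketch below. Only the path /api/v1/experiments/:id comes from this page; the base URL, the bearer-token header, the JSON response format, and the function names are assumptions, so check the API reference for the real authentication scheme and response shape.

```python
import json
import urllib.request


def build_experiment_request(base_url: str, experiment_id: str,
                             token: str) -> urllib.request.Request:
    """Build a GET request for one experiment's results."""
    url = f"{base_url.rstrip('/')}/api/v1/experiments/{experiment_id}"
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"Bearer {token}",  # assumed auth scheme
            "Accept": "application/json",        # assumed response format
        },
        method="GET",
    )


def fetch_experiment(base_url: str, experiment_id: str, token: str) -> dict:
    """Send the request and decode the body as JSON (assumed to be a JSON object)."""
    request = build_experiment_request(base_url, experiment_id, token)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Keeping request construction separate from the network call makes it easy to inspect or log the URL and headers before anything is sent.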

How do holdouts interact with global compliance? A holdout customer stays in your customer list and keeps the right to view their data and request deletion. They just don't receive marketing during the holdout window. All other rights are preserved.
