
Anomaly detection

Why this matters for your business

Most stores discover problems too late. A pixel breaks on a site update — you find out in three weeks when conversion drifts enough to notice in monthly reports. A discount code leaks to a coupon site — you find out when margin shows up wrong on Tuesday's P&L review. A WhatsApp template gets paused by Meta — you find out when customers complain on email. Each of these costs real money, and the cost compounds with every day you don't notice.

Anomaly detection is the watchdog that finds these problems within hours of them starting. Over 50 KPIs are monitored continuously — revenue, conversion, channel-specific engagement, fatigue posture, deliverability, ROAS, customer-acquisition cost, repeat rate — and the moment any of them deviates significantly from your shop's baseline, you get an alert with the delta, the direction, and the most likely cause.

The win isn't just speed; it's framing. Without anomaly detection, "sales were down this week" is a nervous question. With anomaly detection, "loyal-cohort engagement dropped 28% on iOS Safari starting Tuesday at 11:30 AM" is a diagnostic with a starting point. The investigation becomes 30 minutes instead of two days, and you fix the bug before the quarter is lost.

What this typically unlocks

| Outcome | Typical result |
| --- | --- |
| Time from "something broke" → detection | 2–4 hours vs. 1–3 weeks unmonitored |
| Revenue saved per quarter from caught issues | $30K–500K depending on scale |
| False-alarm rate | < 1 alert / week at default sensitivity |
| Diagnosed root causes per alert | ~70% include automatic root-cause hint |
| Time spent on monthly metric reviews | −60% — review exceptions, not the whole dashboard |
| Confidence in flat dashboards | high — "no anomalies" is a real signal, not assumed |

What you actually get

Continuous monitoring across four KPI families:

| Family | Examples | Alert sensitivity |
| --- | --- | --- |
| Revenue & orders | Daily revenue, conversion rate, AOV, refund rate, repeat rate | Medium (default) |
| Channel engagement | Email open/click, WhatsApp read/reply, SMS click, push CTR | Medium |
| Acquisition | New customers/day by source, CAC, first-product mix | Low (high noise) |
| Compliance & deliverability | Unsub rate, bounce rate, fatigue cap hits, opt-in rate | High (low noise, high stakes) |

Each anomaly arrives with:

  • What changed — the metric, the direction, the magnitude
  • When it started — the precise hour the deviation began
  • How it compares — vs. last week, last 30d, same period last year
  • Where it's localized — segment, channel, product, region (if detectable)
  • Likely cause hint — common patterns matched (e.g. "matches profile of a broken pixel")
  • Suggested actions — what to check first

How it powers every part of your store

| Anomaly type | What it lets you fix fast |
| --- | --- |
| Conversion rate drop | Broken checkout, slow page load, paywall popup that's too aggressive |
| Email open-rate drop | Deliverability issue, sender-reputation hit, list-quality decay |
| WhatsApp read-rate drop | Template paused by Meta, fatigue cap hit too often |
| AOV unexpected drop | Discount code leaked, free-shipping threshold change error |
| Refund-rate spike | Product quality issue, sizing mismatch, shipping breakage |
| Unsub-rate spike | Over-sending, bad list segmentation, broken unsubscribe link |
| New-customer drop | Acquisition channel paused, ad account suspended, pixel broken |
| CAC spike | Ad-cost rise, ROAS dropped, audience saturated |
| Fatigue-cap-hits spike | Too many concurrent journeys, campaign double-run |
| Opt-in rate drop | Storefront widget broken, consent text change |

How it works (without the technical bits)

Baselines — robust to weekly cycles

Naive "compare to yesterday" alerts fire constantly because of weekly patterns (Saturdays don't look like Wednesdays). Our baselines use rolling median + median absolute deviation over the last 30 days, bucketed by day-of-week and hour-of-day. So:

  • Today at 11 AM is compared to 11 AM on the same weekday over the last 30 days (about four prior samples)
  • Daily revenue today is compared to the same weekday over the last 30 days
  • Holiday weeks are flagged and excluded from baseline (you don't want last week's Black Friday spike making this Monday look flat by comparison)

The result: alerts fire when something unusual happens, not when Tuesday isn't Wednesday.
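As a rough illustration of the baseline described above, here is a minimal Python sketch of a median + MAD band bucketed by weekday and hour. The function names (`robust_band`, `deviation_score`), the three-sample minimum, and the `holidays` parameter are assumptions made for this example; they are not the platform's actual implementation.

```python
from datetime import datetime, timedelta
from statistics import median


def robust_band(history, now, window_days=30, holidays=frozenset()):
    """Return (median, MAD) of past samples sharing now's weekday and hour.

    history is an iterable of (timestamp, value) pairs. Samples that fall
    on a date listed in holidays are excluded from the baseline.
    """
    cutoff = now - timedelta(days=window_days)
    bucket = [
        value
        for ts, value in history
        if cutoff <= ts < now
        and ts.weekday() == now.weekday()
        and ts.hour == now.hour
        and ts.date() not in holidays
    ]
    if len(bucket) < 3:  # assumed minimum; too little data for a stable band
        return None
    center = median(bucket)
    mad = median(abs(v - center) for v in bucket)
    return center, mad


def deviation_score(value, center, mad, floor=1e-9):
    """Signed distance from the bucket median, measured in MADs."""
    return (value - center) / max(mad, floor)


# Tuesday 2026-05-12 at 11 AM, compared with the four previous Tuesdays.
now = datetime(2026, 5, 12, 11)
history = [
    (now - timedelta(weeks=1), 100.0),
    (now - timedelta(weeks=2), 104.0),
    (now - timedelta(weeks=3), 98.0),
    (now - timedelta(weeks=4), 102.0),
    (now - timedelta(days=1), 500.0),  # Monday sample, ignored by the bucket
]
center, mad = robust_band(history, now)  # (101.0, 2.0)
score = deviation_score(73.0, center, mad)  # -14.0
```

Because both the center and the spread are medians, the Monday outlier and any single unusual Tuesday have little effect on the band, which is the property that keeps "Tuesday isn't Wednesday" from firing alerts.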

Severity — what actually pages you

Three levels:

| Severity | Trigger | Channel |
| --- | --- | --- |
| P1 — Critical | > 4× sensitivity band; high-stakes KPIs (revenue, conversion, refunds) | In-app + email + SMS + Slack (if integrated) |
| P2 — Warning | 2–4× band; medium-stakes KPIs | In-app + email + Slack |
| P3 — Info | 1–2× band; informational | In-app only |

You can override sensitivity per KPI. A merchant who runs flash sales might set "AOV anomalies" to lower sensitivity (it spikes naturally). A merchant who just changed shipping thresholds might set "AOV anomalies" to higher sensitivity for a week to catch unintended consequences.
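The severity rules can be pictured with a small Python sketch. Everything here is illustrative: the KPI identifiers, the `classify` function, and the way a per-KPI `sensitivity` multiplier narrows or widens the band are assumptions. The source also does not say what happens when a non-high-stakes KPI exceeds 4× the band, so this sketch assumes it stays at P2.

```python
# Assumed identifiers for the high-stakes KPIs named in the severity table.
HIGH_STAKES = {"revenue", "conversion_rate", "refund_rate"}

# Notification channels per severity level, as listed in the table.
CHANNELS = {
    "P1": ["in_app", "email", "sms", "slack"],
    "P2": ["in_app", "email", "slack"],
    "P3": ["in_app"],
}


def classify(kpi, deviation, band, sensitivity=1.0):
    """Map a deviation to "P1", "P2", "P3", or None (inside the band).

    sensitivity > 1 makes the KPI easier to trigger and sensitivity < 1
    makes it harder (an assumed model of the per-KPI override).
    """
    multiple = abs(deviation) * sensitivity / band
    if multiple < 1:
        return None
    if multiple > 4 and kpi in HIGH_STAKES:
        return "P1"
    if multiple >= 2:
        return "P2"
    return "P3"


# A flash-sale merchant lowers AOV sensitivity so a routine swing stays quiet.
default_level = classify("aov", 15, 10)                 # "P3"
tuned_level = classify("aov", 15, 10, sensitivity=0.5)  # None
```

Modelling the override as a multiplier keeps the thresholds themselves fixed, so a tuned KPI still moves through the same P3, P2, P1 ladder.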

Suppression — knowing when not to alert

Three classes of suppression:

  • Self-suppression when the cause is known. Acknowledge an anomaly with a reason ("known pixel issue, fixing tomorrow") and the same anomaly won't re-fire for 24 hours.
  • Coordinated alert dedup. A single root cause that affects 10 KPIs (e.g. pixel break) fires one compound alert, not 10.
  • Quiet hours. You can set quiet hours on P3 (info) alerts; P1 always pages.
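The three suppression classes above can be sketched in a few lines of Python. The class name `Suppressor`, the `dedupe` helper, the alert dictionary shape, and the default quiet-hours window (22:00 to 07:00) are all hypothetical choices for illustration, not the product's API.

```python
from datetime import datetime, time, timedelta

ACK_WINDOW = timedelta(hours=24)  # acknowledged anomalies stay quiet this long


class Suppressor:
    def __init__(self, quiet_start=time(22, 0), quiet_end=time(7, 0)):
        self.acked = {}  # anomaly key -> time it was acknowledged
        self.quiet_start = quiet_start
        self.quiet_end = quiet_end

    def acknowledge(self, key, now):
        """Record that a human acknowledged this anomaly with a cause note."""
        self.acked[key] = now

    def in_quiet_hours(self, now):
        t = now.time()
        if self.quiet_start <= self.quiet_end:
            return self.quiet_start <= t < self.quiet_end
        # Window wraps past midnight (e.g. 22:00 to 07:00).
        return t >= self.quiet_start or t < self.quiet_end

    def should_notify(self, key, severity, now):
        acked_at = self.acked.get(key)
        if acked_at is not None and now - acked_at < ACK_WINDOW:
            return False  # self-suppression: cause already known
        if severity == "P3" and self.in_quiet_hours(now):
            return False  # quiet hours apply to info alerts only
        return True


def dedupe(alerts):
    """Collapse alerts sharing a root-cause hint into one compound alert."""
    grouped = {}
    for alert in alerts:
        grouped.setdefault(alert["cause"], []).append(alert["kpi"])
    return [{"cause": cause, "kpis": kpis} for cause, kpis in grouped.items()]
```

In this sketch P1 alerts ignore quiet hours entirely, matching the rule that P1 always pages, while an acknowledgement silences the same anomaly regardless of severity until the 24-hour window lapses.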

Localization

The most useful part of an anomaly alert is the where:

"Conversion rate dropped 28% — confined to:
- Device: iOS Safari only
- Time: starting 2026-05-09 11:30 AM
- Affected pages: checkout (specifically /cart and /checkout)
- Geography: not localized (worldwide)
- Cohort: not localized (all lifecycle stages)"

That alert points at one investigation: a Safari-specific change to checkout deployed at 11:30 AM. Without localization, the investigation is "look at everything that happened this week." With it, the fix is in the next deploy.
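One simple way to picture localization is to compare each dimension's values before and after the deviation and report a dimension only when some, but not all, of its values dropped. The Python sketch below does exactly that; the `localize` function, the 15% drop threshold, and the sample rates are assumptions for illustration rather than the platform's method.

```python
def localize(before, after, min_drop=0.15):
    """Report which values of each dimension carry the drop.

    before/after map dimension -> {value: rate}. A dimension counts as
    localized only if a strict subset of its values dropped by at least
    min_drop (relative); if every value dropped, the dimension does not
    help narrow the search.
    """
    report = {}
    for dim, base_rates in before.items():
        hit = [
            value
            for value, base in base_rates.items()
            if base > 0 and (base - after[dim][value]) / base >= min_drop
        ]
        if hit and len(hit) < len(base_rates):
            report[dim] = sorted(hit)
        else:
            report[dim] = "not localized"
    return report


# Hypothetical conversion rates by device and region.
before = {
    "device": {"iOS Safari": 0.040, "Android Chrome": 0.038, "Desktop": 0.045},
    "region": {"EU": 0.041, "US": 0.042},
}
after = {
    "device": {"iOS Safari": 0.010, "Android Chrome": 0.038, "Desktop": 0.044},
    "region": {"EU": 0.033, "US": 0.034},
}
report = localize(before, after)
# {"device": ["iOS Safari"], "region": "not localized"}
```

Both regions fall by roughly the same share here because the Safari drop is spread across them, so region is reported as not localized and device becomes the lead.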

Likely-cause hints

The system maintains a library of anomaly patterns — combinations of KPI changes that almost always have one cause:

| Pattern | Likely cause |
| --- | --- |
| Open rate ↓ across all channels + bounce rate ↑ | Sender reputation / deliverability issue |
| Conversion ↓ + page-load-time ↑ on same device | Site performance regression |
| New-customer ↓ on one channel only | Channel acquisition paused or pixel broken |
| AOV ↓ + Stripe avg-discount ↑ | Discount code leaked / over-redeemed |
| Refunds ↑ on one SKU | Product quality / shipping issue with that SKU |
| WA read-rate ↓ + WA send count = 0 | Template paused by Meta or account suspended |
| Repeat rate ↓ + journey enrolment count = 0 | Journey worker stuck / paused |

When the alert matches a pattern, the cause hint shows up in the notification — saving you the diagnostic step.
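A pattern library of this kind can be approximated as a list of KPI signatures paired with hints. The Python sketch below encodes a few rows of the table; the metric keys, the direction labels ("down", "up", "zero"), and the `cause_hint` function are hypothetical names chosen for the example.

```python
# Each entry pairs a signature (metric -> direction) with a cause hint.
PATTERNS = [
    ({"open_rate": "down", "bounce_rate": "up"},
     "Sender reputation / deliverability issue"),
    ({"conversion": "down", "page_load_time": "up"},
     "Site performance regression"),
    ({"aov": "down", "avg_discount": "up"},
     "Discount code leaked / over-redeemed"),
    ({"wa_read_rate": "down", "wa_send_count": "zero"},
     "Template paused by Meta or account suspended"),
    ({"repeat_rate": "down", "journey_enrolments": "zero"},
     "Journey worker stuck / paused"),
]


def cause_hint(observed):
    """Return the hint of the most specific signature contained in observed.

    observed maps metric -> direction for every KPI currently deviating.
    Returns None when no signature matches in full.
    """
    best = None
    for signature, hint in PATTERNS:
        matches = all(observed.get(k) == v for k, v in signature.items())
        if matches and (best is None or len(signature) > len(best[0])):
            best = (signature, hint)
    return best[1] if best else None


hint = cause_hint({"aov": "down", "avg_discount": "up", "revenue": "down"})
# "Discount code leaked / over-redeemed"
```

Requiring the whole signature to match keeps a lone AOV drop from being labelled a leak; the hint only appears when the supporting signal is present too.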

Real merchant scenarios

Scenario A — Catches a broken pixel within 5 hours

Setup. Mid-market brand pushed a site redesign overnight. Storefront pixel reference URL changed; pixel stopped firing.

Detection at 4:12 AM. Anomaly fired:

P1: Pixel events ↓ 96% — starting 2026-05-08 23:18
Localized: all pages, all devices
Likely cause: pixel deployment issue
Suggested: check storefront pixel install

Investigation took 8 minutes. Engineer rolled back pixel file at 4:25 AM. Total downtime: ~5 hours overnight.

Cost saved. Without detection, the pixel break would have been noticed when conversion-attribution looked off in Friday's weekly review — 4 days later. Estimated revenue impact:

| Window | Pixel-fed campaigns affected | Estimated lost revenue |
| --- | --- | --- |
| 5 hours (caught) | 0 — overnight, low traffic | ~$0 |
| 4 days (uncaught) | 8 retargeting campaigns | ~$45K |

The alert paid for the entire feature in one incident.

Scenario B — Discount code leaked to a coupon site

Setup. Brand shared "WELCOME20" with email subscribers. Someone posted it to a coupon aggregator. Redemption rate exploded.

Anomaly fired at 2:30 PM (4 hours after the leak):

P2: Discount-redemption rate ↑ 740% — starting 2026-05-08 10:32
Affected code: WELCOME20
Localized: traffic source — heavy "couponcode.com" referral
Likely cause: discount code distribution leak
Suggested: review redemption sources, consider code rotation

Action. Brand swapped the code (WELCOME20 → WELCOME20-NEW), sent the new code to opted-in customers only. Redemption returned to baseline within 2 hours.

Cost saved. Pre-rotation: ~$8,400/hour in unintended margin loss. Quick detection saved ~$45K.

Scenario C — Meta paused a WhatsApp template

Setup. Brand uses a WhatsApp template for cart recovery. Meta auto-paused the template due to a flag (the template included "FREE" in caps which Meta now classifies as promotional spam).

Detection at 9:18 AM (template paused at 9:02 AM):

P1: WhatsApp template send rate = 0 — starting 2026-05-08 09:02
Affected template: cart_recovery_v2
Localized: only this template; other templates fine
Likely cause: template paused/rejected by provider
Suggested: check WhatsApp Business Manager template status

Action. Marketing manager logged into WA Business Manager, confirmed pause, edited template (removed all-caps "FREE"), resubmitted. Approved 2 hours later.

Cost saved. Cart recovery normally drives $1,800/day at this brand. Detected within 16 minutes of breakage, fixed within 2.5 hours. Without detection: would have been noticed in weekly review ($12K lost).

Scenario D — Catching a churn cohort 90 days early

Setup. Apparel brand. One specific cohort ("repeat buyers, ages 25–34, mobile") started showing engagement decline. Revenue unaffected so far (they were ordering on habit).

Anomaly at week 3 of the trend:

P3: Email engagement on cohort "repeat-25-34-mobile" ↓ 28% — starting 2026-04-12
Localized: device — Android Chrome older versions
Likely cause: rendering issue on older browsers or display bug
Suggested: review recent template changes, check Android Chrome compatibility on older versions

Investigation. An email template change 3 weeks earlier had broken rendering on a specific older Android browser (Chrome 110 and below), which was that cohort's primary device.

Fix shipped in 1 week. Engagement recovered. Revenue never dropped because the brand caught the problem while it showed only in engagement (the leading indicator), before it reached revenue (the trailing indicator).

This is the highest-leverage use of anomaly detection — leading indicators flag problems before lagging indicators (revenue) have to.

Scenario E — Subscription churn surge from one bug

Setup. Subscription box. Skip-this-month feature broke on a deploy — customers couldn't skip, so they auto-churned instead.

Detection at 6:40 AM (deploy at 2:13 AM):

P1: Subscription cancellation rate ↑ 11× — starting 2026-05-08 02:14
Localized: customers who clicked "skip" on the dashboard
Likely cause: subscription action / dashboard regression
Suggested: check recent deploy + skip flow

Action. Engineering reverted the deploy at 7:02 AM. Re-activated the 47 customers who'd accidentally cancelled with an apologetic email + 1 free month.

Cost saved. 47 customers × ~$80/month × 6 month avg remaining LTV = ~$22.5K. Plus brand reputation. Plus the operational chaos that would have followed.

Scenario F — False alarm rate matters

Setup. Brand previously used a generic monitoring tool with naive "deviation from yesterday" alerts. Got 5–8 alerts per day, mostly noise. Stopped reading them.

On this platform. Anomaly detection uses day-of-week-aware baselines + severity classification + suppression. Result:

| Period | Alerts received | Alerts actioned |
| --- | --- | --- |
| Generic tool (week) | ~30 | ~2 |
| This platform (week) | ~3 | ~3 |

The share of alerts acted on went from ~7% to ~100%. That's the difference between "noise" and "signal", and it's why suppression, cause hints, and day-of-week baselining matter so much.

Best practices

Tune sensitivity per KPI. Default is good; some brands (flash-sale heavy) should reduce sensitivity on AOV, while others (deliverability-focused) should increase on bounce rate.

Always set up the Slack integration if your team uses Slack. P1 alerts hitting Slack are 10× more actionable than email-only.

Read every P1 within 15 minutes of arrival. That's the operational discipline that makes the system worth it.

Acknowledge anomalies with cause notes. "Pixel issue, deploying fix" lets the system suppress for 24h and avoids alert fatigue.

Use info-only alerts for trend awareness. P3 anomalies on engagement metrics catch slow drifts that revenue-only monitoring misses.

Run a quarterly "alert audit" — review the last 90 days of P1+P2 alerts. Adjust sensitivity on any noisy ones; investigate any that didn't fire when they should have.

Don't disable alerts because they're inconvenient. If a KPI keeps firing, fix the underlying volatility (or tune sensitivity), don't silence the canary.

Don't treat correlation as causation. "Conversion dropped and we changed the homepage" doesn't prove the homepage caused it. The localization tells you where to investigate, not why.

Don't act on a single P3 alert. Info-level alerts can flicker; wait for a sustained pattern (3+ data points) before acting.

Don't forget to re-enable alerts you suppressed. The 24h auto-resume covers most cases, but manually disabled KPIs stay disabled until you re-enable them.

Plan tiers

Capability | Free | Starter | Pro | Agency | Enterprise
Daily revenue + order anomalies
Channel engagement anomalies
Compliance / deliverability anomalies
Custom KPI tracking
Day-of-week baseline
Severity classification
Localization (segment / device / channel)
Likely-cause hints
Slack integration
SMS alerts (P1 only)
Multi-shop anomaly roll-up
Programmatic webhook on alert

Frequently asked

How fast does an anomaly fire? Hourly KPIs fire within ~10 minutes of the deviating sample; daily KPIs fire within an hour of midnight UTC. P1 alerts notify immediately; P2/P3 batch up to 30 minutes.

Can I tune which KPIs are monitored? Yes — every default KPI can be enabled, disabled, or have its sensitivity tuned. Custom KPIs (Agency+) can be added with a formula; e.g. "ratio of cart-recovery sends to cart-abandonment events".

What if my brand has natural volatility (e.g. flash sales)? The system handles known events: schedule a flash sale and the baselines exclude it from "normal" calculations. For unknown volatility, reduce sensitivity on the affected KPI so routine swings don't trigger alerts.

Can anomalies trigger automated actions? Yes (Pro+) — anomaly webhook fires on any alert; you can wire that to "pause campaigns" / "notify on-call" / "trigger investigation flow". Common pattern: P1 anomaly on revenue → auto-pause all promotional sends until acknowledged.
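To make the "P1 on revenue pauses promotional sends" pattern concrete, here is a minimal Python sketch of a webhook receiver's decision logic. The payload fields (`severity`, `kpi`, `acknowledged`) and both callbacks are hypothetical; the real webhook schema is not described here, so check the actual payload before relying on any of these names.

```python
import json


def handle_anomaly_webhook(body, pause_promotional_sends, notify_on_call):
    """Decide what to do with one alert payload.

    body is the raw JSON string of the webhook request (assumed shape).
    The two callbacks are supplied by the caller, for example wrappers
    around a campaign tool and a paging tool. Returns the actions taken.
    """
    alert = json.loads(body)
    actions = []
    if alert.get("severity") == "P1":
        notify_on_call(alert)
        actions.append("notify_on_call")
        # Pause promotions only for unacknowledged revenue anomalies.
        if alert.get("kpi") == "revenue" and not alert.get("acknowledged", False):
            pause_promotional_sends()
            actions.append("pause_promotional_sends")
    return actions
```

Keeping the decision in a plain function with injected callbacks makes it easy to test the routing rules without sending real pages or pausing real campaigns.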

Does this catch everything? No detection system is perfect. The system is tuned for ~85% recall (catches 85% of real issues) at < 1 false alarm/week per KPI. The trade-off is intentional — false alarm rate above 1/week trains people to ignore alerts, which is worse than missing some.

Can I see historical anomaly patterns? Yes — the anomaly history tab shows every alert ever fired, with acknowledgment status, cause notes, and resolution. Useful for post-mortem reviews and pattern-spotting.

What's the difference between this and a metrics dashboard? Dashboards require you to look at them. Anomaly detection pages you when something matters. Both useful — but the proactive paging is what catches problems fast.
