Anomaly detection
Why this matters for your business
Most stores discover problems too late. A pixel breaks on a site update — you find out in three weeks when conversion drifts enough to notice in monthly reports. A discount code leaks to a coupon site — you find out when margin shows up wrong on Tuesday's P&L review. A WhatsApp template gets paused by Meta — you find out when customers complain on email. Each of these costs real money, and the cost compounds with every day you don't notice.
Anomaly detection is the watchdog that finds these problems within hours of them starting. Over 50 KPIs are monitored continuously — revenue, conversion, channel-specific engagement, fatigue posture, deliverability, ROAS, customer-acquisition cost, repeat rate — and the moment any of them deviates significantly from your shop's baseline, you get an alert with the delta, the direction, and the most likely cause.
The win isn't just speed; it's framing. Without anomaly detection, "sales were down this week" is a nervous question. With anomaly detection, "loyal-cohort engagement dropped 28% on iOS Safari starting Tuesday at 11:30 AM" is a diagnostic with a starting point. The investigation becomes 30 minutes instead of two days, and you fix the bug before the quarter is lost.
What this typically unlocks
| Outcome | Typical result |
|---|---|
| Time from "something broke" → detection | 2–4 hours vs. 1–3 weeks unmonitored |
| Revenue saved per quarter from caught issues | $30K–500K depending on scale |
| False-alarm rate | < 1 alert / week at default sensitivity |
| Alerts that include an automatic root-cause hint | ~70% |
| Time spent on monthly metric reviews | −60% — review exceptions, not the whole dashboard |
| Confidence in flat dashboards | high — "no anomalies" is a real signal, not assumed |
What you actually get
Continuous monitoring across four KPI families:
| Family | Examples | Alert sensitivity |
|---|---|---|
| Revenue & orders | Daily revenue, conversion rate, AOV, refund rate, repeat rate | Medium (default) |
| Channel engagement | Email open/click, WhatsApp read/reply, SMS click, push CTR | Medium |
| Acquisition | New customers/day by source, CAC, first-product mix | Low (high noise) |
| Compliance & deliverability | Unsub rate, bounce rate, fatigue cap hits, opt-in rate | High (low noise, high stakes) |
Each anomaly arrives with:
- What changed — the metric, the direction, the magnitude
- When it started — the precise hour the deviation began
- How it compares — vs. last week, last 30d, same period last year
- Where it's localized — segment, channel, product, region (if detectable)
- Likely cause hint — common patterns matched (e.g. "matches profile of a broken pixel")
- Suggested actions — what to check first
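If you consume alerts programmatically, it can help to picture the six parts above as one record. The sketch below is only an illustration: the class name `AnomalyAlert` and every field name are assumptions made for this example, not the platform's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class AnomalyAlert:
    """Hypothetical alert shape; field names are illustrative only."""
    metric: str                       # what changed
    direction: str                    # "up" or "down"
    magnitude_pct: float              # size of the change, in percent
    started_at: datetime              # hour the deviation began
    comparisons: dict = field(default_factory=dict)    # e.g. {"last_week": -0.28}
    localized_to: dict = field(default_factory=dict)   # e.g. {"device": "iOS Safari"}
    likely_cause: Optional[str] = None                 # None when no pattern matched
    suggested_actions: list = field(default_factory=list)

    def summary(self) -> str:
        """One-line headline in the style of the sample alerts in this page."""
        arrow = "↓" if self.direction == "down" else "↑"
        line = (f"{self.metric} {arrow} {self.magnitude_pct:g}%, "
                f"starting {self.started_at:%Y-%m-%d %H:%M}")
        if self.likely_cause:
            line += f" (likely cause: {self.likely_cause})"
        return line


alert = AnomalyAlert(
    metric="Conversion rate",
    direction="down",
    magnitude_pct=28,
    started_at=datetime(2026, 5, 9, 11, 30),
    localized_to={"device": "iOS Safari"},
)
print(alert.summary())  # Conversion rate ↓ 28%, starting 2026-05-09 11:30
```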
How it powers every part of your store
| Anomaly type | What it lets you fix fast |
|---|---|
| Conversion rate drop | Broken checkout, slow page load, paywall popup that's too aggressive |
| Email open-rate drop | Deliverability issue, sender-reputation hit, list-quality decay |
| WhatsApp read-rate drop | Template paused by Meta, fatigue cap hit too often |
| AOV unexpected drop | Discount code leaked, free-shipping threshold change error |
| Refund-rate spike | Product quality issue, sizing mismatch, shipping breakage |
| Unsub-rate spike | Over-sending, bad list segmentation, broken unsubscribe link |
| New-customer drop | Acquisition channel paused, ad account suspended, pixel broken |
| CAC spike | Ad-cost rise, ROAS dropped, audience saturated |
| Fatigue-cap-hits spike | Too many concurrent journeys, campaign double-run |
| Opt-in rate drop | Storefront widget broken, consent text change |
How it works (without the technical bits)
Baselines — robust to weekly cycles
Naive "compare to yesterday" alerts fire constantly because of weekly patterns (Saturdays don't look like Wednesdays). Our baselines use rolling median + median absolute deviation over the last 30 days, bucketed by day-of-week and hour-of-day. So:
- If today is Tuesday, today's 11 AM value is compared to 11 AM on the last 4 Tuesdays
- Daily revenue today is compared to the same weekday across the 30-day window (about 4 samples)
- Holiday weeks are flagged and excluded from baseline (you don't want last week's Black Friday spike making this Monday look flat by comparison)
The result: alerts fire when something unusual happens, not when Tuesday isn't Wednesday.
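For readers who do want the technical bits, the baseline idea can be sketched in a few lines. This is a minimal illustration of median + median absolute deviation (MAD) with weekday/hour bucketing, not the platform's implementation; the function names, the `(timestamp, value)` history format, and the sample numbers are all assumptions made for this example.

```python
from datetime import date, datetime
from statistics import median


def bucket_history(history, weekday, hour, excluded_dates=frozenset()):
    """Keep past values from the same weekday/hour bucket, skipping flagged dates.

    `history` is a list of (datetime, value) pairs (assumed format).
    """
    return [
        value for ts, value in history
        if ts.weekday() == weekday and ts.hour == hour
        and ts.date() not in excluded_dates
    ]


def deviation_in_mads(value, samples, min_mad=1e-9):
    """How many MADs the current value sits from the bucket median."""
    center = median(samples)
    mad = median(abs(x - center) for x in samples)
    return (value - center) / max(mad, min_mad)  # guard against a zero MAD


history = [
    (datetime(2026, 4, 7, 11), 400.0),   # Tuesday, flagged as a holiday week
    (datetime(2026, 4, 14, 11), 100.0),  # Tuesday
    (datetime(2026, 4, 21, 11), 104.0),  # Tuesday
    (datetime(2026, 4, 28, 11), 98.0),   # Tuesday
    (datetime(2026, 5, 5, 11), 102.0),   # Tuesday
    (datetime(2026, 5, 6, 11), 150.0),   # Wednesday, different bucket
]

# weekday=1 is Tuesday in Python's datetime convention.
samples = bucket_history(history, weekday=1, hour=11,
                         excluded_dates={date(2026, 4, 7)})
print(samples)                           # [100.0, 104.0, 98.0, 102.0]
print(deviation_in_mads(73.0, samples))  # -14.0
```

The median and MAD are used instead of mean and standard deviation because a single extreme day barely moves them. Some implementations scale the MAD by about 1.4826 to make it comparable to a standard deviation; this sketch leaves it unscaled.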
Severity — what actually pages you
Three levels:
| Severity | Trigger | Channel |
|---|---|---|
| P1 — Critical | > 4× sensitivity band; high-stakes KPIs (revenue, conversion, refunds) | In-app + email + SMS + Slack (if integrated) |
| P2 — Warning | 2–4× band; medium-stakes KPIs | In-app + email + Slack |
| P3 — Info | 1–2× band; informational | In-app only |
You can override sensitivity per KPI. A merchant who runs flash sales might set "AOV anomalies" to lower sensitivity (it spikes naturally). A merchant who just changed shipping thresholds might set "AOV anomalies" to higher sensitivity for a week to catch unintended consequences.
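The thresholds in the table can be read as a simple mapping from "how far outside the band" to a severity level. The sketch below assumes the deviation and the band width are expressed in the same unit, and the `BAND_SCALE` multipliers for per-KPI sensitivity are hypothetical values chosen for illustration. It also ignores the KPI-stakes part of the table and classifies on magnitude alone.

```python
# Hypothetical multipliers: lower sensitivity widens the band, higher narrows it.
BAND_SCALE = {"low": 1.5, "medium": 1.0, "high": 0.75}


def severity(deviation, band_width, sensitivity="medium"):
    """Map a deviation to P1/P2/P3 using the 1x / 2x / 4x band thresholds.

    Returns None when the deviation stays inside the band.
    """
    ratio = abs(deviation) / (band_width * BAND_SCALE[sensitivity])
    if ratio > 4:
        return "P1"
    if ratio >= 2:
        return "P2"
    if ratio >= 1:
        return "P3"
    return None


print(severity(-14, 3))          # P1
print(severity(-14, 3, "low"))   # P2
print(severity(2, 3))            # None
```

Note how the same deviation drops from P1 to P2 once sensitivity is lowered, which is the effect a flash-sale merchant would want on AOV.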
Suppression — knowing when not to alert
Three classes of suppression:
- Self-suppression when the cause is known. Acknowledge an anomaly with a reason ("known pixel issue, fixing tomorrow") and the same anomaly won't re-fire for 24 hours.
- Coordinated alert dedup. A single root cause that affects 10 KPIs (e.g. pixel break) fires one compound alert, not 10.
- Quiet hours. You can set quiet hours on P3 (info) alerts; P1 always pages.
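The three suppression rules above can be sketched as a small decision function plus a grouping step. Everything here is illustrative: the function names, the `alert_key` and `cause` fields, and the way a shared root cause is identified are assumptions, and the quiet-hours rule is applied to P3 only because that is the only case the text specifies.

```python
from datetime import datetime, time, timedelta

ACK_WINDOW = timedelta(hours=24)  # acknowledged anomalies stay quiet this long


def in_quiet_hours(t, start, end):
    """True if time `t` falls in [start, end); the window may wrap past midnight."""
    if start <= end:
        return start <= t < end
    return t >= start or t < end


def should_notify(alert_key, severity, now, acknowledged_at, quiet_start, quiet_end):
    """Decide whether to send a notification for one alert.

    acknowledged_at: dict mapping alert_key -> datetime of acknowledgment.
    """
    acked = acknowledged_at.get(alert_key)
    if acked is not None and now - acked < ACK_WINDOW:
        return False  # cause already acknowledged within the last 24 hours
    if severity == "P3" and in_quiet_hours(now.time(), quiet_start, quiet_end):
        return False  # info-level alerts respect quiet hours
    return True       # P1 always reaches this line unless acknowledged


def dedupe(alerts):
    """Fold alerts that share a root-cause key into one compound alert each."""
    grouped = {}
    for a in alerts:
        grouped.setdefault(a["cause"], []).append(a["metric"])
    return [{"cause": c, "metrics": m} for c, m in grouped.items()]


acks = {"pixel": datetime(2026, 5, 8, 6, 0)}
now = datetime(2026, 5, 8, 23, 30)
print(should_notify("pixel", "P1", now, acks, time(22), time(7)))  # False
print(should_notify("aov", "P3", now, acks, time(22), time(7)))    # False
print(should_notify("aov", "P1", now, acks, time(22), time(7)))    # True
```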
Localization — narrowing the search
The most useful part of an anomaly alert is the where:
"Conversion rate dropped 28% — confined to:
- Device: iOS Safari only
- Time: starting 2026-05-09 11:30 AM
- Affected pages: checkout (specifically /cart and /checkout)
- Geography: not localized (worldwide)
- Cohort: not localized (all lifecycle stages)"
That alert points at one investigation: a Safari-specific change to checkout deployed at 11:30 AM. Without localization, the investigation is "look at everything that happened this week." With it, the fix is in the next deploy.
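One way to think about localization is as a per-dimension comparison: if only some segments of a dimension dropped, the problem is localized there; if every segment dropped together, that dimension tells you nothing. The sketch below illustrates that idea only. The source does not describe how localization is actually computed, so the `localize` function, the 10% threshold, and the sample rates are all assumptions.

```python
def localize(before, after, min_drop=0.10):
    """Find segments whose rate fell by at least `min_drop` (relative drop).

    before/after: dict mapping dimension -> {segment: rate}.
    A dimension is reported as "not localized" when every segment
    (or no segment) crossed the threshold.
    """
    result = {}
    for dim, segments in before.items():
        hit = [
            seg for seg, base in segments.items()
            if base > 0 and (base - after[dim][seg]) / base >= min_drop
        ]
        if hit and len(hit) < len(segments):
            result[dim] = sorted(hit)
        else:
            result[dim] = "not localized"
    return result


before = {
    "device": {"iOS Safari": 0.040, "Android Chrome": 0.038, "Desktop": 0.045},
    "region": {"EU": 0.041, "US": 0.042},
}
after = {
    "device": {"iOS Safari": 0.012, "Android Chrome": 0.038, "Desktop": 0.044},
    "region": {"EU": 0.033, "US": 0.034},
}
print(localize(before, after))
# {'device': ['iOS Safari'], 'region': 'not localized'}
```

Here the regional rates fall everywhere because the Safari problem affects shoppers in every region, so only the device dimension narrows the search.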
Likely-cause hints
The system maintains a library of anomaly patterns — combinations of KPI changes that almost always have one cause:
| Pattern | Likely cause |
|---|---|
| Open rate ↓ across all channels + bounce rate ↑ | Sender reputation / deliverability issue |
| Conversion ↓ + page-load-time ↑ on same device | Site performance regression |
| New-customer ↓ on one channel only | Channel acquisition paused or pixel broken |
| AOV ↓ + Stripe avg-discount ↑ | Discount code leaked / over-redeemed |
| Refunds ↑ on one SKU | Product quality / shipping issue with that SKU |
| WA read-rate ↓ + WA send count = 0 | Template paused by Meta or account suspended |
| Repeat rate ↓ + journey enrolment count = 0 | Journey worker stuck / paused |
When the alert matches a pattern, the cause hint shows up in the notification — saving you the diagnostic step.
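A pattern library like the one in the table can be pictured as a list of signal sets, each paired with a cause. The sketch below is a simplified illustration: the signal labels such as `aov_down` are hypothetical names for the table's conditions, only a few rows are included, and the first-match rule is an assumption about how ties would be resolved.

```python
# Each entry: (signals that must all be present, cause hint).
# Signal names are hypothetical labels for rows of the table above.
PATTERNS = [
    ({"open_rate_down", "bounce_rate_up"},
     "Sender reputation / deliverability issue"),
    ({"conversion_down", "page_load_up"},
     "Site performance regression"),
    ({"aov_down", "avg_discount_up"},
     "Discount code leaked / over-redeemed"),
    ({"wa_read_rate_down", "wa_send_count_zero"},
     "Template paused by Meta or account suspended"),
    ({"repeat_rate_down", "journey_enrolment_zero"},
     "Journey worker stuck / paused"),
]


def likely_cause(observed):
    """Return the hint of the first pattern fully contained in `observed`."""
    for required, cause in PATTERNS:
        if required <= observed:  # subset test: all required signals present
            return cause
    return None                   # no hint when nothing matches


print(likely_cause({"aov_down", "avg_discount_up", "refund_rate_up"}))
# Discount code leaked / over-redeemed
print(likely_cause({"aov_down"}))  # None
```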
Real merchant scenarios
Scenario A — Catches a broken pixel within 5 hours
Setup. Mid-market brand pushed a site redesign overnight. Storefront pixel reference URL changed; pixel stopped firing.
Detection at 4:12 AM. Anomaly fired:
P1: Pixel events ↓ 96% — starting 2026-05-08 23:18
Localized: all pages, all devices
Likely cause: pixel deployment issue
Suggested: check storefront pixel install
Investigation took 8 minutes. Engineer rolled back pixel file at 4:25 AM. Total downtime: ~5 hours overnight.
Cost saved. Without detection, the pixel break would have been noticed when conversion-attribution looked off in Friday's weekly review — 4 days later. Estimated revenue impact:
| Window | Pixel-fed campaigns affected | Estimated lost revenue |
|---|---|---|
| 5 hours (caught) | 0 — overnight, low traffic | ~$0 |
| 4 days (uncaught) | 8 retargeting campaigns | ~$45K |
The alert paid for the entire feature in one incident.
Scenario B — Discount code leaked to a coupon site
Setup. Brand shared "WELCOME20" with email subscribers. Someone posted it to a coupon aggregator. Redemption rate exploded.
Anomaly fired at 2:30 PM (4 hours after the leak):
P2: Discount-redemption rate ↑ 740% — starting 2026-05-08 10:32
Affected code: WELCOME20
Localized: traffic source — heavy "couponcode.com" referral
Likely cause: discount code distribution leak
Suggested: review redemption sources, consider code rotation
Action. Brand swapped the code (WELCOME20 → WELCOME20-NEW), sent the new code to opted-in customers only. Redemption returned to baseline within 2 hours.
Cost saved. Pre-rotation: ~$8,400/hour in unintended margin loss. Quick detection saved ~$45K.
Scenario C — Meta paused a WhatsApp template
Setup. Brand uses a WhatsApp template for cart recovery. Meta auto-paused the template due to a flag (the template included "FREE" in caps which Meta now classifies as promotional spam).
Detection at 9:18 AM (template paused at 9:02 AM):
P1: WhatsApp template send rate = 0 — starting 2026-05-08 09:02
Affected template: cart_recovery_v2
Localized: only this template; other templates fine
Likely cause: template paused/rejected by provider
Suggested: check WhatsApp Business Manager template status
Action. Marketing manager logged into WA Business Manager, confirmed pause, edited template (removed all-caps "FREE"), resubmitted. Approved 2 hours later.
Cost saved. Cart recovery normally drives $1,800/day at this brand. Detected within 16 minutes of breakage, fixed within 2.5 hours. Without detection, the pause would have gone unnoticed until the weekly review (about $12K lost).
Scenario D — Catching a churn cohort 90 days early
Setup. Apparel brand. One specific cohort ("repeat buyers, ages 25–34, mobile") started showing an engagement decline. Revenue was unaffected so far (they were ordering out of habit).
Anomaly at week 3 of the trend:
P3: Email engagement on cohort "repeat-25-34-mobile" ↓ 28% — starting 2026-04-12
Localized: device — Android Chrome older versions
Likely cause: rendering issue on older browsers or display bug
Suggested: review recent template changes, check Android Chrome compatibility on older versions
Investigation. An email template change 3 weeks earlier had broken rendering on a specific older Android browser (Chrome 110 and below), which was that cohort's primary device.
Fix shipped in 1 week. Engagement recovered. Revenue never dropped because the brand caught the problem while it showed only in engagement (a leading indicator) and not yet in revenue (a trailing indicator).
This is the highest-leverage use of anomaly detection: leading indicators flag problems before lagging indicators such as revenue move.
Scenario E — Subscription churn surge from one bug
Setup. Subscription box. Skip-this-month feature broke on a deploy — customers couldn't skip, so they auto-churned instead.
Detection at 6:40 AM (deploy at 02:13 AM):
P1: Subscription cancellation rate ↑ 11× — starting 2026-05-08 02:14
Localized: customers who clicked "skip" on the dashboard
Likely cause: subscription action / dashboard regression
Suggested: check recent deploy + skip flow
Action. Engineering reverted the deploy at 7:02 AM, then re-activated the 47 customers who had been cancelled by mistake and sent each an apology email plus 1 free month.
Cost saved. 47 customers × ~$80/month × 6 months of average remaining subscription life = ~$22.5K. Plus brand reputation. Plus the operational chaos that would have followed.
Scenario F — False alarm rate matters
Setup. Brand previously used a generic monitoring tool with naive "deviation from yesterday" alerts. Got 5–8 alerts per day, mostly noise. Stopped reading them.
On this platform. Anomaly detection uses day-of-week-aware baselines + severity classification + suppression. Result:
| Period | Alerts received | Alerts actioned |
|---|---|---|
| Generic tool (week) | ~30 | ~2 |
| This platform (week) | ~3 | ~3 |
The share of alerts acted on went from ~7% (2 of 30) to ~100%. That's the difference between "noise" and "signal", and it's why suppression, severity classification, and day-of-week baselining matter so much.
Best practices
✅ Tune sensitivity per KPI. Default is good; some brands (flash-sale heavy) should reduce sensitivity on AOV, while others (deliverability-focused) should increase on bounce rate.
✅ Always set up the Slack integration if your team uses Slack. P1 alerts that land in Slack tend to be seen and acted on far sooner than email-only alerts.
✅ Read every P1 within 15 minutes of arrival. That's the operational discipline that makes the system worth it.
✅ Acknowledge anomalies with cause notes. "Pixel issue, deploying fix" lets the system suppress for 24h and avoids alert fatigue.
✅ Use info-only alerts for trend awareness. P3 anomalies on engagement metrics catch slow drifts that revenue-only monitoring misses.
✅ Run a quarterly "alert audit" — review the last 90 days of P1+P2 alerts. Adjust sensitivity on any noisy ones; investigate any that didn't fire when they should have.
❌ Don't disable alerts because they're inconvenient. If a KPI keeps firing, fix the underlying volatility (or tune sensitivity), don't silence the canary.
❌ Don't treat correlation as causation. "Conversion dropped and we changed the homepage" doesn't prove the homepage caused it. The localization tells you where to investigate, not why.
❌ Don't act on a single P3 alert. Info-level alerts can flicker; wait for a sustained pattern (3+ data points) before acting.
❌ Don't forget to re-enable alerts you suppressed. The 24h auto-resume covers most cases, but manually disabled KPIs stay disabled until you re-enable them.
Plan tiers
| Capability | Free | Starter | Pro | Agency | Enterprise |
|---|---|---|---|---|---|
| Daily revenue + order anomalies | — | ✓ | ✓ | ✓ | ✓ |
| Channel engagement anomalies | — | — | ✓ | ✓ | ✓ |
| Compliance / deliverability anomalies | — | — | ✓ | ✓ | ✓ |
| Custom KPI tracking | — | — | — | ✓ | ✓ |
| Day-of-week baseline | — | ✓ | ✓ | ✓ | ✓ |
| Severity classification | — | ✓ | ✓ | ✓ | ✓ |
| Localization (segment / device / channel) | — | — | ✓ | ✓ | ✓ |
| Likely-cause hints | — | — | ✓ | ✓ | ✓ |
| Slack integration | — | — | ✓ | ✓ | ✓ |
| SMS alerts (P1 only) | — | — | — | ✓ | ✓ |
| Multi-shop anomaly roll-up | — | — | — | ✓ | ✓ |
| Programmatic webhook on alert | — | — | ✓ | ✓ | ✓ |
Frequently asked
How fast does an anomaly fire? Hourly KPIs fire within ~10 minutes of the deviating sample; daily KPIs fire within an hour of midnight UTC. P1 alerts notify immediately; P2/P3 batch up to 30 minutes.
Can I tune which KPIs are monitored? Yes — every default KPI can be enabled, disabled, or have its sensitivity tuned. Custom KPIs (Agency+) can be added with a formula; e.g. "ratio of cart-recovery sends to cart-abandonment events".
What if my brand has natural volatility (e.g. flash sales)? The system handles known events: schedule a flash sale and the baselines exclude it from "normal" calculations. For unknown volatility, reduce sensitivity on the affected KPI so that routine swings stay inside the band.
Can anomalies trigger automated actions? Yes (Pro+) — anomaly webhook fires on any alert; you can wire that to "pause campaigns" / "notify on-call" / "trigger investigation flow". Common pattern: P1 anomaly on revenue → auto-pause all promotional sends until acknowledged.
Does this catch everything? No detection system is perfect. The system is tuned for ~85% recall (catches 85% of real issues) at < 1 false alarm/week per KPI. The trade-off is intentional — false alarm rate above 1/week trains people to ignore alerts, which is worse than missing some.
Can I see historical anomaly patterns? Yes — the anomaly history tab shows every alert ever fired, with acknowledgment status, cause notes, and resolution. Useful for post-mortem reviews and pattern-spotting.
What's the difference between this and a metrics dashboard? Dashboards require you to look at them. Anomaly detection pages you when something matters. Both are useful, but proactive paging is what catches problems fast.
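As an illustration of the automated-actions answer above, here is a rough sketch of what a webhook receiver's routing logic might look like. The payload fields (`severity`, `metric`, `acknowledged`) and the two callbacks are assumptions made for this example; check the actual webhook schema before building on it. The HTTP server itself is left out so the logic stays easy to test.

```python
import json


def handle_alert(body, pause_promotions, notify_on_call):
    """Route one webhook delivery and return the list of actions taken.

    body: raw JSON request body (payload fields are assumed, not documented).
    pause_promotions / notify_on_call: callables supplied by your own code.
    """
    alert = json.loads(body)
    actions = []
    if alert.get("severity") == "P1":
        notify_on_call(alert)
        actions.append("notify_on_call")
        # Common pattern from the FAQ: P1 on revenue pauses promotional
        # sends until someone acknowledges the alert.
        if alert.get("metric") == "revenue" and not alert.get("acknowledged", False):
            pause_promotions()
            actions.append("pause_promotions")
    return actions


calls = []
result = handle_alert(
    '{"severity": "P1", "metric": "revenue"}',
    pause_promotions=lambda: calls.append("pause"),
    notify_on_call=lambda alert: calls.append("page"),
)
print(result)  # ['notify_on_call', 'pause_promotions']
```

Keeping the side effects behind callbacks means the same function can be exercised in tests without pausing any real campaign.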
See also
- Customer 360 — the source of much of the underlying data
- Attribution & revenue — anomalies on attributed revenue
- Experiments & holdouts — anomalies often point at where to test next
- Campaigns — anomalies fire on campaign-related metrics
- Internal: Worker-down runbook — anomaly that paged you
- Sales engine overview