How quiet days are free — the detector layer
Most AWS accounts are quiet most days. The naive approach to a daily analysis is to call the LLM on every run, but that:
- Costs real money even when nothing happened
- Produces "looks good" reports that nobody reads
- Buries the rare "something interesting happened today" signal in noise
The detector layer sits between the data collection and the LLM. If no detector fires, the LLM is never called.
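The gate itself can be sketched in a few lines. This is a minimal illustration, not the production code: `run_daily_analysis`, the callable-based detectors, and `call_llm` are hypothetical names, and plain strings stand in for signals.

```python
from typing import Callable, Iterable


def run_daily_analysis(
    raw: dict,
    detectors: Iterable[Callable[[dict], list[str]]],
    call_llm: Callable[[list[str]], str],
) -> str:
    """Run every detector; call the LLM only if at least one fired."""
    fired: list[str] = []
    for detect in detectors:
        # Each detector returns zero or more fired signals (strings here).
        fired.extend(detect(raw))

    if not fired:
        # Quiet day: the LLM is never called, so the run has no LLM cost.
        return "All clear ✓"

    # Noisy day: hand only the fired signals to the LLM.
    return call_llm(fired)
```

Passing the LLM call in as a function keeps the gate easy to test: a fake `call_llm` can record whether it was invoked at all.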
What the layer looks like
┌────────────────────┐
│     Collectors     │   CostExplorer · CloudWatch · PI · CloudTrail · ALB · EB
└──────────┬─────────┘
           │  Raw signals (numbers, events)
           ▼
┌────────────────────┐
│  Detector runner   │   Runs each detector against the signals
└──────────┬─────────┘
           │  list[Signal]: what fired, what didn't
           ▼
   ┌───────┴───────┐
   │               │
  ZERO          NON-ZERO
   │               │
   ▼               ▼
┌──────┐      ┌─────────┐
│ skip │      │   LLM   │   Synthesises narrative + chains from the
│ LLM  │      │  synth  │   fired signals + a narrow slice of raw data
└──┬───┘      └────┬────┘
   │               │
   ▼               ▼
"All clear ✓"   Full report
   report         + email

The detectors today
| Detector | Fires when |
|---|---|
| WeekOverWeekCostChange | Any service line jumps > 20% vs the same day last week (with a $5 absolute-change floor so tiny services don't trigger) |
| AlbErrorRateSpike | 5xx ratio crosses 1% for ≥1 minute in the analysis window |
| AlbLatencyJump | p99 target latency jumps > 2× the rolling 7-day median |
| AlbUnhealthyTargets | > 25% of targets unhealthy for ≥3 minutes |
| RdsBurstCreditDepletion | Aurora burst-credit balance drops below 25% |
| RdsTopSqlChange | Top-3 wait-time queries change vs last analysis |
| EbHealthEvents | EB environment emits any Severe or Degraded health event |
| CloudTrailWriteSpike | Write-event count > 5× the rolling-7-day-median for the account |
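
Two of the rules above (AlbLatencyJump and CloudTrailWriteSpike) compare today's value against a multiple of the rolling 7-day median. A minimal sketch of that comparison follows; the function name and the "no history means do not fire" behaviour are assumptions for illustration, not the shipped logic.

```python
from statistics import median


def exceeds_rolling_median(today: float, last_7_days: list[float], factor: float) -> bool:
    """True when today's value is more than `factor` times the 7-day median."""
    if not last_7_days:
        # Assumption: with no history there is no baseline, so do not fire.
        return False
    return today > factor * median(last_7_days)


# AlbLatencyJump-style check: p99 latency (seconds) against 2x the median.
latency_history = [0.4, 0.4, 0.5, 0.4, 0.3, 0.4, 0.4]
latency_fired = exceeds_rolling_median(0.9, latency_history, factor=2.0)

# CloudTrailWriteSpike-style check: write-event count against 5x the median.
write_history = [100, 120, 110, 90, 130, 100, 105]
writes_fired = exceeds_rolling_median(600, write_history, factor=5.0)
```

A median baseline is less sensitive than a mean to a single unusual day in the previous week, which is one plausible reason to prefer it for this kind of rule.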
Each detector is a small Python class (≤ 100 lines of code) that takes the collected raw data and emits zero or more Signal objects. Signals carry:
- detector name
- fired boolean
- severity (info | low | medium | high | critical)
- A short description
- Pointers to the underlying raw data the LLM should see if this signal is the reason the LLM is being called
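
To make the shape concrete, here is a sketch of a `Signal` and a detector implementing the WeekOverWeekCostChange rule. The field names, the `run` method, the layout of `raw`, the `"medium"` severity, and the handling of services with no cost last week are all assumptions for illustration; the real classes in worker/detectors/ may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Signal:
    detector: str
    fired: bool
    severity: str  # info | low | medium | high | critical
    description: str
    data_refs: list[str] = field(default_factory=list)  # pointers to raw data


class WeekOverWeekCostChange:
    name = "WeekOverWeekCostChange"
    pct_threshold = 0.20  # fire on a jump of more than 20%
    abs_floor = 5.00      # assumed: the change must also be at least $5

    def run(self, raw: dict) -> list[Signal]:
        """Compare each service's cost with the same day last week."""
        signals: list[Signal] = []
        last_week = raw["cost_same_day_last_week"]
        for service, cost in raw["cost_today"].items():
            prev = last_week.get(service, 0.0)
            delta = cost - prev
            # Assumption: a service with no spend last week counts as an
            # unbounded percentage jump, so only the dollar floor applies.
            pct = delta / prev if prev > 0 else float("inf")
            if delta >= self.abs_floor and pct > self.pct_threshold:
                signals.append(
                    Signal(
                        detector=self.name,
                        fired=True,
                        severity="medium",
                        description=f"{service} cost up {pct:.0%} (${delta:.2f}) week over week",
                        data_refs=[f"cost/{service}"],
                    )
                )
        return signals
```

In this sketch a service that goes from $0.10 to $0.30 triples in cost but changes by only $0.20, so the dollar floor keeps it from firing.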
What the LLM sees when it does run
The LLM receives only the fired signals and the narrow data slice they point at, not the whole analysis window. That is the difference between $0.40 per quiet-day LLM call (naive) and $0.04 per noisy-day call (with the layer); with the layer, a quiet day costs nothing because the LLM is never called.
The system prompt instructs Claude to synthesise root-cause chains from the signals, not to re-derive the analysis from scratch.
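
One way to picture the narrowing step: keep the fired signals, collect the raw-data keys they point at, and drop everything else. This is a hedged sketch; `build_llm_payload`, the dict-shaped signals, and the `data_refs` key are hypothetical, and the real payload format is not documented here.

```python
import json


def build_llm_payload(signals: list[dict], raw: dict) -> str:
    """Serialise only fired signals plus the raw-data slices they reference."""
    fired = [s for s in signals if s["fired"]]

    # Union of every raw-data key that a fired signal points at.
    wanted = {ref for s in fired for ref in s["data_refs"]}

    payload = {
        "signals": fired,
        # Anything no fired signal references is left out of the prompt.
        "data": {key: raw[key] for key in sorted(wanted) if key in raw},
    }
    return json.dumps(payload, indent=2)
```

Because the payload grows with the number of fired signals rather than with the size of the analysis window, the prompt stays small even when the collectors gather a lot of data.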
Why store un-fired signals too
Every analysis writes one row per detector to detector_signals — fired or not, with the underlying numbers that led to the decision. Two reasons:
- Recall metric. When you investigate an incident later and discover the daily report missed it, we can ask: did the detectors see the signal but not fire (a threshold issue), or did the data not actually show anything (a real blind spot)? The un-fired signal rows tell us which.
- Tuning. If you tell us "yesterday's report didn't catch X but should have," we can replay the detectors against the historical data with adjusted thresholds and see whether the new thresholds would catch X without also firing on false positives.
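
The replay idea can be sketched as follows. The row layout is an assumption (the actual columns of detector_signals are not listed here), and `DetectorRow` and `replay` are hypothetical names; the point is only that storing the observed number alongside the decision makes re-evaluation cheap.

```python
from dataclasses import dataclass


@dataclass
class DetectorRow:
    day: str
    detector: str
    fired: bool
    observed: float   # the number the detector compared against its threshold
    threshold: float  # the threshold in force on that day


def replay(rows: list[DetectorRow], detector: str, new_threshold: float) -> list[str]:
    """Days on which `detector` would have fired under `new_threshold`."""
    return [
        r.day
        for r in rows
        if r.detector == detector and r.observed > new_threshold
    ]


# Illustrative history: 5xx ratios for AlbErrorRateSpike at the 1% threshold.
history = [
    DetectorRow("day-1", "AlbErrorRateSpike", False, 0.002, 0.01),
    DetectorRow("day-2", "AlbErrorRateSpike", False, 0.008, 0.01),  # missed incident
    DetectorRow("day-3", "AlbErrorRateSpike", False, 0.003, 0.01),
    DetectorRow("day-4", "AlbErrorRateSpike", True, 0.012, 0.01),
]

would_fire = replay(history, "AlbErrorRateSpike", new_threshold=0.005)
already_fired = {r.day for r in history if r.fired}
newly_fired = [day for day in would_fire if day not in already_fired]
```

In this made-up history, lowering the threshold to 0.5% adds only the missed day, which is the outcome a tuning pass is looking for; had other days appeared in `newly_fired`, they would need checking as possible false positives.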
You can see un-fired detector activity in Workspace → Platform admin → an org → Detector activity (platform-owner only today; per-org access TBD).
Adding a new detector
You can't from the UI — detectors are code (worker/detectors/). If you have a use case that doesn't fit the existing list, email us with:
- The metric / event you want to detect on
- The threshold (or "I don't know, help us figure one out")
- One concrete past incident this detector would have caught
We add detectors regularly based on real customer requests. The current list is conservative by design: it's easier to add a detector than to retroactively suppress false positives.