Incident hunting
Did a customer tell you about a 5xx spike from yesterday that you had no alarm for? Cloudbrief's incident hunter scans up to 30 days of historical metrics and finds the time windows that look like incidents.
How it works
To start a hunt, go to Cloudbrief → Incidents and click "Hunt for the last 30 days".
The hunter runs an anomaly scanner over:
- ALB 5xx error rate
- ALB target latency
- ALB target health (unhealthy host fraction)
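As a rough illustration of what a scan over the first of these metrics could look like, the sketch below builds a CloudWatch GetMetricData request for a 5xx error rate. It is not Cloudbrief's actual query: the helper name `build_error_rate_queries`, the choice of `HTTPCode_ELB_5XX_Count` divided by `RequestCount`, the 300-second period, and the example load balancer value are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def build_error_rate_queries(load_balancer: str, period: int = 300) -> list[dict]:
    """Build MetricDataQueries that compute a 5xx error rate for one ALB.

    `load_balancer` is the CloudWatch dimension value (hypothetical example:
    "app/my-alb/0123456789abcdef"). The period is an assumption; CloudWatch
    keeps 1-minute data points for 15 days, so a 30-day scan needs a coarser one.
    """
    def metric_sum(query_id: str, metric_name: str) -> dict:
        return {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApplicationELB",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
                },
                "Period": period,
                "Stat": "Sum",
            },
            "ReturnData": False,  # only the derived rate is returned
        }

    return [
        metric_sum("errors", "HTTPCode_ELB_5XX_Count"),
        metric_sum("requests", "RequestCount"),
        # Metric math: fraction of requests that ended in a 5xx.
        {"Id": "error_rate", "Expression": "errors / requests", "Label": "5xx error rate"},
    ]


end = datetime.now(timezone.utc)
params = {
    "MetricDataQueries": build_error_rate_queries("app/my-alb/0123456789abcdef"),
    "StartTime": end - timedelta(days=30),
    "EndTime": end,
}
# With boto3 installed and credentials configured, the request could be sent as:
# boto3.client("cloudwatch").get_metric_data(**params)
```

Latency and target health would follow the same shape with `TargetResponseTime` and the `UnHealthyHostCount` / `HealthyHostCount` pair, the latter also needing a `TargetGroup` dimension.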
For each metric, it identifies windows where the value crosses a per-metric threshold for at least 60 seconds and the surrounding period is otherwise quiet. Each window becomes a candidate incident with:
- Severity score (0–10), derived from peak magnitude × duration
- Time range
- Affected resource (load balancer + target group)
- Top-line summary
Candidates show up in the Incidents tab sorted by severity.
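The detection and scoring steps above can be pictured with a minimal sketch. It assumes evenly spaced samples, skips the "surrounding period is otherwise quiet" check, and uses a made-up scoring rule (peak × duration in minutes × `scale`, capped at 10), because the real thresholds and the exact mapping to the 0–10 scale are not published here. The names `Candidate` and `find_candidates` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    start: int        # epoch seconds of the first breaching sample
    end: int          # epoch seconds just past the last breaching sample
    peak: float       # highest value seen inside the window
    severity: float   # illustrative 0-10 score


def find_candidates(samples, threshold, period=60, min_duration=60, scale=1.0):
    """Return candidate windows sorted by severity, highest first.

    `samples` is a time-sorted list of (epoch_seconds, value) pairs,
    one per `period` seconds. Gaps in the series are not handled.
    """
    candidates = []
    run = []  # consecutive samples above the threshold
    # The trailing sentinel flushes a run that reaches the end of the series.
    for ts, value in list(samples) + [(None, None)]:
        if value is not None and value > threshold:
            run.append((ts, value))
            continue
        if run:
            start = run[0][0]
            end = run[-1][0] + period
            duration = end - start
            if duration >= min_duration:
                peak = max(v for _, v in run)
                # Assumed scoring rule: peak x duration (minutes), capped at 10.
                severity = min(10.0, peak * (duration / 60) * scale)
                candidates.append(Candidate(start, end, peak, severity))
            run = []
    return sorted(candidates, key=lambda c: c.severity, reverse=True)


# Error-rate samples at 60-second spacing, with a 5% threshold.
series = [(0, 0.01), (60, 0.2), (120, 0.5), (180, 0.01), (240, 0.3), (300, 0.0)]
for candidate in find_candidates(series, threshold=0.05):
    print(candidate)
```

In this toy series the two-minute breach ranks above the one-minute breach, which mirrors how the Incidents tab orders candidates.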
Investigating a candidate
Click any candidate to launch a focused investigation. It pulls:
- ALB metrics in a wider window around the incident
- Elastic Beanstalk (EB) DescribeEvents for the same period (deploys, scaling, health changes)
- Access-log analysis for the period (if access logs are configured for that load balancer)
- CloudTrail write events for the period
It then synthesises a narrative that answers:
- What broke?
- When did it start, when did it recover?
- What's the most likely cause?
An investigation takes 30–60 seconds. The output appears as a normal Report row in Reports.
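To make the data-gathering step more concrete, here is a sketch of how the time window and two of the AWS lookups might be prepared. The 30-minute padding, the function name `investigation_requests`, and the decision to use the padded window for every source are assumptions, not a description of Cloudbrief's internals. The keyword arguments match the Elastic Beanstalk `DescribeEvents` and CloudTrail `LookupEvents` APIs.

```python
from datetime import datetime, timedelta, timezone


def investigation_requests(start, end, environment_name, padding=timedelta(minutes=30)):
    """Pad the incident window and build request kwargs for two lookups.

    `padding` is a hypothetical value; the real width of the wider window
    is not documented.
    """
    window_start, window_end = start - padding, end + padding
    return {
        "window": (window_start, window_end),
        # kwargs for boto3.client("elasticbeanstalk").describe_events(...)
        "eb_events": {
            "EnvironmentName": environment_name,
            "StartTime": window_start,
            "EndTime": window_end,
        },
        # kwargs for boto3.client("cloudtrail").lookup_events(...);
        # ReadOnly=false keeps only write events.
        "cloudtrail": {
            "LookupAttributes": [{"AttributeKey": "ReadOnly", "AttributeValue": "false"}],
            "StartTime": window_start,
            "EndTime": window_end,
        },
    }


incident_start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
incident_end = datetime(2024, 1, 1, 12, 10, tzinfo=timezone.utc)
requests = investigation_requests(incident_start, incident_end, "my-env")
print(requests["window"])
```

Building the kwargs separately from the API calls keeps the sketch runnable without AWS credentials; each dictionary can be unpacked into the matching boto3 call.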
Why this matters
You don't have alarms on everything. CloudWatch alarms are expensive to set up well and expensive to maintain (false positives, threshold drift), and there will always be incidents you discover only in retrospect, because no one had the foresight to alarm on the specific metric that broke.
Incident hunting flips the model: instead of expecting you to predict every failure mode and alarm on it, we mine the metrics you already have and tell you which windows look like incidents — then you decide whether to investigate.
Limits
- Window: last 30 days. We don't have older CloudWatch metric data on most accounts because of the standard retention.
- Resources: ALBs only for now. EB environment-level metrics and RDS performance anomalies are on the roadmap.
- The scanner is conservative — it surfaces fewer candidates than a human reviewer would. If a known incident isn't being detected, email us with the window + ARN and we'll tune the thresholds.
Cost
Each candidate investigation is one LLM call (~$0.05). The hunting scan itself is a series of CloudWatch GetMetricData calls and is essentially free (well under $0.01).
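For budgeting, the figures above reduce to simple arithmetic. The sketch below treats $0.01 as an upper bound for a scan, since the real cost is stated only as "well under" that, and uses the approximate $0.05 per investigation. The function name `estimate_cost` is illustrative.

```python
INVESTIGATION_COST_USD = 0.05  # approximate cost of one LLM call, per the figure above
SCAN_COST_CEILING_USD = 0.01   # scans cost well under this, so it is an upper bound


def estimate_cost(investigations: int, scans: int = 1) -> float:
    """Upper-bound estimate in USD, rounded to cents."""
    total = investigations * INVESTIGATION_COST_USD + scans * SCAN_COST_CEILING_USD
    return round(total, 2)


# Four weekly scans and twelve investigated candidates in a month.
print(estimate_cost(investigations=12, scans=4))
```

Under these assumptions, a month with four scans and twelve investigations stays below one dollar.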
Weekly digest
If you enable the weekly digest at the org level (Workspace → Usage → ... TBD UI for this), every Sunday at 19:00 IST we run the hunt across all your accounts and email a roll-up of the week's candidates with severity scores. Useful for spotting patterns across accounts.
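If you work in another timezone, it can help to convert the digest time to UTC. The sketch below assumes "IST" means India Standard Time (UTC+05:30); that reading is an assumption, and the helper `next_digest_run` is hypothetical rather than part of Cloudbrief.

```python
from datetime import datetime, timedelta, timezone

# Assumption: IST is India Standard Time, a fixed UTC+05:30 offset with no DST.
IST = timezone(timedelta(hours=5, minutes=30), "IST")


def next_digest_run(now: datetime) -> datetime:
    """Return the next Sunday 19:00 IST strictly after `now`, expressed in UTC."""
    local = now.astimezone(IST)
    days_ahead = (6 - local.weekday()) % 7  # Monday is 0, Sunday is 6
    candidate = (local + timedelta(days=days_ahead)).replace(
        hour=19, minute=0, second=0, microsecond=0
    )
    if candidate <= local:
        # Already past this week's run, so move to next Sunday.
        candidate += timedelta(days=7)
    return candidate.astimezone(timezone.utc)


print(next_digest_run(datetime.now(timezone.utc)))
```

Under that assumption, each run lands at 13:30 UTC on Sunday.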