Cloudbrief architecture
The 30-second version: Cloudbrief is a FastAPI backend + a Next.js static-export frontend + an APScheduler worker, all running on a single AWS Lightsail box. Once a day, the worker assumes a role into your account, pulls a defined data set, runs the detector layer, invokes Claude for synthesis only if a detector fires, and stores the result.
The pieces
```
                                             ┌──────────────────────────────┐
browser ── HTTPS ──→ Cloudflare edge ──────→ │ Lightsail (ap-south-1)       │
                     (proxied, strict)       │                              │
                     cert: CF Origin CA      │ nginx (TLS terminate)        │
                                             │   │                          │
                                             │   ├── static frontend        │
                                             │   │   /var/www/analyzer/     │
                                             │   │                          │
                                             │   └── /api/* → FastAPI       │
                                             │       (uvicorn :8000)        │
                                             │          │                   │
                                             │          ▼                   │
                                             │   PostgreSQL 16 +            │
                                             │   pgvector (local)           │
                                             │                              │
                                             │   APScheduler worker         │
                                             │   (heartbeat, daily,         │
                                             │    weekly digest, etc.)      │
                                             └─────┬────────────────────────┘
                                                   │
                                                   │ STS:AssumeRole
                                                   ▼
                                             ┌──────────────────────────────┐
                                             │ YOUR AWS account             │
                                             │ (cross-account read-only)    │
                                             │                              │
                                             │ Cost Explorer · CloudWatch   │
                                             │ PI · CloudTrail · ALB · EB   │
                                             └─────┬────────────────────────┘
                                                   │
                                                   │ Synthesis data
                                                   ▼
                                             ┌──────────────────────────────┐
                                             │ Anthropic Claude API         │
                                             │ (only when detectors fire)   │
                                             └──────────────────────────────┘
```
What runs where
| Component | Lives | Purpose |
|---|---|---|
| Frontend SPA | Cloudflare CDN edge (HTML/JS/CSS) | Renders the dashboard, reports, investigations UI |
| API | AWS Lightsail (analyzer-api systemd unit) | Auth, CRUD, on-demand investigation triggers |
| Worker | AWS Lightsail (analyzer-worker systemd unit) | Scheduled jobs: daily analysis, weekly digest, Paperbrief ingestion |
| Database | AWS Lightsail (Postgres 16, local) | Everything multi-tenant lives here |
| Report blob storage | S3 analyzer-reports-* | Synthesised reports, retained forever |
| Backups | S3 analyzer-backups-* | Daily DB backups (retained 30d) |
| Secrets | AWS SSM Parameter Store | API keys, DB password, pgcrypto key, TLS cert + key |
Daily analysis lifecycle
- 09:00 IST cron fires in the worker for every `aws_accounts` row where `daily_enabled = true`.
- Worker decrypts the AWS credentials (column-encrypted via pgcrypto) and assumes the cross-account role via STS.
- Each collector runs in parallel: Cost Explorer for the last 8 days, CloudWatch metrics for relevant resources, PI for top SQL, etc.
- Raw collected data is fed to every detector in `worker/detectors/`. Each detector emits zero or more Signal rows into `detector_signals`, whether it fired or not (un-fired rows are kept for recall analysis).
- If any detector fires:
  - The synthesis prompt is built (system prompt + fired signals + narrow data slices).
  - Claude is called with `model = claude-sonnet-4-6`, streaming off.
  - The response is parsed for findings + root-cause chains.
  - An `analysis_runs` row and `findings` rows are written.
  - The report is rendered to HTML and uploaded to S3.
  - Email is sent via SES to the account's recipients (if any are configured).
- If no detector fired:
  - An `analysis_runs` row is written with status "all_clear".
  - No LLM call, no report HTML, and no email content beyond a one-liner.
  - Total cost: ~$0.000
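A minimal Python sketch of the gating logic above, assuming collectors and detectors are plain callables. Every name here (`Signal`, `collect_all`, `run_daily_analysis`, `cost_spike_detector`) is hypothetical rather than taken from `worker/`, the 1.5× spike threshold is invented, and the `"completed"` status for a fired day is a placeholder (only `"all_clear"` comes from the lifecycle). A `synthesise` callback stands in for prompt building and the Claude call; persistence, S3 and SES are left out.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Signal:
    """One detector result. Un-fired signals are kept too (for recall analysis)."""
    detector: str
    fired: bool
    detail: Dict[str, Any] = field(default_factory=dict)


Collector = Callable[[], Any]
Detector = Callable[[Dict[str, Any]], List[Signal]]
Synthesiser = Callable[[List[Signal], Dict[str, Any]], str]


def collect_all(collectors: Dict[str, Collector]) -> Dict[str, Any]:
    """Run every collector concurrently and key each result by collector name."""
    with ThreadPoolExecutor(max_workers=max(len(collectors), 1)) as pool:
        futures = {name: pool.submit(fn) for name, fn in collectors.items()}
        return {name: fut.result() for name, fut in futures.items()}


def run_daily_analysis(
    collectors: Dict[str, Collector],
    detectors: List[Detector],
    synthesise: Synthesiser,
) -> Dict[str, Any]:
    data = collect_all(collectors)
    # Every detector sees the full collected data set; all signals are returned.
    signals = [signal for detect in detectors for signal in detect(data)]
    fired = [signal for signal in signals if signal.fired]
    if not fired:
        # Quiet day: no LLM call and no report, just an all-clear record.
        return {"status": "all_clear", "signals": signals, "report": None}
    # Only the fired signals (plus the collected data) reach synthesis.
    report = synthesise(fired, data)
    return {"status": "completed", "signals": signals, "report": report}


def cost_spike_detector(data: Dict[str, Any]) -> List[Signal]:
    """Hypothetical detector: latest daily cost vs. the mean of the 7 days before it."""
    daily_costs = data["cost_explorer"]  # 8 daily totals, oldest first
    baseline = sum(daily_costs[:-1]) / len(daily_costs[:-1])
    latest = daily_costs[-1]
    fired = latest > 1.5 * baseline
    return [Signal("cost_spike", fired, {"latest": latest, "baseline": baseline})]
```

On eight days of flat cost the detector emits one un-fired signal and the function returns `"all_clear"` without calling `synthesise`; if the last day is three times the baseline, the signal fires and `synthesise` receives only the fired signals. Threads suit the collectors here because they are I/O-bound API calls.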
Cost model
Per AWS account per day:
- Data collection: close to free, but not zero: the Cost Explorer API is billed at $0.01 per request, so each Cost Explorer request made once a day adds about $0.30 a month; CloudWatch / PI API usage at this volume is small
- Detector layer: free (pure Python, runs in worker)
- LLM synthesis: $0.04–$0.20 per fired-signal day, $0 per quiet day
- SES email: $0.0001 per email, negligible
Typical monthly spend per AWS account: $0.50–$2 if mostly quiet, $2–$8 if you actively investigate every week.
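The per-day figures above roll up into a monthly estimate with simple arithmetic. The sketch below uses the LLM and SES rates from the list, assumes one email per day (the quiet-day one-liner counts) and a 30-day month, and leaves out data collection and on-demand investigations, so it covers scheduled daily runs only. `monthly_cost_range` is a hypothetical helper.

```python
from typing import Tuple

# Per-unit prices from the cost model above (USD).
LLM_COST_PER_FIRED_DAY = (0.04, 0.20)  # (low, high) per fired-signal day
SES_COST_PER_EMAIL = 0.0001


def monthly_cost_range(
    fired_days: int, emails_per_day: int = 1, days_in_month: int = 30
) -> Tuple[float, float]:
    """Return (low, high) monthly spend for one AWS account from daily runs only."""
    if not 0 <= fired_days <= days_in_month:
        raise ValueError("fired_days must be between 0 and days_in_month")
    # SES is charged every day an email goes out, quiet or not.
    ses = emails_per_day * days_in_month * SES_COST_PER_EMAIL
    low_rate, high_rate = LLM_COST_PER_FIRED_DAY
    # Quiet days add $0 of LLM spend, so only fired days are multiplied out.
    return (
        round(fired_days * low_rate + ses, 4),
        round(fired_days * high_rate + ses, 4),
    )
```

For five fired days in a month, `monthly_cost_range(5)` returns `(0.203, 1.003)`: $0.20–$1.00 of LLM spend plus $0.003 of SES. Because on-demand investigations are not modelled, these numbers need not match the typical ranges quoted above.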
Storage shapes
```
organizations
├─ org_members            (FK CASCADE)
└─ aws_accounts           (FK CASCADE)
   └─ analysis_runs       (FK CASCADE)
      ├─ findings         (FK CASCADE)
      └─ detector_signals (FK CASCADE)
```
Every multi-tenant table has `org_id` (or transitively belongs to one), enforced at the FK level. No table is shared between Cloudbrief and Paperbrief.
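To see the cascade in action, here is a minimal sketch using Python's built-in `sqlite3` as a stand-in for Postgres. The parent/child nesting follows the tree above; the child key columns (`aws_account_id`, `analysis_run_id`) are hypothetical, and real tables would carry more columns. Note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`, whereas Postgres always enforces them.

```python
import sqlite3

# Hypothetical minimal schema mirroring the ownership tree. SQLite stands in for
# Postgres here; column names other than org_id are illustrative assumptions.
SCHEMA = """
CREATE TABLE organizations (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE org_members (
    id     INTEGER PRIMARY KEY,
    org_id INTEGER NOT NULL REFERENCES organizations(id) ON DELETE CASCADE
);
CREATE TABLE aws_accounts (
    id     INTEGER PRIMARY KEY,
    org_id INTEGER NOT NULL REFERENCES organizations(id) ON DELETE CASCADE
);
CREATE TABLE analysis_runs (
    id             INTEGER PRIMARY KEY,
    aws_account_id INTEGER NOT NULL REFERENCES aws_accounts(id) ON DELETE CASCADE
);
CREATE TABLE findings (
    id              INTEGER PRIMARY KEY,
    analysis_run_id INTEGER NOT NULL REFERENCES analysis_runs(id) ON DELETE CASCADE
);
CREATE TABLE detector_signals (
    id              INTEGER PRIMARY KEY,
    analysis_run_id INTEGER NOT NULL REFERENCES analysis_runs(id) ON DELETE CASCADE
);
"""

TABLES = ("organizations", "org_members", "aws_accounts",
          "analysis_runs", "findings", "detector_signals")


def connect() -> sqlite3.Connection:
    """Open an in-memory database with foreign-key enforcement switched on."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
    conn.executescript(SCHEMA)
    return conn


def seed_one_org(conn: sqlite3.Connection) -> None:
    """Insert one row at every level of the tree, all owned by org 1."""
    conn.execute("INSERT INTO organizations (id, name) VALUES (1, 'acme')")
    conn.execute("INSERT INTO org_members (org_id) VALUES (1)")
    conn.execute("INSERT INTO aws_accounts (id, org_id) VALUES (10, 1)")
    conn.execute("INSERT INTO analysis_runs (id, aws_account_id) VALUES (100, 10)")
    conn.execute("INSERT INTO findings (analysis_run_id) VALUES (100)")
    conn.execute("INSERT INTO detector_signals (analysis_run_id) VALUES (100)")


def row_counts(conn: sqlite3.Connection) -> dict:
    # Table names come from the fixed TABLES tuple, never from user input.
    return {t: conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0] for t in TABLES}
```

After `seed_one_org`, deleting the organization with `DELETE FROM organizations WHERE id = 1` cascades down every level and leaves all six tables empty; inserting an `aws_accounts` row for a non-existent org raises `sqlite3.IntegrityError`.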
Why a single Lightsail box
We started cheap. At our current scale (≤ 10 organisations), one box runs the entire platform comfortably:
- 1 vCPU, 1 GB RAM (Lightsail `small_3_1`)
- ~$12/month
- DB + worker + API all on the same host
Scaling milestones, in order of when we'd hit them:
- ~50 organisations: split DB to RDS Postgres (single instance, no replicas yet). Application code change: zero; only the `DATABASE_URL` env var changes.
- ~200 organisations: split worker to its own box so DB I/O during heavy ingest doesn't block API responses.
- ~1000 organisations: move from Lightsail to ECS+Fargate so we can scale horizontally and run multiple worker replicas.
We're nowhere near 50 today. Premature horizontal scaling would just cost more.
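The first milestone's zero-code-change claim holds as long as the application derives every connection detail from `DATABASE_URL`. A minimal stdlib sketch of that pattern follows; `database_settings` is a hypothetical helper, and the database name `analyzer` and the RDS hostname in the usage note are made up. In a SQLAlchemy app the same idea is usually just `create_engine(os.environ["DATABASE_URL"])`.

```python
import os
from typing import Any, Dict, Mapping, Optional
from urllib.parse import urlsplit


def database_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Derive DB connection settings from DATABASE_URL alone (hypothetical helper)."""
    source = os.environ if env is None else env
    url = urlsplit(source["DATABASE_URL"])
    return {
        "host": url.hostname,
        "port": url.port or 5432,  # Postgres default when the URL omits a port
        "dbname": url.path.lstrip("/"),
        "user": url.username,
    }
```

Moving from `postgresql://analyzer@localhost:5432/analyzer` to a hypothetical `postgresql://analyzer@db.example.ap-south-1.rds.amazonaws.com/analyzer` changes only `host` in the returned settings (the port falls back to 5432); the calling code is untouched.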
Repo layout
| Path | What |
|---|---|
| `backend/app/` | FastAPI + SQLAlchemy + Alembic migrations |
| `worker/` | APScheduler entry, detectors, collectors, analyzers |
| `frontend/` | Next.js static-export SPA |
| `nginx/` | TLS / vhost config installed on the box |
| `scripts/` | bootstrap, deploy, backup, restore |
Source-of-truth repo: github.com/manishrgaud7781/aws-connector-ai.