Cloudbrief architecture

The 30-second version: Cloudbrief is a FastAPI backend, a Next.js static-export frontend, and an APScheduler worker, all running on a single AWS Lightsail box. Once a day, the worker assumes a role in your account, pulls a defined data set, runs the detector layer, optionally invokes Claude for synthesis, and stores the result.

The pieces

                                            ┌───────────────────────────┐
   browser ── HTTPS ── Cloudflare edge ──── │ Lightsail (ap-south-1)    │
                       (proxied, strict)    │                           │
                       cert: CF Origin CA   │  nginx (TLS terminate)    │
                                            │   │                       │
                                            │   ├─→ static frontend     │
                                            │   │   /var/www/analyzer/  │
                                            │   │                       │
                                            │   └─→ /api/* → FastAPI    │
                                            │       (uvicorn :8000)     │
                                            │           │               │
                                            │           ▼               │
                                            │    PostgreSQL 16 +        │
                                            │    pgvector (local)       │
                                            │                           │
                                            │    APScheduler worker     │
                                            │    (heartbeat, daily,     │
                                            │     weekly digest, etc.)  │
                                            └────┬──────────────────────┘
                                                 │
                                                 │ STS:AssumeRole
                                                 ▼
                                            ┌────────────────────────────┐
                                            │  YOUR AWS account          │
                                            │  (cross-account read-only) │
                                            │                            │
                                            │  Cost Explorer · CloudWatch│
                                            │  PI · CloudTrail · ALB · EB│
                                            └────────────────────────────┘
                                                       │
                                                       │  Synthesis data
                                                       ▼
                                            ┌────────────────────────────┐
                                            │  Anthropic Claude API      │
                                            │  (only when detectors fire)│
                                            └────────────────────────────┘
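
The STS:AssumeRole hop in the diagram can be sketched roughly as follows. This is a minimal illustration, not the worker's actual code: the function name, the session name, and the use of an external ID are assumptions, and `sts_client` is expected to behave like `boto3.client("sts")`.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class TemporaryCredentials:
    access_key_id: str
    secret_access_key: str
    session_token: str


def assume_readonly_role(sts_client: Any, role_arn: str, external_id: str) -> TemporaryCredentials:
    """Assume the customer's cross-account role and return short-lived credentials."""
    response = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName="cloudbrief-daily",  # hypothetical session name
        ExternalId=external_id,              # assumption: the role's trust policy requires one
        DurationSeconds=900,                 # 900 s is the shortest session STS allows
    )
    creds = response["Credentials"]
    return TemporaryCredentials(
        access_key_id=creds["AccessKeyId"],
        secret_access_key=creds["SecretAccessKey"],
        session_token=creds["SessionToken"],
    )
```

Passing the client in as an argument keeps the function easy to exercise with a fake, without touching a real AWS account.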

What runs where

Component	Lives	Purpose
Frontend SPA	Lightsail nginx (`/var/www/analyzer/`), served through the Cloudflare edge (HTML/JS/CSS)	Renders the dashboard, reports, and investigations UI
API	AWS Lightsail (`analyzer-api` systemd unit)	Auth, CRUD, on-demand investigation triggers
Worker	AWS Lightsail (`analyzer-worker` systemd unit)	Scheduled jobs: daily analysis, weekly digest, Paperbrief ingestion
Database	AWS Lightsail (Postgres 16, local)	Everything multi-tenant lives here
Report blob storage	S3 `analyzer-reports-*`	Synthesised reports, retained forever
Backups	S3 `analyzer-backups-*`	Daily DB backups (retained 30d)
Secrets	AWS SSM Parameter Store	API keys, DB password, pgcrypto key, TLS cert + key

Daily analysis lifecycle

At 09:00 IST, a cron job fires in the worker for every aws_accounts row where daily_enabled = true.
The worker decrypts the AWS credentials (column-encrypted via pgcrypto), then assumes the cross-account role via STS.
The collectors run in parallel: Cost Explorer for the last 8 days, CloudWatch metrics for relevant resources, PI for top SQL, and so on.
Raw collected data is fed to every detector in worker/detectors/. Each detector emits zero or more Signal rows into detector_signals (fired or not — un-fired rows are kept for recall analysis).
If any detector fires:
- The synthesis prompt is built (system + fired signals + narrow data slices)
- Claude is called with model = claude-sonnet-4-6, streaming off
- Response is parsed for findings + root-cause chains
- An analysis_runs row + findings rows are written
- Report is rendered to HTML, uploaded to S3
- Email is sent via SES to the account's recipients (if any configured)
If no detector fired:
- An analysis_runs row is written with status "all_clear"
- No LLM call, no report HTML, no email content beyond a one-liner
- Total cost: ~$0.000
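
The branch at the end of the lifecycle (synthesise only when something fired) can be sketched as below. This is an illustrative outline under stated assumptions: `Signal`, `run_daily_analysis`, the `synthesize` callback, and the "findings" status string are hypothetical names, and only "all_clear" comes from the description above.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Signal:
    detector: str
    fired: bool
    detail: str = ""


# A detector takes the raw collected data and returns zero or more signals.
Detector = Callable[[dict], Iterable[Signal]]


def run_daily_analysis(
    collected: dict,
    detectors: list[Detector],
    synthesize: Callable[[list[Signal]], str],
) -> dict:
    """Run every detector, then call the LLM only if at least one signal fired."""
    signals = [s for detect in detectors for s in detect(collected)]
    fired = [s for s in signals if s.fired]

    if not fired:
        # Quiet day: no LLM call and no report, but all signals are still kept.
        return {"status": "all_clear", "signals": signals, "report": None}

    # Only the fired signals are handed to synthesis.
    return {"status": "findings", "signals": signals, "report": synthesize(fired)}
```

Returning every signal, fired or not, mirrors the point that un-fired rows are stored for recall analysis, while the LLM callback is never reached on a quiet day.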

Cost model

Per AWS account per day:

Data collection: free (CloudWatch / Cost Explorer API usage stays well within free limits)
Detector layer: free (pure Python, runs in worker)
LLM synthesis: $0.04–$0.20 per fired-signal day, $0 per quiet day
SES email: $0.0001 per email, negligible

Typical monthly spend per AWS account: $0.50–$2 if mostly-quiet, $2–$8 if you actively investigate every week.
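
As a rough worked example of the arithmetic, the per-day figures above can be turned into a monthly range. The function name and its inputs are hypothetical, and the estimate covers only LLM synthesis and SES email, the two non-zero line items listed.

```python
LLM_COST_PER_FIRED_DAY = (0.04, 0.20)  # USD, low and high, from the figures above
SES_COST_PER_EMAIL = 0.0001            # USD


def estimate_monthly_cost(fired_days: int, emails_sent: int) -> tuple[float, float]:
    """Return a (low, high) monthly estimate in USD for one AWS account."""
    low, high = LLM_COST_PER_FIRED_DAY
    email_cost = emails_sent * SES_COST_PER_EMAIL
    return (
        round(fired_days * low + email_cost, 4),
        round(fired_days * high + email_cost, 4),
    )


# Ten fired-signal days and one email per day in a 30-day month.
print(estimate_monthly_cost(fired_days=10, emails_sent=30))  # (0.403, 2.003)
```

A month with no fired signals costs only the email line, which is a fraction of a cent.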

Storage shapes

organizations
  ├─ org_members (FK CASCADE)
  └─ aws_accounts (FK CASCADE)
       └─ analysis_runs (FK CASCADE)
            ├─ findings (FK CASCADE)
            └─ detector_signals (FK CASCADE)

Every multi-tenant table has org_id (or transitively belongs to one), enforced at the FK level. No table is shared between Cloudbrief and Paperbrief.
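
To illustrate what the FK CASCADE chain means in practice, here is a small self-contained sketch. It uses an in-memory SQLite database as a stand-in for Postgres, and the column names (`org_id`, `aws_account_id`, `run_id`) are assumptions rather than the real schema; only the table names come from the tree above.

```python
import sqlite3

SCHEMA = """
CREATE TABLE organizations (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE aws_accounts (
    id INTEGER PRIMARY KEY,
    org_id INTEGER NOT NULL REFERENCES organizations(id) ON DELETE CASCADE
);
CREATE TABLE analysis_runs (
    id INTEGER PRIMARY KEY,
    aws_account_id INTEGER NOT NULL REFERENCES aws_accounts(id) ON DELETE CASCADE,
    status TEXT NOT NULL
);
CREATE TABLE findings (
    id INTEGER PRIMARY KEY,
    run_id INTEGER NOT NULL REFERENCES analysis_runs(id) ON DELETE CASCADE
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create the demo schema with one row at each level of the tree."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO organizations (id, name) VALUES (1, 'acme')")
    conn.execute("INSERT INTO aws_accounts (id, org_id) VALUES (10, 1)")
    conn.execute(
        "INSERT INTO analysis_runs (id, aws_account_id, status) VALUES (100, 10, 'all_clear')"
    )
    conn.execute("INSERT INTO findings (id, run_id) VALUES (1000, 100)")
    conn.commit()
    return conn


def row_count(conn: sqlite3.Connection, table: str) -> int:
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```

Deleting the single `organizations` row in this demo removes the dependent account, run, and finding rows with it, which is the behaviour the cascade chain is meant to guarantee for tenant offboarding.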

Why a single Lightsail box

We started cheap. At our current scale (≤ 10 organisations), one box runs the entire platform comfortably:

1 vCPU, 1 GB RAM (Lightsail small_3_1)
~$12/month
DB + worker + API all on the same host

Scaling milestones, in order of when we'd hit them:

~50 organisations: split DB to RDS Postgres (single instance, no replicas yet). Application code change: zero (DATABASE_URL env var).
~200 organisations: split worker to its own box so DB I/O during heavy ingest doesn't block API responses.
~1000 organisations: move from Lightsail to ECS+Fargate so we can scale horizontally and run multiple worker replicas.

We're nowhere near 50 today. Premature horizontal scaling would just cost more.
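
The "zero code change" claim for the first milestone rests on the database location coming entirely from the DATABASE_URL env var. A minimal sketch of that pattern, assuming a hypothetical local default URL and helper name:

```python
import os
from urllib.parse import urlsplit

# Hypothetical default for the single-box setup; the real value is not documented here.
DEFAULT_DATABASE_URL = "postgresql://analyzer@localhost:5432/analyzer"


def database_target(env=None) -> dict:
    """Resolve host, port, and database name from DATABASE_URL."""
    env = os.environ if env is None else env
    url = urlsplit(env.get("DATABASE_URL", DEFAULT_DATABASE_URL))
    return {
        "host": url.hostname,
        "port": url.port or 5432,
        "database": url.path.lstrip("/"),
    }
```

Pointing DATABASE_URL at an RDS endpoint changes the resolved host, and nothing in the calling code needs to know whether Postgres is local or remote.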

Repo layout

Path	What
`backend/app/`	FastAPI + SQLAlchemy + Alembic migrations
`worker/`	APScheduler entry, detectors, collectors, analyzers
`frontend/`	Next.js static-export SPA
`nginx/`	TLS / vhost config installed on the box
`scripts/`	bootstrap, deploy, backup, restore

Source-of-truth repo: github.com/manishrgaud7781/aws-connector-ai.
