Paperbrief
Architecture

Paperbrief architecture

The 30-second version

Upload → APScheduler worker picks it up → parse text → chunk → embed (OpenAI) → store in pgvector → ready to chat. Chat: embed question → top-K vector search → Claude synthesises answer with citations as a separate SSE event.

What runs where

Same single Lightsail box as Cloudbrief (see Cloudbrief architecture). Paperbrief lives in the same FastAPI process + APScheduler worker, but with its own tables, its own pipeline, its own per-org access toggle.

Ingestion lifecycle

┌───────────────────────────┐
│  POST /api/paperbrief/    │
│  documents (multipart)    │  ← from the frontend Upload button
└──────────┬────────────────┘


┌───────────────────────────┐
│  API:                     │
│    • validate size + type │
│    • create document row  │  status="pending"
│    • upload bytes to S3   │  paperbrief/{org_id}/{doc_id}/...
│    • commit               │
│    • return doc to user   │
└──────────┬────────────────┘
           │  user sees status=pending; UI polls every 5s


┌───────────────────────────┐
│  Worker (every 30s):      │
│    SELECT FROM            │
│      paperbrief_documents │
│    WHERE status='pending' │
│    LIMIT 5                │
└──────────┬────────────────┘
           │  for each pending doc

┌───────────────────────────┐
│  Ingester:                │
│    • flip → processing    │
│    • S3 download bytes    │
│    • parse → text + pages │
│    • if no text + OCR ok: │
│        Claude vision OCR  │
│    • language detect      │
│    • chunk (RecursiveCT)  │
│    • embed (OpenAI batch) │
│    • bulk-insert chunks   │
│    • flip → ready         │
│    • log usage events     │
└──────────┬────────────────┘


   status=ready in UI;
   checkbox enabled

Failures along the way flip status to failed with a user-facing status_message. No partial state — chunks are either fully inserted or not at all.

Chat lifecycle

┌─────────────────────────┐
│  POST /api/paperbrief/  │
│  chat (SSE stream)      │
└──────────┬──────────────┘


┌─────────────────────────┐
│  • emit conversation    │  ← so UI can update URL right away
│    event with conv id   │
│  • embed question       │  OpenAI text-embedding-3-small
│  • vector search        │  pgvector cosine, scoped by org_id
│  • emit citations event │  ← UI renders source pills
└──────────┬──────────────┘


┌─────────────────────────┐
│  Anthropic streaming:   │
│  system + user message  │
│  with citations as ctx  │
│                         │
│  for each token:        │
│    emit token event     │  ← UI appends to visible answer
│                         │
│  emit usage event       │  ← UI uses for cost display
│  emit done event        │
└──────────┬──────────────┘


   persist user msg + assistant msg
   to paperbrief_messages
   write usage event row

Data shapes

organizations
  ├─ org_members (FK CASCADE)
  ├─ paperbrief_documents (FK CASCADE)
  │    └─ paperbrief_chunks (FK CASCADE; embedding vector(1536))
  ├─ paperbrief_conversations (FK CASCADE)
  │    └─ paperbrief_messages (FK CASCADE)
  └─ paperbrief_usage_events (FK CASCADE)

Every multi-tenant table has org_id. Foreign keys are explicit; cascades are explicit. No ORM "soft" relationships.

pgvector index

Single ivfflat index on paperbrief_chunks.embedding with vector_cosine_ops, lists=100. Tuned for up to ~1M vectors. We'll migrate to HNSW when we cross that threshold (a one-shot reindex; no app code change).

The query plan filters by org_id + document_id ANY(...) BEFORE the ANN scan — so the search space is always small per request, not "all chunks in the system".

What's sent to AI providers

For one chat turn:

  • OpenAI text-embedding-3-small receives: your question text. Nothing else.
  • Anthropic Claude Sonnet receives: the system prompt (one paragraph instructing it to ground answers), the top-K chunk texts (with document title + chunk index labels for the model's context), and your question.

For one OCR pass:

  • Anthropic Claude Sonnet receives: a base64 PNG of one page + a transcription prompt. No other context.

We don't send PII metadata (your name, email, org name) to providers. The document/chunk content itself is whatever you uploaded.

Why this scales

The pipeline is embarrassingly parallel — adding worker capacity means processing more pending docs concurrently. Chat is stateless per-request; the only scaling bottleneck is the LLM API itself.

At our current scale (≤ 10 orgs), one Lightsail box handles everything comfortably. See Cloudbrief architecture → Scaling milestones.

Repo layout (relevant slice)

PathWhat
backend/app/paperbrief/parsers/PDF / DOCX / TXT extractors
backend/app/paperbrief/chunker.pyRecursiveCharacterTextSplitter w/ Indic separators
backend/app/paperbrief/embedder.pyOpenAI embeddings (batched, retried)
backend/app/paperbrief/ocr.pyClaude vision OCR
backend/app/paperbrief/storage.pyS3 wrapper, per-org prefix
backend/app/paperbrief/ingest.pyOrchestrator
backend/app/paperbrief/retrieval.pypgvector cosine search
backend/app/paperbrief/chat.pyClaude streaming + SSE events
backend/app/paperbrief/usage.pyPer-event cost recording
backend/app/api/paperbrief.pyHTTP endpoints
worker/jobs/paperbrief_ingester.py30-second polling job