Chat & citations
What happens when you ask a question
- Your question text is embedded into a 1536-dimensional vector via OpenAI's
text-embedding-3-small. - pgvector does a cosine-similarity search over
paperbrief_chunks, scoped to your org AND the documents you currently have selected. - The top 5 chunks become the context for Claude Sonnet.
- Claude streams an answer that uses ONLY those chunks (system prompt enforces this).
- The chunks are returned as a separate
citationsevent in the stream — not inlined by the model — so the UI can render them as pills under the answer.
The frontend renders tokens as they stream in (~50 tokens/sec on a typical answer).
Why citations can't hallucinate
The naive approach is to tell the model: "after your answer, list the sources you used as [Source: filename, chunk N]". The model sometimes invents source labels that look plausible but don't exist (chunk 4 of a document that only has 3 chunks, for instance).
Paperbrief never does this. The citation list comes from the retrieval step — it's the EXACT chunks the model received as context. The system prompt explicitly says:
Do not include
[Source: ...]markers in your answer — the UI renders citations separately from a structured payload we attach.
So even if Claude hallucinates a source name in the prose, it can't end up in the citations pill list, because that list is built from the pgvector query result, not from the model's text.
What you'll see when there's no good match
Two failure modes:
"I could not find this information in the uploaded documents."
The retrieval step returned chunks but none scored high enough to be relevant. The system prompt instructs Claude to say this exact sentence when context doesn't cover the question — followed by 2–3 suggested questions you could ask instead, based on topics visible in the retrieved chunks.
This is a feature, not a bug. The model could try to make something up; we'd rather it tell you the truth.
Empty retrieval (no chunks at all)
Different message: "I could not find content in the selected documents that closely matches your question." with practical suggestions (rephrase, check the right docs are selected, etc.). Happens when the question is wildly off-topic for the selected docs.
Tuning retrieval
Currently fixed at top-K = 5 chunks. We've found this is the sweet spot for most questions — more chunks means more context but also more chance of distracting the model. If your documents are very long and dense (legal, scientific), you may want more; if very short, fewer. Not configurable per-question today.
To get better answers:
- Select fewer documents per chat (smaller search space = more focused retrieval)
- Ask specific questions, not "tell me about this document" (vague questions retrieve loose matches)
- Use terms that appear in the document (the retrieval is vector-based but still benefits from term overlap)
What the model knows
- The text content of the top-5 chunks for this question
- The system prompt (one paragraph instructing it to ground answers, respond in English, ask clarifying questions when needed)
- The full conversation history is NOT included — every question is answered fresh against the document context. Conversations are persisted for your reference, not as model memory.
Tokens + cost
Per chat round-trip:
- Question embedding: ~10–100 tokens at $0.02/M = essentially free
- LLM input (chunks + question + system prompt): ~1500–4000 tokens at $3/M = $0.005–$0.012
- LLM output (the answer): ~200–1000 tokens at $15/M = $0.003–$0.015
Typical question: ~$0.01–$0.02 each. See Usage & billing for live numbers and how they roll up.
Streaming details
The chat endpoint uses Server-Sent Events (SSE). Frame types:
| Type | Payload | Purpose |
|---|---|---|
conversation | {conversation_id} | Emitted first when a new conversation is created |
citations | {citations: [...]} | Emitted once, before tokens — UI renders the source pills immediately |
token | {text: "..."} | One per streamed token, appended to the visible answer |
usage | {embedding_tokens, prompt_tokens, completion_tokens} | Emitted once after tokens, used to compute cost |
error | {text: "..."} | If something goes wrong mid-stream; UI shows as an error toast |
done | (no payload) | Signals end-of-stream; UI should close the EventSource |