Supported formats & OCR
Direct text extraction
| Format | Parser | Notes |
|---|---|---|
| .pdf | PyMuPDF (fitz) | Text layer extracted page by page. Page order is preserved; page numbers feed into citations. |
| .docx | python-docx | Paragraphs only. Tables and headers/footers are intentionally ignored in v1 (clean retrieval over complete coverage). |
| .txt | UTF-8 decode | Decoded with errors="replace", so invalid bytes become � instead of crashing. |
| .md / .markdown | UTF-8 decode | Treated as plain text: markdown is NOT rendered for retrieval, just indexed as raw text. |
Other formats (xlsx, pptx, html, eml, etc.) are not supported today. Tell us if you have a real use case.
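As a rough illustration of the .txt and .md rows above, the sketch below shows what decoding with errors="replace" does to bytes that are not valid UTF-8. The function name decode_text is hypothetical and only stands in for whatever the ingester actually calls.

```python
def decode_text(raw: bytes) -> str:
    """Decode uploaded bytes as UTF-8, substituting U+FFFD for invalid bytes."""
    # errors="replace" swaps each undecodable byte for the Unicode
    # replacement character instead of raising UnicodeDecodeError.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # 0xFF can never appear in valid UTF-8, so it becomes a single U+FFFD.
    print(decode_text(b"abc\xffdef"))  # prints abc�def
```

Valid UTF-8 passes through unchanged, so Gujarati or other non-Latin text in a .txt file is not affected by the replacement behaviour.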
File size
- Max 25 MB per file today.
- The chunker caps at 2000 chunks per document — for most well-formed text that's ~300–500 pages. PDFs heavy with images will hit the size cap before the chunk cap.
OCR (scanned PDFs)
If a PDF has no extractable text layer, we run Claude Sonnet vision OCR automatically:
- Each page is rendered to a PNG at 144 DPI by PyMuPDF.
- The image is sent to Claude with a prompt asking for verbatim transcription.
- The transcribed text per page is concatenated and chunked normally.
OCR is capped at 50 pages per document to keep cost bounded. If your PDF is longer, only the first 50 pages are indexed and a warning is logged. To raise the limit, adjust paperbrief_ocr_max_pages in the platform config; this is currently a config change, not a UI toggle.
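The OCR steps above can be sketched roughly as follows. This is a minimal sketch, not the production pipeline: the helper names (pages_to_ocr, render_page_png, ocr_pdf) are hypothetical, the transcribe callable stands in for the Claude vision request, and the blank-line separator between pages is an assumption. Only the 144 DPI rendering and the 50-page cap come from the description above.

```python
import logging
from typing import Callable

log = logging.getLogger("ocr_sketch")

OCR_MAX_PAGES = 50  # mirrors the paperbrief_ocr_max_pages default described above
OCR_DPI = 144


def pages_to_ocr(page_count: int, max_pages: int = OCR_MAX_PAGES) -> range:
    """Return the page indices that will be OCR'd, warning if the cap truncates."""
    if page_count > max_pages:
        log.warning("PDF has %d pages; only the first %d will be OCR'd",
                    page_count, max_pages)
    return range(min(page_count, max_pages))


def render_page_png(doc, page_index: int, dpi: int = OCR_DPI) -> bytes:
    """Render one page of an open PyMuPDF document to PNG bytes."""
    import fitz  # PyMuPDF, imported lazily so the pure helpers work without it

    zoom = dpi / 72  # PDF user space is 72 units per inch, so 144 DPI is 2x
    pix = doc.load_page(page_index).get_pixmap(matrix=fitz.Matrix(zoom, zoom))
    return pix.tobytes("png")


def ocr_pdf(path: str, transcribe: Callable[[bytes], str],
            max_pages: int = OCR_MAX_PAGES) -> str:
    """OCR a scanned PDF page by page and join the transcriptions.

    `transcribe` is a hypothetical callable that sends one PNG to the
    vision model and returns the verbatim text for that page.
    """
    import fitz

    with fitz.open(path) as doc:
        texts = [transcribe(render_page_png(doc, i))
                 for i in pages_to_ocr(doc.page_count, max_pages)]
    # Assumed separator: a blank line between pages before chunking.
    return "\n\n".join(texts)
```

Passing the transcription step in as a callable keeps the rendering and capping logic testable without any network access.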
Cost of OCR
Each page is ~2000 input tokens (the image) + ~500 output tokens (the transcribed text). At Claude Sonnet rates that's roughly $0.013 per page. A 30-page scanned PDF: ~$0.40 in OCR fees.
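The arithmetic behind those figures can be reproduced with a few lines. The token counts come from the paragraph above; the per-million-token prices ($3 input, $15 output) are an assumption chosen because they match the roughly $0.013 per page quoted here, so check current pricing before relying on them.

```python
INPUT_TOKENS_PER_PAGE = 2000   # approximate image tokens per page (from the text above)
OUTPUT_TOKENS_PER_PAGE = 500   # approximate transcription tokens per page

# Assumed prices in USD per million tokens; not taken from this document.
INPUT_USD_PER_MTOK = 3.00
OUTPUT_USD_PER_MTOK = 15.00


def ocr_cost_usd(pages: int) -> float:
    """Estimate OCR cost for a number of scanned pages."""
    per_page = (INPUT_TOKENS_PER_PAGE * INPUT_USD_PER_MTOK
                + OUTPUT_TOKENS_PER_PAGE * OUTPUT_USD_PER_MTOK) / 1_000_000
    return pages * per_page


if __name__ == "__main__":
    print(f"{ocr_cost_usd(30):.3f}")  # prints 0.405
    print(f"{ocr_cost_usd(50):.3f}")  # prints 0.675, the cost at the default page cap
```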
When OCR doesn't fire
The PDF parser only triggers OCR when it extracts zero text. A PDF that's mostly images with one stray text element (page numbers, watermark) will NOT trigger OCR even though most of the content is unreadable. Workaround: convert the PDF to image-only externally (e.g. gs -sDEVICE=png16m -r144 ...), then re-upload.
We're considering an explicit "force OCR" toggle on upload — let us know if you want it.
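To make the trigger rule concrete, here is a minimal sketch of the decision as described: OCR runs only when no page yields any text. The function name needs_ocr is hypothetical, and treating whitespace-only output as "no text" is an assumption.

```python
from typing import Iterable


def needs_ocr(page_texts: Iterable[str]) -> bool:
    """Return True only when every page's extracted text is empty.

    Assumption: whitespace-only output counts as empty.
    """
    return not any(text.strip() for text in page_texts)


if __name__ == "__main__":
    print(needs_ocr(["", "\n", ""]))  # prints True: fully scanned PDF
    print(needs_ocr(["", "", "3"]))   # prints False: one stray page number blocks OCR
```

The second call shows the failure mode described above: a single extracted character is enough to skip OCR for the whole document.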
What we don't extract
- Image content within text PDFs. If a slide deck has charts you'd want indexed, we miss those. Convert relevant ones to text first.
- Tables. PyMuPDF extracts table cell text but loses the layout, so questions like "what's in row 5 column 2" don't work well. For tabular data, a CSV/Excel-style ingester is on the roadmap.
- Code blocks in markdown are extracted as text (no syntax-aware splitting).
- Form fields in PDF forms. Filled values that live in the AcroForm layer aren't picked up by get_text("text").
Indic / non-Latin scripts
Fully supported — the chunker uses Indic-aware separators (Gujarati danda ।, double-danda ॥, plus standard Latin breaks) so sentence boundaries aren't lost. Language detection labels each document so you know what's in scope — visible in the doc card.
See Multi-language for the details.
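As an illustration of what Indic-aware separators mean in practice, the sketch below splits text on the danda (U+0964), the double danda (U+0965), and the usual Latin sentence enders. It is a simplified stand-in, not the real chunker: the function name and the exact separator set are assumptions.

```python
import re

# Assumed separator set: danda, double danda, and Latin . ! ?
# The lookbehind keeps the terminator attached to the sentence it ends.
_SENTENCE_BREAK = re.compile(r"(?<=[\u0964\u0965.!?])\s+")


def split_sentences(text: str) -> list[str]:
    """Split text at whitespace that follows a sentence terminator."""
    return [part for part in _SENTENCE_BREAK.split(text.strip()) if part]


if __name__ == "__main__":
    sample = "આ પહેલું વાક્ય છે। આ બીજું છે॥ Third one. Done"
    for sentence in split_sentences(sample):
        print(sentence)
```

A splitter that only knows about Latin punctuation would return the two Gujarati sentences as one unit, which is the boundary loss the chunker is designed to avoid.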
Garbled text detection
PDFs that use legacy non-Unicode fonts (common with older Gujarati and Indic-language documents) sometimes produce text dominated by Unicode replacement characters (�). We detect this — if >10% of the extracted text is replacement chars, the upload fails with a clear message:
'document.pdf' appears to use an older non-Unicode font that couldn't be decoded. Re-save it with a Unicode font (e.g. Noto Sans Gujarati for Gujarati docs, or any standard system font).
Workaround: re-export the PDF from the source document with a Unicode font, then re-upload.
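The check described above amounts to a simple ratio test. A minimal sketch, assuming the ratio is taken over all characters of the extracted text (whether whitespace is counted is not stated here) and using hypothetical function names:

```python
REPLACEMENT_CHAR = "\ufffd"  # the Unicode replacement character, shown as �
GARBLED_THRESHOLD = 0.10     # the >10% threshold described above


def replacement_ratio(text: str) -> float:
    """Fraction of characters in text that are U+FFFD."""
    if not text:
        return 0.0
    return text.count(REPLACEMENT_CHAR) / len(text)


def is_garbled(text: str, threshold: float = GARBLED_THRESHOLD) -> bool:
    """True when replacement characters exceed the threshold share of the text."""
    return replacement_ratio(text) > threshold


if __name__ == "__main__":
    print(is_garbled("\ufffd" * 2 + "a" * 8))  # prints True: 20% replacement chars
    print(is_garbled("\ufffd" + "a" * 9))      # prints False: exactly 10% is not over the limit
```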