Multi-language support

Quick summary

You can upload documents in any language. Answers always come back in English — the system prompt instructs Claude to read the source content in whatever language it's written in, then answer in English regardless.

Why English-only answers (today)

DockSense, Paperbrief's predecessor, was used heavily for Gujarati documents by English-speaking analysts. The "read in Gujarati, answer in English" mode was the killer feature — it's the workflow we tested most thoroughly.

If you want answers in a specific non-English language, we can adjust the system prompt per-org; email us. A general "answer in the document's language" mode is on the roadmap.

What's tested

Heavily tested:

Gujarati (ગુજરાતી): DockSense's primary corpus. The Indic-aware chunking and the custom language-detection thresholds are inherited from DockSense.
English: obviously.
Mixed Gujarati + English documents: common in Indian government / business forms.

Lightly tested but should work:

Hindi, Marathi, Bengali, Tamil (other Indic scripts share the chunking patterns)
French, Spanish, German, Portuguese (Latin-script Western European)
Arabic, Hebrew (RTL — chunking works but UI rendering isn't optimised)

Not tested:

CJK (Chinese, Japanese, Korean) — should work because OpenAI embeddings handle them, but the chunker uses character-count thresholds tuned for Latin / Indic and may produce too-large chunks.
Vertical scripts.

Language detection

At ingest time we run langdetect plus a Gujarati script-ratio check on the extracted text. The document gets a language label:

gujarati: Gujarati script density ≥ 18%, OR langdetect says gu with high confidence
mixed: Gujarati script density from 4% up to (but not including) 18%, OR langdetect says gu with weaker confidence
other: anything else (mostly Latin-script docs)
unknown: empty text or detection failed

The label shows up as a badge on the document card in the library. It's metadata only — it doesn't affect retrieval.
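
The labelling rules above can be sketched roughly as follows. This is an illustration, not the production code: the function names, the choice of non-whitespace characters as the denominator for script density, and the 0.9 cutoff for "high confidence" are all assumptions. The detector result is passed in as plain arguments (for example, the top entry from langdetect's detect_langs), so the sketch itself needs only the standard library.

```python
def gujarati_ratio(text):
    """Share of non-whitespace characters in the Gujarati Unicode block (U+0A80 to U+0AFF)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    gujarati = sum(1 for c in chars if "\u0a80" <= c <= "\u0aff")
    return gujarati / len(chars)


def label_language(text, detected_lang=None, detected_prob=0.0, high_conf=0.9):
    """Return 'gujarati', 'mixed', 'other' or 'unknown'.

    detected_lang / detected_prob: the detector's top guess and its probability,
    or None / 0.0 if detection failed. high_conf is a hypothetical cutoff.
    """
    if not text.strip():
        return "unknown"  # empty text
    ratio = gujarati_ratio(text)
    if ratio >= 0.18 or (detected_lang == "gu" and detected_prob >= high_conf):
        return "gujarati"
    if ratio >= 0.04 or detected_lang == "gu":
        return "mixed"
    if detected_lang is None:
        # Assumption: a failed detection only yields 'unknown' when the
        # script-ratio check has not already decided the label.
        return "unknown"
    return "other"
```

For instance, label_language("Hello world", "en", 0.99) returns "other", while a page that is entirely Gujarati script is labelled "gujarati" even if the detector returned nothing.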

Indic-aware chunking

The RecursiveCharacterTextSplitter we use breaks on (in order):

\n\n (paragraph)
\n (line)
। (Gujarati danda — sentence terminator)
॥ (double-danda — verse terminator)
". " (Latin period + space)
", " (Latin comma + space)
" " (any space)
"" (character split as last resort)

Without the danda separators, a long Gujarati paragraph would be split mid-sentence at arbitrary character boundaries — destroying the semantic units retrieval depends on. The fix is small but matters a lot for Indic content.
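
To make the effect of the separator order concrete, here is a simplified stand-in for a recursive splitter, written with the standard library only. It is not LangChain's RecursiveCharacterTextSplitter (it has no chunk overlap and drops the separator at chunk boundaries), and the names SEPARATORS and split_recursive are hypothetical; it only illustrates how trying the danda before the space keeps sentences whole.

```python
# Separator priority from the list above; "\u0964" is the danda, "\u0965" the double danda.
SEPARATORS = ["\n\n", "\n", "\u0964", "\u0965", ". ", ", ", " ", ""]


def split_recursive(text, chunk_size, separators=SEPARATORS):
    """Split text into chunks of at most chunk_size characters.

    Uses the first separator in the list that occurs in the text, merges
    adjacent pieces while they fit, and recurses with the remaining
    separators on any piece that is still too long.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    # Find the highest-priority separator present; "" always matches.
    for i, sep in enumerate(separators):
        if sep == "" or sep in text:
            break
    if sep == "":
        # Last resort: hard split at character boundaries.
        return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]
    rest = separators[i + 1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # piece still fits in the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            chunks.extend(split_recursive(piece, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks
```

With the placeholder text "aaa bbb।ccc ddd।eee fff" and a chunk size of 8, this yields the three sentences "aaa bbb", "ccc ddd" and "eee fff". Removing the two danda entries from the separator list gives "aaa", "bbb।ccc", "ddd।eee", "fff" instead: every middle chunk now straddles a sentence boundary.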

OCR for non-Latin scanned PDFs

Claude Sonnet vision handles non-Latin scripts well — Gujarati / Devanagari OCR quality is roughly on par with Latin. Arabic + CJK are likely similar but we haven't measured.

Embedding quality across languages

OpenAI's text-embedding-3-small is multilingual but English-biased. In practice:

For Latin-script European languages: nearly as good as English
For Gujarati: works well enough for high-precision search; loose matches are noisier than English
For CJK: should be reasonable, untested

If you have a corpus where the embedding quality feels off, tell us — we can compare against text-embedding-3-large (more expensive, better multilingual) for your specific case.
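
One way such a comparison could be set up is sketched below. It assumes you already have, for each model, embedding vectors for a small set of test queries and for the corpus chunks, plus a hand-labelled index of the chunk each query should retrieve. The helper names cosine and recall_at_k are hypothetical, and this is not a description of our internal evaluation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall_at_k(query_vecs, chunk_vecs, relevant, k=5):
    """Fraction of queries whose labelled chunk index appears in the top k results.

    query_vecs: one embedding per test query
    chunk_vecs: one embedding per corpus chunk, from the same model
    relevant:   for each query, the index of the chunk it should retrieve
    """
    if not query_vecs:
        return 0.0
    hits = 0
    for query, target in zip(query_vecs, relevant):
        ranked = sorted(
            range(len(chunk_vecs)),
            key=lambda i: cosine(query, chunk_vecs[i]),
            reverse=True,
        )
        if target in ranked[:k]:
            hits += 1
    return hits / len(query_vecs)
```

Running recall_at_k once with vectors from text-embedding-3-small and once with vectors from text-embedding-3-large, on the same queries and labels, gives two numbers that can be compared directly for that corpus.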
