Multi-language support
Quick summary
You can upload documents in any language. Answers always come back in English: the system prompt instructs Claude to read the source content in whatever language it's written in, then answer in English regardless.
Why English-only answers (today)
DockSense, Paperbrief's predecessor, was used heavily for Gujarati documents by English-speaking analysts. The "read in Gujarati, answer in English" mode was the killer feature, and it's the workflow we tested most thoroughly.
If you want answers in a specific non-English language, we can adjust the system prompt per-org; email us. A general "answer in the document's language" mode is on the roadmap.
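As a rough illustration of what a per-org adjustment could look like, here is a minimal sketch. The function name, the org_answer_language setting and the prompt wording are all hypothetical; this is not the actual Paperbrief system prompt.

```python
from typing import Optional

DEFAULT_ANSWER_LANGUAGE = "English"


def build_system_prompt(org_answer_language: Optional[str] = None) -> str:
    """Return a system prompt that fixes the answer language.

    org_answer_language is a hypothetical per-org setting; None means
    the default English-only behaviour.
    """
    language = org_answer_language or DEFAULT_ANSWER_LANGUAGE
    # Illustrative wording only, not the production prompt.
    return (
        "Read the source documents in whatever language they are written in. "
        f"Always answer in {language}, regardless of the language of the sources."
    )
```

Calling build_system_prompt() keeps the English default, while build_system_prompt("Hindi") would ask for Hindi answers for that org.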
What's tested
Heavily tested:
- Gujarati (ગુજરાતી): DockSense's primary corpus. Indic-aware chunking and custom language-detection thresholds are inherited from it.
- English, obviously.
- Mixed Gujarati + English documents, which are common in Indian government / business forms.
Lightly tested but should work:
- Hindi, Marathi, Bengali, Tamil (other Indic scripts share the chunking patterns)
- French, Spanish, German, Portuguese (Latin-script Western European)
- Arabic, Hebrew (RTL: chunking works but UI rendering isn't optimised)
Not tested:
- CJK (Chinese, Japanese, Korean): should work because OpenAI embeddings handle them, but the chunker uses character-count thresholds tuned for Latin / Indic and may produce too-large chunks.
- Vertical scripts.
Language detection
At ingest time we run langdetect plus a Gujarati script-ratio check on the extracted text. The document gets a language label:
- gujarati: Gujarati script density ≥ 18%, OR langdetect says gu with high confidence
- mixed: Gujarati script density 4–18%, OR langdetect says gu with weaker confidence
- other: anything else (mostly Latin-script docs)
- unknown: empty text or detection failed
The label shows up as a badge on the document card in the library. It's metadata only; it doesn't affect retrieval.
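The labelling rules above can be sketched roughly as follows. The 18% and 4% density thresholds come from the rules above; everything else is an assumption made for illustration: the function names, measuring density over non-whitespace characters, and the high_conf / weak_conf cut-offs (the actual confidence values are not documented here).

```python
from typing import Optional

# Unicode Gujarati block, U+0A80..U+0AFF.
GUJARATI_BLOCK = range(0x0A80, 0x0B00)


def gujarati_density(text: str) -> float:
    """Fraction of non-whitespace characters in the Gujarati block (assumed definition)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(ord(c) in GUJARATI_BLOCK for c in chars) / len(chars)


def label_language(
    text: str,
    detected_lang: Optional[str],  # e.g. "gu" or "en"; None if detection failed
    detected_prob: float,
    high_conf: float = 0.9,  # assumed cut-off, not a documented value
    weak_conf: float = 0.5,  # assumed cut-off, not a documented value
) -> str:
    """Map extracted text plus a detector result to one of the four labels."""
    if not text.strip() or detected_lang is None:
        return "unknown"
    density = gujarati_density(text)
    is_gu = detected_lang == "gu"
    if density >= 0.18 or (is_gu and detected_prob >= high_conf):
        return "gujarati"
    if density >= 0.04 or (is_gu and detected_prob >= weak_conf):
        return "mixed"
    return "other"
```

In this sketch the detector result is passed in rather than computed, so the rule logic stays testable on its own. With langdetect, the top candidate from detect_langs (which exposes lang and prob) could supply detected_lang and detected_prob.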
Indic-aware chunking
The RecursiveCharacterTextSplitter we use breaks on (in order):
- \n\n (paragraph)
- \n (line)
- । (Gujarati danda, sentence terminator)
- ॥ (double danda, verse terminator)
- ". " (Latin period + space)
- ", " (Latin comma + space)
- " " (any space)
- "" (character split as last resort)
Without the danda separators, a long Gujarati paragraph would be split mid-sentence at arbitrary character boundaries, destroying the semantic units retrieval depends on. The fix is small but matters a lot for Indic content.
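To make the effect of the separator order concrete, here is a simplified, pure-Python stand-in for recursive splitting. It is not the library's implementation: unlike RecursiveCharacterTextSplitter it never merges small pieces back together and has no overlap. The function name and the chunk size used in the demo are illustrative assumptions.

```python
from typing import List, Sequence

# Separator priority from the list above; "" means a hard character split.
SEPARATORS = ["\n\n", "\n", "\u0964", "\u0965", ". ", ", ", " ", ""]  # U+0964 danda, U+0965 double danda


def recursive_split(text: str, chunk_size: int, separators: Sequence[str] = SEPARATORS) -> List[str]:
    """Split text into pieces of at most chunk_size characters.

    Tries each separator in order and only falls back to the next one
    for pieces that are still too long.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    for i, sep in enumerate(separators):
        if sep == "":
            # Last resort: cut at fixed character offsets.
            return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]
        if sep in text:
            parts = text.split(sep)
            pieces: List[str] = []
            for k, part in enumerate(parts):
                # Keep the separator attached to the piece it terminates.
                if k < len(parts) - 1:
                    part += sep
                pieces.extend(recursive_split(part, chunk_size, separators[i + 1:]))
            return pieces
    return [text]


if __name__ == "__main__":
    sample = "પહેલું વાક્ય। બીજું વાક્ય।"
    for chunk in recursive_split(sample, chunk_size=15):
        print(repr(chunk))
```

If the two danda entries are removed from SEPARATORS, the same sample falls through to the space separator, which is exactly the mid-sentence splitting described above.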
OCR for non-Latin scanned PDFs
Claude Sonnet vision handles non-Latin scripts well: Gujarati / Devanagari OCR quality is roughly on par with Latin. Arabic + CJK are likely similar but we haven't measured.
Embedding quality across languages
OpenAI's text-embedding-3-small is multilingual but English-biased. In practice:
- For Latin-script European languages: nearly as good as English
- For Gujarati: works well enough for high-precision search; loose matches are noisier than English
- For CJK: should be reasonable, untested
If you have a corpus where the embedding quality feels off, tell us. We can compare against text-embedding-3-large (more expensive, better multilingual) for your specific case.
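One way such a comparison could be run is a small recall@k check over a handful of labelled queries. This is a sketch under assumptions, not our actual evaluation tooling. The embed argument is a hypothetical callable that would wrap an embeddings call for one model, and the names recall_at_k and cosine are illustrative.

```python
import math
from typing import Callable, Dict, Sequence

Vector = Sequence[float]
EmbedFn = Callable[[str], Vector]  # hypothetical wrapper around one embedding model


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_at_k(embed: EmbedFn, chunks: Dict[str, str], queries: Dict[str, str], k: int = 3) -> float:
    """Share of queries whose expected chunk id is among the k most similar chunks.

    chunks maps chunk id -> chunk text; queries maps query text -> expected chunk id.
    """
    if not queries:
        return 0.0
    chunk_vecs = {cid: embed(text) for cid, text in chunks.items()}
    hits = 0
    for query, expected_id in queries.items():
        qv = embed(query)
        ranked = sorted(chunk_vecs, key=lambda cid: cosine(qv, chunk_vecs[cid]), reverse=True)
        hits += expected_id in ranked[:k]
    return hits / len(queries)
```

Running recall_at_k once with a wrapper for text-embedding-3-small and once with one for text-embedding-3-large, over the same chunks and queries, would give two numbers to compare for a specific corpus.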