Paperbrief
Multi-language support

Quick summary

You can upload documents in any language. Answers always come back in English: the system prompt instructs Claude to read the source content in whatever language it's written in, then answer in English regardless.

Why English-only answers (today)

DockSense, Paperbrief's predecessor, was used heavily for Gujarati documents by English-speaking analysts. The "read in Gujarati, answer in English" mode was the killer feature, and it's the workflow we tested most thoroughly.

If you want answers in a specific non-English language, we can adjust the system prompt per org: email us. A general "answer in the document's language" mode is on the roadmap.

What's tested

Heavily tested:

  • Gujarati (ગુજરાતી): DockSense's primary corpus. Paperbrief inherits its Indic-aware chunking and its custom language-detection thresholds.
  • English, obviously.
  • Mixed Gujarati + English documents, common in Indian government and business forms.

Lightly tested but should work:

  • Hindi, Marathi, Bengali, Tamil (other Indic scripts share the chunking patterns)
  • French, Spanish, German, Portuguese (Latin-script Western European)
  • Arabic, Hebrew (RTL: chunking works, but UI rendering isn't optimised)

Not tested:

  • CJK (Chinese, Japanese, Korean): should work because OpenAI embeddings handle them, but the chunker uses character-count thresholds tuned for Latin / Indic text and may produce chunks that are too large.
  • Vertical scripts.

Language detection

At ingest time we run langdetect plus a Gujarati script-ratio check on the extracted text. The document gets a language label:

  • gujarati: Gujarati script density ≥ 18%, OR langdetect says gu with high confidence
  • mixed: Gujarati script density 4–18%, OR langdetect says gu with weaker confidence
  • other: anything else (mostly Latin-script docs)
  • unknown: empty text or detection failed

The label shows up as a badge on the document card in the library. It's metadata only: it doesn't affect retrieval.
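The labelling rules above can be sketched roughly as follows. This is an illustration only, not the production code: the function names, the 0.9 cut-off for "high confidence", the use of non-whitespace characters as the density denominator, and the way a failed detection is handled are all assumptions. The langdetect result is passed in as a (code, probability) pair so the sketch runs without the library installed.

```python
from typing import Optional, Tuple

# Unicode Gujarati block (U+0A80 to U+0AFF).
GUJARATI_START, GUJARATI_END = 0x0A80, 0x0AFF


def gujarati_density(text: str) -> float:
    """Share of non-whitespace characters that fall in the Gujarati block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars if GUJARATI_START <= ord(c) <= GUJARATI_END)
    return hits / len(chars)


def label_language(
    text: str,
    detected: Optional[Tuple[str, float]],
    high_conf: float = 0.9,  # assumed cut-off; the real threshold is not documented here
) -> str:
    """Return 'gujarati', 'mixed', 'other' or 'unknown' for extracted text.

    `detected` is a (language code, probability) pair from a language
    detector, or None if detection failed.
    """
    if not text.strip():
        return "unknown"
    density = gujarati_density(text)
    if density >= 0.18:
        return "gujarati"
    if detected is None:
        # Assumption: fall back to the script ratio alone when detection fails.
        return "mixed" if density >= 0.04 else "unknown"
    lang, prob = detected
    if lang == "gu" and prob >= high_conf:
        return "gujarati"
    if density >= 0.04 or lang == "gu":
        return "mixed"
    return "other"
```

In practice the `detected` pair could be taken from the top candidate returned by langdetect's `detect_langs(text)`, whose results carry `lang` and `prob` attributes; passing `None` stands in for a detection failure.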

Indic-aware chunking

The RecursiveCharacterTextSplitter we use breaks on (in order):

  1. \n\n (paragraph)
  2. \n (line)
  3. । (Gujarati danda, sentence terminator)
  4. ॥ (double danda, verse terminator)
  5. ". " (Latin period + space)
  6. ", " (Latin comma + space)
  7. " " (any space)
  8. "" (character split as last resort)

Without the danda separators, a long Gujarati paragraph would be split mid-sentence at whatever space (or, failing that, character) happens to fall near the size limit, destroying the semantic units retrieval depends on. The fix is small but matters a lot for Indic content.
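To illustrate why the separator order matters, here is a simplified, self-contained stand-in for recursive splitting. It is not the library's RecursiveCharacterTextSplitter (there is no overlap and no custom length function); `split_recursive`, `max_chars` and the 12-character limit are hypothetical and chosen only to keep the demonstration small.

```python
from typing import List, Sequence

# Separator priority from the list above; "" means "split anywhere".
SEPARATORS = ["\n\n", "\n", "\u0964", "\u0965", ". ", ", ", " ", ""]
#                           danda     double danda


def split_recursive(text: str, max_chars: int,
                    separators: Sequence[str] = SEPARATORS) -> List[str]:
    """Split text into chunks of at most max_chars, trying separators in order."""
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character offsets.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return split_recursive(text, max_chars, rest)

    parts = text.split(sep)
    # Keep each separator attached to the piece it terminates.
    pieces = [p + sep for p in parts[:-1]] + [parts[-1]]

    chunks: List[str] = []
    current = ""
    for piece in pieces:
        if not piece:
            continue
        if len(current) + len(piece) <= max_chars:
            current += piece  # greedily pack pieces into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            # Piece is still too long: retry with lower-priority separators.
            chunks.extend(split_recursive(piece, max_chars, rest))
    if current:
        chunks.append(current)
    return chunks


# Three 10-character "sentences", each terminated by a danda, no spaces.
sample = ("\u0A95" * 10 + "\u0964") + ("\u0A96" * 10 + "\u0964") + ("\u0A97" * 10 + "\u0964")

with_danda = split_recursive(sample, max_chars=12)
no_danda_seps = [s for s in SEPARATORS if s not in ("\u0964", "\u0965")]
without_danda = split_recursive(sample, max_chars=12, separators=no_danda_seps)
```

With the danda in the list, every chunk in `with_danda` ends on a sentence terminator. With it removed, this space-free sample falls through to the character split and the first chunk in `without_danda` runs into the second sentence.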

OCR for non-Latin scanned PDFs

Claude Sonnet vision handles non-Latin scripts well: Gujarati / Devanagari OCR quality is roughly on par with Latin. Arabic and CJK are likely similar, but we haven't measured them.

Embedding quality across languages

OpenAI's text-embedding-3-small is multilingual but English-biased. In practice:

  • For Latin-script European languages: nearly as good as English
  • For Gujarati: works well enough for high-precision search; loose matches are noisier than English
  • For CJK: should be reasonable, untested

If you have a corpus where the embedding quality feels off, tell us. We can compare against text-embedding-3-large (more expensive, better multilingual) for your specific case.
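One way to turn "feels off" into a number is to label a handful of queries with the chunk each should retrieve, then measure how often that chunk lands in the top k results under each embedding model. The sketch below is a hypothetical harness, not how Paperbrief runs the comparison: `hit_rate_at_k` and the `Embedder` type are made-up names, and the embedding call is injected as a plain function so no API access is needed to run it.

```python
import math
from typing import Callable, Dict, List, Sequence

# Any function that maps texts to vectors, e.g. a wrapper around an embeddings API.
Embedder = Callable[[Sequence[str]], List[List[float]]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hit_rate_at_k(embed: Embedder, chunks: Sequence[str],
                  labelled_queries: Dict[str, int], k: int = 3) -> float:
    """Fraction of queries whose expected chunk index appears in the top k.

    labelled_queries maps query text to the index (into `chunks`) of the
    chunk a correct search should return.
    """
    queries = list(labelled_queries)
    if not queries:
        return 0.0
    chunk_vecs = embed(chunks)
    query_vecs = embed(queries)
    hits = 0
    for query, q_vec in zip(queries, query_vecs):
        ranked = sorted(range(len(chunks)),
                        key=lambda i: cosine(q_vec, chunk_vecs[i]),
                        reverse=True)
        if labelled_queries[query] in ranked[:k]:
            hits += 1
    return hits / len(queries)
```

Running this once with a wrapper for text-embedding-3-small and once with a wrapper for text-embedding-3-large, over the same chunks and labelled queries, gives two hit rates that can be compared directly.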