How OGuardAI processes PDF files: quality tiers, preprocessing tips, and known limitations

OGuardAI processes PDF files through POST /v1/transform/file. Text extraction quality varies significantly depending on PDF type. This document describes what to expect at each tier and how to get the best results.

Quality Tiers

Tier 1: Text-Based PDFs (Highest Quality)

PDFs created by word processors, report generators, or "Save as PDF" workflows contain embedded text. OGuardAI extracts this text directly using pdf-extract, preserving all characters exactly as authored. PII detection is reliable.

Expect: Near-perfect extraction. All emails, phone numbers, SSNs, IBANs, and names are detected accurately.

Tier 2: Scanned PDFs with Clean OCR (Good Quality)

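Scanned documents contain page images rather than embedded text, so they require Tesseract OCR (a separate dependency), which is applied through /v1/transform/image rather than /v1/transform/file (see the note below). Quality depends on scan resolution and image clarity. Scans at 300 DPI or higher with black text on a white background yield the best results.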

Expect: Good extraction for clean scans. Some character substitutions may occur (e.g., 0 vs O, 1 vs l), which can cause missed PII.
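To make the failure mode concrete, the sketch below runs a digit-only pattern against a clean string and against the same string with one OCR substitution. The SSN regex is an illustrative assumption, not OGuardAI's actual detector.

```python
import re

# Illustrative US SSN pattern (assumption; not OGuardAI's detector).
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_ssns(text: str) -> list[str]:
    """Return every substring that matches the illustrative SSN pattern."""
    return SSN_PATTERN.findall(text)


clean = "SSN: 123-45-6709"
ocr_output = "SSN: 123-45-67O9"  # OCR read the zero as a capital O

print(find_ssns(clean))       # ['123-45-6709']
print(find_ssns(ocr_output))  # []
```

A single misread character is enough to make a strict pattern miss the value, which is why scan quality matters for detection.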

Note: The current pipeline does not automatically fall back to OCR for scanned PDFs submitted via /v1/transform/file. Scanned PDFs with no embedded text will return empty or garbled text from the text-extraction layer. To process scanned pages, convert them to images first and use /v1/transform/image.
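One possible way to do the conversion is to rasterize the pages with an external tool before upload. The sketch below assumes Poppler's pdftoppm is installed and on the PATH; the helper names and the "page" output prefix are illustrative. The upload to /v1/transform/image is left out because that endpoint's request fields are not described in this document.

```python
import subprocess
from pathlib import Path


def build_rasterize_command(pdf_path: str, out_prefix: str, dpi: int = 300) -> list[str]:
    """Build a pdftoppm command that renders each page as a grayscale PNG."""
    if dpi < 300:
        # Mirrors the 300 DPI minimum recommended for OCR in this document.
        raise ValueError("use at least 300 DPI for OCR-friendly output")
    return ["pdftoppm", "-r", str(dpi), "-gray", "-png", pdf_path, out_prefix]


def rasterize(pdf_path: str, out_dir: str, dpi: int = 300) -> list[Path]:
    """Run pdftoppm and return the generated page images in page order."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    prefix = str(out / "page")  # pdftoppm appends -<page number>.png
    subprocess.run(build_rasterize_command(pdf_path, prefix, dpi), check=True)
    return sorted(out.glob("page-*.png"))
```

Calling rasterize("scan.pdf", "pages") would write one PNG per page into the pages directory, ready to be submitted one at a time.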

Tier 3: Mixed PDFs (Variable Quality)

Some PDFs contain both text-based and scanned pages. The pdf-extract library extracts embedded text from all pages, but scanned pages yield empty or garbled output. There is currently no automatic per-page OCR fallback.

Expect: Text pages extract well. Scanned pages may produce no usable text. PII on scanned pages may be missed entirely.

Tier 4: Complex Layouts (Limited)

Multi-column layouts, tables, forms, and overlapping text boxes often produce merged or reordered text. Handwriting is not supported.

Expect: Text is extracted but word order may be wrong. Table cells may merge across columns. PII split across layout regions may not be detected.

Preprocessing Recommendations

Setting	Recommendation
Resolution	300 DPI minimum, 600 DPI for small text
Color	Grayscale or black-on-white for OCR
Orientation	Correct rotation before upload
Format	Use text-based PDF when possible (not scanned images)
File size	Keep uploads under the server's configurable upload size limit
Multi-column	Convert to single-column text before upload if feasible

API Usage

Transform a PDF file:

curl -F "file=@document.pdf" http://localhost:3000/v1/transform/file

With optional policy and language:

curl -F "file=@document.pdf" \
     -F "policy=strict-pii" \
     -F "language=de" \
     http://localhost:3000/v1/transform/file
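The same request can be made from a script. The following is a minimal sketch using only the Python standard library. It assumes the endpoint returns JSON and reuses the field names from the curl examples (file, policy, language); the helper names are illustrative.

```python
import json
import urllib.request
import uuid
from pathlib import Path
from typing import Optional

# Base URL taken from the curl examples; adjust for your deployment.
BASE_URL = "http://localhost:3000"


def encode_multipart(fields: dict[str, str], filename: str, content: bytes) -> tuple[bytes, str]:
    """Encode text fields plus one 'file' part as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    file_header = (
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\nContent-Type: application/pdf\r\n\r\n'
    )
    parts.append(file_header.encode() + content + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transform_pdf(path: str, policy: Optional[str] = None, language: Optional[str] = None) -> dict:
    """POST a PDF to /v1/transform/file and return the parsed response."""
    fields = {k: v for k, v in {"policy": policy, "language": language}.items() if v}
    body, content_type = encode_multipart(fields, Path(path).name, Path(path).read_bytes())
    request = urllib.request.Request(
        f"{BASE_URL}/v1/transform/file",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)  # assumes a JSON response body
```

For example, transform_pdf("document.pdf", policy="strict-pii", language="de") mirrors the second curl command above.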

Response Fields

The file transform response includes document metadata alongside the standard transform output:

Field	Description
`safe_text`	Transformed text with PII replaced by tokens
`entities`	Array of detected entities with types and spans
`session_id`	Session identifier for subsequent rehydration
`session_state`	Encrypted session blob for rehydration
`original_format`	Detected format (e.g., `Pdf`, `Docx`, `Txt`)
`original_size`	Original file size in bytes
`stats`	Processing statistics

Note on quality indicators: The response does not currently include an extraction quality score or confidence metric for PDF text extraction. The original_format field confirms the file was recognized as a PDF, and original_size can be compared with the amount of extracted text: a file that yields little or no text for its size may be an image-only scan. For image-based processing via /v1/transform/image, OCR confidence scores are available per-word in the bounding box output.
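Because no quality score is returned, a client can derive a rough signal from the documented fields. The sketch below flags PDF responses that contain little or no text relative to the file size. The function name and the threshold of one character per KiB are assumptions chosen for illustration, not documented values.

```python
def looks_like_image_only_pdf(response: dict, min_chars_per_kib: float = 1.0) -> bool:
    """Flag PDF responses whose text yield is very low for the file size.

    The threshold is a hypothetical starting point; tune it on real documents.
    """
    if response.get("original_format") != "Pdf":
        return False
    text = response.get("safe_text", "").strip()
    if not text:
        return True  # nothing extracted at all
    # Treat anything under 1 KiB as 1 KiB to avoid dividing by a tiny number.
    size_kib = max(response.get("original_size", 0) / 1024, 1.0)
    return len(text) / size_kib < min_chars_per_kib
```

A flagged file is a candidate for conversion to images and resubmission through /v1/transform/image. The check is only a heuristic: a text-based PDF with large embedded figures can also trip it.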

Known Limitations

No automatic OCR fallback for scanned PDFs. The /v1/transform/file endpoint uses pdf-extract for text extraction only. Scanned pages with no embedded text produce empty results. Use /v1/transform/image for scans.
Handwriting is not supported. Tesseract OCR does not reliably read handwritten text.
Multi-column text may merge. Column boundaries are not detected; text from adjacent columns may interleave.
Password-protected PDFs are rejected. Encrypted PDFs return a parse error.
No per-page quality reporting. The response does not indicate which pages extracted cleanly and which did not.
AcroForm field values may be missing. Values entered in PDF forms (AcroForms) may not be extracted, depending on how the form was authored.
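Some of these cases can often be spotted before upload. The sketch below scans the raw bytes for PDF markers: an /Encrypt entry suggests a password-protected file, and an /AcroForm entry suggests form fields. This is a client-side heuristic with an illustrative function name, not part of OGuardAI. The byte search can miss an /AcroForm entry stored in a compressed object stream and can produce false positives when the marker appears in ordinary content.

```python
def preflight_pdf(data: bytes) -> list[str]:
    """Return warnings for problems detectable without parsing the PDF."""
    warnings = []
    # Many readers accept the header anywhere in the first 1024 bytes.
    if b"%PDF-" not in data[:1024]:
        warnings.append("missing %PDF- header; file may not be a PDF")
    if b"/Encrypt" in data:
        warnings.append("contains an /Encrypt entry; file is probably password-protected")
    if b"/AcroForm" in data:
        warnings.append("contains an /AcroForm entry; form field values may not be extracted")
    return warnings
```

An empty list means none of the markers were found. It does not guarantee that extraction will succeed.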

Future Improvements

Planned work includes automatic OCR fallback for pages with no extractable text (hybrid PDF processing) and per-page extraction quality scores in the response.
