PDF Support
How OGuardAI processes PDF files: quality tiers, preprocessing tips, and known limitations
OGuardAI processes PDF files through POST /v1/transform/file. Text extraction
quality varies significantly depending on PDF type. This document describes what
to expect at each tier and how to get the best results.
Quality Tiers
Tier 1: Text-Based PDFs (Highest Quality)
PDFs created by word processors, report generators, or "Save as PDF" workflows
contain embedded text. OGuardAI extracts this text directly using pdf-extract,
preserving all characters exactly as authored. PII detection is reliable.
Expect: Near-perfect extraction. All emails, phone numbers, SSNs, IBANs, and names are detected accurately.
Tier 2: Scanned PDFs with Clean OCR (Good Quality)
Scanned documents require Tesseract OCR (a separate dependency). Quality depends on scan resolution and image clarity. 300+ DPI with black text on white background yields the best results.
Expect: Good extraction for clean scans. Some character substitutions may
occur (e.g., 0 vs O, 1 vs l), which can cause missed PII.
Note: The current pipeline does not automatically fall back to OCR for
scanned PDFs submitted via /v1/transform/file. Scanned PDFs with no embedded
text will return empty or garbled text from the text-extraction layer. To process
scanned pages, convert them to images first and use /v1/transform/image.
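The character substitutions noted for this tier (0 vs O, 1 vs l) are one reason PII can be missed after OCR. The following is a minimal sketch, not part of OGuardAI, of how a caller might undo such confusions inside digit-heavy tokens before running its own checks on OCR output. The function name, the confusion table, and the "mostly digits" threshold are all assumptions chosen for illustration.

```python
import re

# Hypothetical confusion table: letters that OCR may emit in place of digits.
OCR_DIGIT_CONFUSIONS = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1"})

# A run of digits, confusable letters, and the separators "-" and ".".
TOKEN = re.compile(r"[0-9OolI][0-9OolI\-.]*[0-9OolI]")


def normalize_digit_runs(text: str, min_digit_ratio: float = 0.5) -> str:
    """Replace likely OCR confusions, but only inside tokens that are mostly digits."""

    def fix(match: re.Match) -> str:
        token = match.group(0)
        significant = [c for c in token if c.isalnum()]
        digits = sum(c.isdigit() for c in significant)
        # Leave ordinary words alone: only rewrite when digits dominate the token.
        if significant and digits / len(significant) >= min_digit_ratio:
            return token.translate(OCR_DIGIT_CONFUSIONS)
        return token

    return TOKEN.sub(fix, text)
```

With these assumptions, `normalize_digit_runs("SSN: 123-45-67O9")` rewrites the stray letter O to a zero, while a word such as "Hello" is left untouched because it contains no digits. A threshold this simple will still make mistakes, so treat it as a starting point rather than a fix.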
Tier 3: Mixed PDFs (Variable Quality)
Some PDFs contain both text-based and scanned pages. The pdf-extract library
extracts embedded text from all pages, but scanned pages yield empty or garbled
output. There is currently no automatic per-page OCR fallback.
Expect: Text pages extract well. Scanned pages may produce no usable text. PII on scanned pages may be missed entirely.
Tier 4: Complex Layouts (Limited)
Multi-column layouts, tables, forms, and overlapping text boxes often produce merged or reordered text. Handwriting is not supported.
Expect: Text is extracted but word order may be wrong. Table cells may merge across columns. PII split across layout regions may not be detected.
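To see why reordered text hurts detection, consider a hypothetical two-column page whose lines are emitted row by row rather than column by column. The sketch below simulates that with plain strings. The sample text and the deliberately simple email pattern are illustrative assumptions, not OGuardAI's detector.

```python
import re
from itertools import zip_longest

# Deliberately simple email pattern, for this demonstration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

left_column = ["Contact the account owner at", "jane.doe@", "example.com for details."]
right_column = ["Invoice 2024-017", "Amount due:", "EUR 1,250.00"]


def read_by_column(left, right):
    """Reading order a human expects: the whole left column, then the right."""
    return " ".join(left + right)


def read_by_row(left, right):
    """Reading order when column boundaries are not detected."""
    rows = zip_longest(left, right, fillvalue="")
    return " ".join(f"{l} {r}".strip() for l, r in rows)


def find_emails(text):
    # Drop whitespace after "@" so an address wrapped across lines is rejoined.
    return EMAIL.findall(re.sub(r"@\s+", "@", text))
```

In column order the wrapped address is rejoined and found. In row order the right-hand column's "Amount due:" lands between the two halves of the address, so the same pattern finds nothing.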
Preprocessing Recommendations
| Setting | Recommendation |
|---|---|
| Resolution | 300 DPI minimum, 600 DPI for small text |
| Color | Grayscale or black-on-white for OCR |
| Orientation | Correct rotation before upload |
| Format | Use text-based PDF when possible (not scanned images) |
| File size | Server enforces a configurable upload size limit |
| Multi-column | Convert to single-column text before upload if feasible |
API Usage
Transform a PDF file:

    curl -F "file=@document.pdf" http://localhost:3000/v1/transform/file

With optional policy and language:

    curl -F "file=@document.pdf" \
      -F "policy=strict-pii" \
      -F "language=de" \
      http://localhost:3000/v1/transform/file

Response Fields
The file transform response includes document metadata alongside the standard transform output:
| Field | Description |
|---|---|
| safe_text | Transformed text with PII replaced by tokens |
| entities | Array of detected entities with types and spans |
| session_id | Session identifier for subsequent rehydration |
| session_state | Encrypted session blob for rehydration |
| original_format | Detected format (e.g., Pdf, Docx, Txt) |
| original_size | Original file size in bytes |
| stats | Processing statistics |
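
For callers who prefer a script over curl, here is a minimal Python sketch that uses only the standard library. It assumes the endpoint returns a JSON object with the fields listed above and that the multipart field names match the curl examples (file, policy, language). The helper names build_multipart and transform_file are hypothetical.

```python
import json
import urllib.request
import uuid
from pathlib import Path


def build_multipart(fields: dict, file_name: str, file_bytes: bytes):
    """Encode form fields plus one file as multipart/form-data.

    Returns (body, content_type). The file is sent under the field name
    "file", matching the curl examples.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode()
        )
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{file_name}"\r\nContent-Type: application/pdf\r\n\r\n'.encode()
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transform_file(path: str, base_url: str = "http://localhost:3000", **fields) -> dict:
    """POST a PDF to /v1/transform/file and return the parsed response.

    Assumes the response body is JSON; adjust if your deployment differs.
    """
    pdf = Path(path)
    body, content_type = build_multipart(fields, pdf.name, pdf.read_bytes())
    request = urllib.request.Request(
        f"{base_url}/v1/transform/file",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `transform_file("document.pdf", policy="strict-pii", language="de")` mirrors the second curl example. The returned dictionary can then be read by field name, for example `result["original_format"]` or `result["safe_text"]`.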
Note on quality indicators: The response does not currently include an
extraction quality score or confidence metric for PDF text extraction. The
original_format field confirms the file was recognized as a PDF, and
original_size can help identify suspiciously small files (which may be
image-only scans). For image-based processing via /v1/transform/image, OCR
confidence scores are available per-word in the bounding box output.
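Until a quality score exists, callers can approximate one from the fields that are returned. The sketch below flags PDF responses whose safe_text is empty or very short relative to original_size, which is one possible sign of an image-only PDF. The function name and both thresholds are assumptions to tune against your own documents, and the length of the tokenized safe_text is only a rough proxy for how much text was extracted. Garbled output of normal length is not caught by this check.

```python
def likely_needs_ocr(
    response: dict, min_chars: int = 20, min_chars_per_kb: float = 5.0
) -> bool:
    """Guess whether a PDF response came from a file with no usable embedded text.

    Both thresholds are illustrative assumptions, not values defined by OGuardAI.
    """
    if response.get("original_format") != "Pdf":
        return False  # The heuristic only applies to PDFs.

    text = (response.get("safe_text") or "").strip()
    if len(text) < min_chars:
        return True  # Empty or near-empty extraction.

    # Compare extracted characters with file size; treat tiny files as 1 KB.
    size_kb = max(response.get("original_size", 0) / 1024, 1.0)
    return len(text) / size_kb < min_chars_per_kb
```

A flagged file could then be converted to page images and sent to /v1/transform/image, as recommended for scanned PDFs.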
Known Limitations
- No automatic OCR fallback for scanned PDFs. The /v1/transform/file endpoint uses pdf-extract for text extraction only. Scanned pages with no embedded text produce empty results. Use /v1/transform/image for scans.
- Handwriting is not supported. Tesseract OCR does not reliably read handwritten text.
- Multi-column text may merge. Column boundaries are not detected; text from adjacent columns may interleave.
- Password-protected PDFs are rejected. Encrypted PDFs return a parse error.
- No per-page quality reporting. There is no way to know which pages extracted cleanly and which did not.
- Form field values may be missing. Values in PDF forms (AcroForms) may not be extracted, depending on how the form was authored.
Future Improvements
Planned work includes automatic OCR fallback for pages with no extractable text (hybrid PDF processing) and per-page extraction quality scores in the response.