# RAG Document Pipeline
How a company safely ingests internal documents into a RAG system so the vector store and LLM never see employee or customer PII.
## The Situation
A 2,000-person technology company wants to build an internal knowledge assistant. The knowledge base includes HR policies, employee handbooks, customer contracts, support ticket archives, and internal memos. These documents are full of names, email addresses, phone numbers, employee IDs, customer account numbers, and salary figures.
The engineering team builds a standard RAG pipeline: chunk documents, embed them, store vectors, retrieve relevant chunks at query time, and pass them to an LLM for answer generation. The security review blocks deployment. The vector store (hosted on a managed service) would contain raw PII in the chunk text. The LLM provider would receive PII in retrieved context. Every query could leak sensitive data.
## The Solution
OGuardAI is inserted at two points in the RAG pipeline: document ingestion (tokenize before embedding) and query time (transform the question, then rehydrate the answer). The vector store only ever contains tokenized text. The LLM only ever sees tokenized context.
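In code, both integration points reduce to two HTTP calls. Below is a minimal sketch of client helpers in Python, assuming the `/v1/transform` and `/v1/rehydrate` endpoints shown in the steps that follow; the response field names (`output`, `session_state`) are our assumption, not a confirmed API shape:

```python
# Thin client helpers for the two OGuardAI integration points.
# Assumed: request/response shapes follow the curl examples below;
# the "output" / "session_state" field names are assumptions.
import requests

OGUARD_URL = "http://localhost:3000"

def transform(text: str, policy: str) -> dict:
    """Tokenize PII in `text` under the named policy."""
    resp = requests.post(f"{OGUARD_URL}/v1/transform",
                         json={"input": text, "policy": policy})
    resp.raise_for_status()
    return resp.json()  # assumed: {"output": <tokenized>, "session_state": <blob>}

def rehydrate(output: str, session_state: str, channel: str) -> str:
    """Restore tokens in an LLM answer according to the policy's channel rules."""
    resp = requests.post(f"{OGUARD_URL}/v1/rehydrate",
                         json={"output": output,
                               "session_state": session_state,
                               "output_channel": channel})
    resp.raise_for_status()
    return resp.json()["output"]  # assumed field name
```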
## The 4-Step RAG Flow
### Step 1: Document Ingestion (Tokenize Before Embedding)
A document from the support archive is being ingested:

```
From: Sarah Chen <sarah.chen@acmecorp.com>
To: Support Team
Subject: Account #AC-2026-8834 - Billing dispute
Hi team,
Customer James Rodriguez (james.r@outlook.com, phone 415-555-0192)
called about a double charge of $847.50 on invoice INV-44021.
His employee contact at our company is Mike Thompson (ext. 4421).
Please resolve within 24 hours per our SLA.
Sarah Chen
Customer Success Manager
```

The ingestion service sends the document to OGuardAI:

```bash
curl -X POST http://localhost:3000/v1/transform \
  -H "Content-Type: application/json" \
  -d '{
    "input": "From: Sarah Chen <sarah.chen@acmecorp.com>\nTo: Support Team\nSubject: Account #AC-2026-8834 - Billing dispute\n\nHi team,\n\nCustomer James Rodriguez (james.r@outlook.com, phone 415-555-0192) called about a double charge of $847.50 on invoice INV-44021. His employee contact at our company is Mike Thompson (ext. 4421). Please resolve within 24 hours per our SLA.\n\nSarah Chen\nCustomer Success Manager",
    "policy": "rag-ingest"
  }'
```

Detected entities:
| Original Value | Entity Type | Token |
|---|---|---|
| Sarah Chen | person | {{person:p_001}} |
| sarah.chen@acmecorp.com | email | {{email:e_001}} |
| AC-2026-8834 | customer_id | {{customer_id:c_001}} |
| James Rodriguez | person | {{person:p_002}} |
| james.r@outlook.com | email | {{email:e_002}} |
| 415-555-0192 | phone | {{phone:ph_001}} |
| Mike Thompson | person | {{person:p_003}} |
The tokenized text stored in the vector database:

```
From: {{person:p_001}} <{{email:e_001}}>
To: Support Team
Subject: Account #{{customer_id:c_001}} - Billing dispute
Hi team,
Customer {{person:p_002}} ({{email:e_002}}, phone {{phone:ph_001}})
called about a double charge of $847.50 on invoice INV-44021.
His employee contact at our company is {{person:p_003}} (ext. 4421).
Please resolve within 24 hours per our SLA.
{{person:p_001}}
Customer Success Manager
```

The session state blob is stored alongside the document chunk metadata in the ingestion database (not in the vector store).
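Wired into an ingestion worker, Step 1 looks roughly like the sketch below, reusing the `transform` helper above. The `embed`, `vector_store`, and `blob_db` names are hypothetical stand-ins for whatever embedding model, vector database, and internal datastore the team actually uses:

```python
# Step 1 in an ingestion worker: tokenize BEFORE embedding, so raw PII
# never reaches the embedding model or the managed vector store.
# `embed`, `vector_store`, and `blob_db` are hypothetical stand-ins.
def ingest_chunk(chunk_id: str, raw_text: str) -> None:
    result = transform(raw_text, policy="rag-ingest")
    tokenized = result["output"]

    # Embed and store only the tokenized text.
    vector_store.upsert(id=chunk_id,
                        vector=embed(tokenized),
                        metadata={"text": tokenized})

    # Session state stays in the company's own ingestion DB,
    # never in the managed vector store.
    blob_db.put(chunk_id, result["session_state"])
```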
### Step 2: Query Transformation
An employee asks the knowledge assistant: "What happened with the billing dispute from James Rodriguez?"
The query is transformed before retrieval:

```bash
curl -X POST http://localhost:3000/v1/transform \
  -H "Content-Type: application/json" \
  -d '{
    "input": "What happened with the billing dispute from James Rodriguez?",
    "policy": "rag-query"
  }'
```

Result:

```
What happened with the billing dispute from {{person:q_001}}?
```

The vector search uses the tokenized query. Because the ingested chunks also contain `{{person:...}}` tokens, semantic similarity still works -- the embedding model captures the structural pattern around "billing dispute" and the person token.
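At query time the same helper is applied to the question before retrieval. A sketch, again using the hypothetical `embed` and `vector_store` stand-ins from the ingestion sketch:

```python
# Step 2: tokenize the question with the query policy, then search
# with the tokenized form so query tokens line up with chunk tokens.
def retrieve(question: str, k: int = 5):
    q = transform(question, policy="rag-query")
    hits = vector_store.search(embed(q["output"]), top_k=k)
    return q, hits  # hits carry tokenized chunk text in their metadata
```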
### Step 3: Context Assembly
The retrieval system finds the matching chunk (from Step 1). The tokenized chunk is passed directly to the LLM as context -- it is already safe. No additional transformation is needed for the retrieved context because it was tokenized at ingestion time.
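Context assembly is therefore plain concatenation over the retrieved, already-tokenized chunks. A sketch, with the prompt template purely illustrative:

```python
# Step 3: retrieved chunks are already tokenized, so context assembly is
# simple concatenation -- no second transform pass is needed.
def build_prompt(tokenized_question: str, hits) -> str:
    context = "\n\n---\n\n".join(h.metadata["text"] for h in hits)
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {tokenized_question}")
```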
### Step 4: Answer Rehydration
The LLM generates an answer using the tokenized context:

```
Based on the support ticket, {{person:p_002}} reported a double charge
of $847.50 on invoice INV-44021 for account {{customer_id:c_001}}.
The ticket was assigned to {{person:p_003}} for resolution within
24 hours per SLA. The original report was filed by {{person:p_001}}
from the Customer Success team.
```

The answer is rehydrated for the requesting employee:
```bash
curl -X POST http://localhost:3000/v1/rehydrate \
  -H "Content-Type: application/json" \
  -d '{
    "output": "<LLM answer with tokens>",
    "session_state": "<encrypted-blob-from-ingest>",
    "output_channel": "internal_employee"
  }'
```

The employee sees the fully restored answer:

```
Based on the support ticket, James Rodriguez reported a double charge
of $847.50 on invoice INV-44021 for account AC-2026-8834. The ticket
was assigned to Mike Thompson for resolution within 24 hours per SLA.
The original report was filed by Sarah Chen from the Customer Success team.
```
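Putting the steps together, the answer path looks roughly like this. One simplifying assumption in the sketch: a single matching chunk, hence a single session blob; how session state from several retrieved chunks is combined is deployment-specific. The `llm` client is a stand-in:

```python
# Steps 2-4 end to end: transform the query, retrieve, generate, rehydrate.
# Simplification: one matching chunk, hence one session blob.
def answer(question: str, channel: str = "internal_employee") -> str:
    q, hits = retrieve(question)
    tokenized_answer = llm.complete(build_prompt(q["output"], hits))
    session_state = blob_db.get(hits[0].id)  # blob stored at ingestion time
    return rehydrate(tokenized_answer, session_state, channel=channel)
```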
## Policy Configuration

```yaml
name: rag-ingest
version: "1.0"
rules:
  - entity_type: person
    action: tokenize
    restore_mode: full
  - entity_type: email
    action: tokenize
    restore_mode: full
  - entity_type: phone
    action: tokenize
    restore_mode: masked
  - entity_type: customer_id
    action: tokenize
    restore_mode: full
  - entity_type: ssn
    action: block
channel_rules:
  internal_employee:
    person: { restore_mode: full }
    email: { restore_mode: full }
    phone: { restore_mode: masked }
    customer_id: { restore_mode: full }
  external_customer:
    person: { restore_mode: partial }
    email: { restore_mode: none }
    phone: { restore_mode: none }
    customer_id: { restore_mode: masked }
```
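Under `channel_rules`, one and the same tokenized answer renders differently per audience. An illustrative use of the hypothetical `rehydrate` helper above; the renderings described in the comments are invented for illustration, since actual masked and partial formats depend on the deployment:

```python
# One tokenized answer, two audiences -- channel rules decide what comes back.
internal = rehydrate(tokenized_answer, session_state, channel="internal_employee")
# e.g. full names and account ID restored, phone masked per policy

external = rehydrate(tokenized_answer, session_state, channel="external_customer")
# e.g. partial name only, account ID masked, email and phone withheld
```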
## What OGuardAI Made Possible

**Safe vector storage.** The managed vector store service contains zero PII. If the vector store is breached, attackers find only semantic tokens that cannot be reversed without the encrypted session blobs stored separately in the company's infrastructure.

**End-to-end protection.** PII is tokenized once at ingestion and stays tokenized through retrieval, context assembly, and LLM processing. Real values are restored only at the final step, only for authorized users, only according to policy.

**Semantic search still works.** Tokenized text preserves document structure and surrounding context. The embedding model captures the meaning around tokens ("billing dispute," "double charge," "SLA"), so retrieval quality is maintained.

**Blocked sensitive categories.** SSN values in HR documents are blocked entirely at ingestion -- they never enter the vector store, not even as tokens. The policy enforces this for the entire document corpus.