Detailed data flow diagrams for every major OGuardAI operation, showing where raw PII exists and what crosses the trust boundary

This document describes the data flow for every major operation in OGuardAI. Each diagram shows where raw PII exists (Trusted Zone only) and what data crosses the trust boundary.

1. Transform Flow

The transform flow takes raw user input, detects and replaces sensitive entities with semantic tokens, and returns safe text with an encrypted session blob.

Client Request
  |  { input: "Contact julia@firma.de", policy: "default" }
  |
  v
+---------------------------------------------------------------+
|                        TRUSTED ZONE                            |
|                                                                |
|  +----------+    Request validated, API key / JWT checked      |
|  |   Auth   |--> Scopes verified (requires: Transform scope)  |
|  +----+-----+                                                  |
|       |                                                        |
|       v                                                        |
|  +------------------+   6 regex patterns scanned               |
|  | Prompt Security   |--> Action: Allow / Strip / Block        |
|  +----+-------------+                                          |
|       |                                                        |
|       v                                                        |
|  +------------------+   Builtin regex (30+ patterns)            |
|  | Detect Entities   | + Python NER sidecar (optional)         |
|  |                   |--> Entities: [{email, julia@firma.de,   |
|  |                   |      span: {8,24}, confidence: 0.95}]   |
|  +----+-------------+                                          |
|       |                                                        |
|       v                                                        |
|  +------------------+   Deterministic ID assignment            |
|  |    Tokenize       |   Sort by (start, end, type, value)     |
|  |                   |--> julia@firma.de -> `{{email:e_001}}`    |
|  +----+-------------+                                          |
|       |                                                        |
|       v                                                        |
|  +------------------+   Per-entity-type action lookup          |
|  | Policy Evaluate   |   Level 1: block/remove/hard_mask       |
|  |                   |   Level 2: tokenize (reversible)        |
|  +----+-------------+                                          |
|       |                                                        |
|       v                                                        |
|  +------------------+   Replace entities in text               |
|  |   Transform       |--> "Contact `{{email:e_001}}`"           |
|  +----+-------------+                                          |
|       |                                                        |
|       v                                                        |
|  +------------------+   AES-256-GCM encrypt token map          |
|  |  Seal Session     |   12-byte random nonce, 16-byte tag     |
|  |                   |--> Encrypted blob with tenant + expiry  |
|  +----+-------------+                                          |
|       |                                                        |
+-------+--------------------------------------------------------+
        |
        v
Client Response
  {
    safe_text: "Contact `{{email:e_001}}`",
    session_id: "abc-123",
    session_state: "<encrypted-blob>",
    entities: [{token: "`{{email:e_001}}`", type: "email", ...}],
    stats: {entities_detected: 1, entities_transformed: 1}
  }

What crosses the trust boundary: Only safe_text (tokenized), session_state (encrypted), entity metadata without raw values, and statistics.

2. Rehydrate Flow

The rehydrate flow takes LLM output containing semantic tokens and restores the original values using the encrypted session mapping.

Client Request
  |  { output: "Reply to `{{email:e_001}}`", session_state: "<blob>",
  |    restore_mode: "full", output_channel: "customer_email" }
  |
  v
+---------------------------------------------------------------+
|                        TRUSTED ZONE                            |
|                                                                |
|  +----------+    Request validated                             |
|  |   Auth   |--> Scopes verified (requires: Rehydrate scope)  |
|  +----+-----+                                                  |
|       |                                                        |
|       v                                                        |
|  +------------------+   Verify GCM auth tag                    |
|  |  Unseal Session   |   Check tenant ID matches caller        |
|  |                   |   Check TTL not expired                 |
|  |                   |--> Token map: {e_001: julia@firma.de}   |
|  +----+-------------+                                          |
|       |                                                        |
|       v                                                        |
|  +------------------+   3-stage repair pipeline                |
|  |  Token Repair     |   1. Strict parse                       |
|  |                   |   2. Format repair (fix braces, case)   |
|  |                   |   3. Fuzzy resolve (single-candidate)   |
|  +----+-------------+                                          |
|       |                                                        |
|       v                                                        |
|  +------------------+   Apply restore mode per entity type     |
|  |   Rehydrate       |   full: exact value                     |
|  |                   |   partial: J***@firma.de                |
|  |                   |   masked: j*******************e         |
|  |                   |   formatted: julia@firma.de (email)     |
|  |                   |   abstract: (email on file)             |
|  |                   |   none: [REDACTED]                      |
|  |                   |--> "Reply to julia@firma.de"            |
|  +----+-------------+                                          |
|       |                                                        |
|       v                                                        |
|  +------------------+   Re-scan for LLM-hallucinated PII      |
|  |  Output Guard     |   (optional, configurable)              |
|  |                   |--> Action: Block / Mask / Warn          |
|  +----+-------------+                                          |
|       |                                                        |
+-------+--------------------------------------------------------+
        |
        v
Client Response
  {
    restored_text: "Reply to julia@firma.de",
    tokens_resolved: 1,
    tokens_unresolved: [],
    stats: {rehydrate_time_ms: 0.5}
  }

What crosses the trust boundary (inbound): Tokenized text (no raw PII) and encrypted session blob. What crosses the trust boundary (outbound): Restored text (raw PII returns to the authorized caller only).

3. RAG Flow (Retrieval-Augmented Generation)

RAG workflows have four distinct stages, each with its own protection requirements.

                    STAGE 1: DOCUMENT INGESTION
                    ===========================

  Raw Document
    |  "Julia Schneider (julia@firma.de) filed complaint #C-789..."
    |
    v
+-------------------------------------------+
|              TRUSTED ZONE                  |
|                                            |
|  Chunk document (paragraph boundaries)     |
|       |                                    |
|       v                                    |
|  Detect + Tokenize per chunk               |
|  (cross-chunk entity identity preserved)   |
|       |                                    |
|       v                                    |
|  Seal session per document                 |
|  (store blob alongside chunk metadata)     |
+-------+-----------------------------------+
        |
        v
  Vector Store (UNTRUSTED)
    Chunk: "`{{person:p_001}}` (`{{email:e_001}}`) filed complaint #C-789..."
    Session blob: <encrypted>


                    STAGE 2: QUERY TRANSFORM
                    ========================

  User Query
    |  "What happened with Julia Schneider's complaint?"
    |
    v
+-------------------------------------------+
|              TRUSTED ZONE                  |
|                                            |
|  Detect + Tokenize query                   |
|       |                                    |
|       v                                    |
|  "What happened with `{{person:p_001}}`'s    |
|   complaint?"                              |
+-------+-----------------------------------+
        |
        v
  Vector DB Search (UNTRUSTED)
    Query: "What happened with `{{person:p_001}}`'s complaint?"
    Results: matching chunks (all tokenized)


                    STAGE 3: CONTEXT ASSEMBLY
                    =========================

  Retrieved Chunks (tokenized)
    |
    v
+-------------------------------------------+
|              TRUSTED ZONE                  |
|                                            |
|  Shield: verify retrieved chunks are       |
|  properly tokenized (no raw PII leaked     |
|  from vector store corruption)             |
|       |                                    |
|       v                                    |
|  Assemble context for LLM prompt           |
|  (all tokenized, with entity_context       |
|   metadata for semantic hints)             |
+-------+-----------------------------------+
        |
        v
  LLM (UNTRUSTED)
    Prompt: system preamble + tokenized query + tokenized context


                    STAGE 4: ANSWER REHYDRATION
                    ===========================

  LLM Response (tokenized)
    |  "`{{person:p_001}}` filed complaint #C-789 on 2024-01-15..."
    |
    v
+-------------------------------------------+
|              TRUSTED ZONE                  |
|                                            |
|  Token repair (3-stage)                    |
|       |                                    |
|       v                                    |
|  Rehydrate (restore mode per policy)       |
|       |                                    |
|       v                                    |
|  Output guard (scan for hallucinated PII)  |
+-------+-----------------------------------+
        |
        v
  Final Answer
    "Julia Schneider filed complaint #C-789 on 2024-01-15..."

4. Proxy Flow

The proxy mode intercepts calls to any LLM provider API, transparently applying protection.

Application
  |  POST /v1/chat/completions
  |  { messages: [{role: "user", content: "Email julia@firma.de"}] }
  |
  v
+---------------------------------------------------------------+
|                        TRUSTED ZONE                            |
|                     (OGuardAI Proxy)                             |
|                                                                |
|  +---------------+                                             |
|  |   Intercept    |  Parse incoming request                    |
|  +-------+-------+                                             |
|          |                                                     |
|          v                                                     |
|  +---------------+  Detect + tokenize message content          |
|  |  Transform     |  Skip system messages                      |
|  |  Messages      |  Scan user + assistant messages             |
|  +-------+-------+                                             |
|          |                                                     |
|          v                                                     |
|  +---------------+                                             |
|  |  Forward       |--> Upstream LLM API (OpenAI, Anthropic,    |
|  |  Request       |    Mistral, Bedrock, local model)          |
|  +-------+-------+                                             |
|          |                                                     |
|          v                                                     |
|  +---------------+  LLM response received                      |
|  |  Receive       |  (contains only tokenized content)         |
|  |  Response      |                                            |
|  +-------+-------+                                             |
|          |                                                     |
|          v                                                     |
|  +---------------+  Repair + rehydrate tokens                  |
|  |  Rehydrate     |  Apply output channel restore mode         |
|  |  Response      |                                            |
|  +-------+-------+                                             |
|          |                                                     |
+----------+----------------------------------------------------+
           |
           v
Application
  { choices: [{message: {content: "Email julia@firma.de"}}] }

Key property: The application code does not need to change. The proxy acts as a transparent intermediary for the LLM provider URL.

5. Streaming Flow (SSE)

Streaming responses require special handling because tokens may span multiple SSE chunks.

Client Request
  |  POST /v1/transform/stream
  |  { input: "Contact julia@firma.de about order #ORD-456" }
  |
  v
+---------------------------------------------------------------+
|                        TRUSTED ZONE                            |
|                                                                |
|  Auth --> Prompt Security --> Detect --> Tokenize --> Seal      |
|                                                                |
|  Transform produces safe_text:                                 |
|    "Contact `{{email:e_001}}` about order `{{order:o_001}}`"       |
|                                                                |
|  +--------------------------+                                  |
|  |  SSE Event Generator      |                                 |
|  |  Chunk 1: "Contact "      |--> SSE: data: {"chunk": ...}   |
|  |  Chunk 2: "`{{email:e_001}}`|--> SSE: data: {"chunk": ...}   |
|  |  Chunk 3: " about order " |--> SSE: data: {"chunk": ...}   |
|  |  Chunk 4: "`{{order:o_001}}`|--> SSE: data: {"chunk": ...}   |
|  |  Final:   session_state   |--> SSE: data: {"session": ...} |
|  +--------------------------+                                  |
|                                                                |
+---------------------------------------------------------------+


Client Request (Streaming Rehydrate)
  |  POST /v1/rehydrate/stream
  |  { session_state: "<blob>" }
  |  + SSE chunks from LLM response
  |
  v
+---------------------------------------------------------------+
|                        TRUSTED ZONE                            |
|                                                                |
|  Unseal session --> Token map ready                            |
|                                                                |
|  +----------------------------------------------+             |
|  |  TokenBoundaryBuffer                          |             |
|  |                                               |             |
|  |  Incoming SSE chunk: "Reply to {em"          |             |
|  |  Buffer: holds partial token "{em"           |             |
|  |  (waits for complete token boundary)          |             |
|  |                                               |             |
|  |  Next chunk: "ail:e_001}} today"              |             |
|  |  Buffer completes: "`{{email:e_001}}`"          |             |
|  |  Rehydrate: "julia@firma.de"                  |             |
|  |  Emit: "Reply to julia@firma.de today"        |             |
|  +----------------------------------------------+             |
|                                                                |
|  Each completed chunk --> SSE output event                     |
|                                                                |
+---------------------------------------------------------------+
        |
        v
Client receives SSE stream with rehydrated content

Key property: The TokenBoundaryBuffer ensures that tokens split across SSE chunk boundaries are correctly reassembled before rehydration. Partial tokens are designed to be held until complete and are not emitted to the client.

6. File / Document Processing Flow

File upload processing for documents (PDF, DOCX, TXT, CSV, email threads).

Client Request
  |  POST /v1/transform/file
  |  Content-Type: multipart/form-data
  |  file: contract.pdf
  |
  v
+---------------------------------------------------------------+
|                        TRUSTED ZONE                            |
|                                                                |
|  +---------------+                                             |
|  |  File Ingest   |  Extract text from PDF/DOCX/TXT/CSV       |
|  |                |  (Rust-native parsers, no external calls)  |
|  +-------+-------+                                             |
|          |                                                     |
|          v                                                     |
|  +---------------+                                             |
|  |  Chunk         |  Split into paragraph-boundary chunks      |
|  |                |  Preserve document structure                |
|  +-------+-------+                                             |
|          |                                                     |
|          v                                                     |
|  +---------------+  Per-chunk detection + tokenization         |
|  |  Transform     |  Cross-chunk entity identity (same person  |
|  |  per chunk     |  in chunk 1 and chunk 5 = same token)      |
|  +-------+-------+                                             |
|          |                                                     |
|          v                                                     |
|  +---------------+                                             |
|  |  Seal Session  |  Single session for entire document        |
|  +-------+-------+                                             |
|          |                                                     |
+----------+----------------------------------------------------+
           |
           v
Client Response
  {
    safe_text: "<entire document with tokens>",
    chunks: [{text: "...", entities: [...]}],
    session_state: "<encrypted-blob>",
    stats: {chunks: 12, entities_detected: 8}
  }

7. Image Processing Flow (OCR)

Image processing with optional OCR text extraction.

Client Request
  |  POST /v1/transform/image
  |  Content-Type: multipart/form-data
  |  file: business_card.png
  |
  v
+---------------------------------------------------------------+
|                        TRUSTED ZONE                            |
|                                                                |
|  +---------------+  Tesseract OCR (optional module)            |
|  |  OCR Extract   |  Random temp file name (RAII cleanup)      |
|  |                |--> Extracted text: "Julia Schneider         |
|  |                |    +49 30 12345678 julia@firma.de"          |
|  +-------+-------+                                             |
|          |                                                     |
|          v                                                     |
|  +---------------+                                             |
|  |  Detect +      |  Standard detection pipeline               |
|  |  Tokenize      |                                            |
|  +-------+-------+                                             |
|          |                                                     |
|          v                                                     |
|  +---------------+                                             |
|  |  Transform +   |                                            |
|  |  Seal          |                                            |
|  +-------+-------+                                             |
|          |                                                     |
+----------+----------------------------------------------------+
           |
           v
Client Response
  {
    safe_text: "`{{person:p_001}}` `{{phone:ph_001}}` `{{email:e_001}}`",
    session_state: "<encrypted-blob>",
    ocr_confidence: 0.87
  }

Security note: OCR temp files use random file names and are cleaned up via RAII (Drop trait). No raw content persists on disk after the request completes.

8. Batch Processing Flow

Batch API for processing multiple inputs in a single request.

Client Request
  |  POST /v1/batch/transform
  |  { items: [
  |    {input: "Email: alice@example.com"},
  |    {input: "Call Bob at +1-555-0123"},
  |    {input: "IBAN: DE89370400440532013000"}
  |  ]}
  |
  v
+---------------------------------------------------------------+
|                        TRUSTED ZONE                            |
|                                                                |
|  +----------------------------------------------+             |
|  |  For each item (parallel where possible):     |             |
|  |    1. Detect entities                         |             |
|  |    2. Tokenize                                |             |
|  |    3. Policy evaluate                         |             |
|  |    4. Transform                               |             |
|  +-------+--------------------------------------+             |
|          |                                                     |
|          v                                                     |
|  +---------------+  Merge all token maps                       |
|  |  Seal Session  |  Single session for entire batch           |
|  +-------+-------+                                             |
|          |                                                     |
+----------+----------------------------------------------------+
           |
           v
Client Response
  {
    items: [
      {safe_text: "Email: `{{email:e_001}}`", entities: [...]},
      {safe_text: "Call `{{person:p_001}}` at `{{phone:ph_001}}`", entities: [...]},
      {safe_text: "IBAN: `{{iban:ib_001}}`", entities: [...]}
    ],
    session_state: "<encrypted-blob>",
    stats: {total_entities: 4, items_processed: 3}
  }

Data Classification Summary

Data Type	Where It Exists	Encryption
Raw PII (original values)	Trusted Zone only (in-memory during request)	Designed to prevent unencrypted persistence
Token map (token ID to value)	Sealed session blob	AES-256-GCM
Tokenized text	Untrusted Zone (LLMs, vector stores, logs)	Not encrypted (contains no PII)
Entity metadata (type, gender, formality)	Untrusted Zone (entity_context)	Not encrypted (contains no PII)
Session blob	Client-carried (sealed) or server-stored (Redis)	AES-256-GCM (sealed) or Redis encryption
Audit events	Log output	No PII present; encryption per SIEM policy

Data Flow Diagrams