Why OGuardAI?
How OGuardAI compares to Presidio, AWS Comprehend, and Azure AI Language, and when to use each
One-Sentence Answer
OGuardAI runs anywhere, works with any LLM, and keeps your data inside your infrastructure.
The Problem
Existing PII tools (Presidio, AWS Comprehend, Azure AI Language) solve detection. They find sensitive data and mask it. But masking destroys information permanently -- there is no way to restore the original values after the LLM responds.
OGuardAI solves a different problem: round-trip data protection for AI pipelines. It replaces sensitive values with semantic tokens that LLMs can reason about, then deterministically restores the originals in the output. Detection is one step in the pipeline, not the entire product.
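The round-trip idea can be sketched in a few lines of Python. The `tokenize` and `restore` helpers below are illustrative stand-ins for the concept, not OGuardAI's actual API, and only emails are detected to keep the example short:

```python
import re

# Illustrative round-trip: replace PII with typed tokens, send the
# tokenized text to the LLM, then restore originals in the response.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def tokenize(text: str) -> tuple[str, dict[str, str]]:
    """Replace each email with a {{email:NNNN}} token; return text plus vault."""
    vault: dict[str, str] = {}
    def _sub(m: re.Match) -> str:
        token = f"{{{{email:{len(vault):04x}}}}}"
        vault[token] = m.group(0)
        return token
    return EMAIL_RE.sub(_sub, text), vault

def restore(text: str, vault: dict[str, str]) -> str:
    """Deterministically put the original values back."""
    for token, original in vault.items():
        text = text.replace(token, original)
    return text

safe, vault = tokenize("Contact jane@example.com about the invoice.")
# The LLM sees only the token, never the real address.
assert "jane@example.com" not in safe
llm_reply = f"I emailed {list(vault)[0]} this morning."   # simulated LLM output
print(restore(llm_reply, vault))  # -> "I emailed jane@example.com this morning."
```

Detection-only tools stop after the masking step; the vault and the `restore` pass are what make the protection reversible.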
Comparison Table
| Capability | OGuardAI | Microsoft Presidio | AWS Comprehend | Azure AI Language |
|---|---|---|---|---|
| Self-hosted / air-gapped | Yes | Yes | No (AWS only) | No (Azure only) |
| Works with any LLM | Yes | N/A (detection only) | N/A (detection only) | N/A (detection only) |
| Semantic tokenization (reversible) | Yes | No (irreversible masking) | No | No |
| Round-trip restore | Yes (6 modes) | No | No | No |
| Streaming support (SSE) | Yes | No | No | No |
| RAG pipeline protection | Yes (ingest/query/context/answer) | No | No | No |
| Structured JSON protection | Yes (path-aware) | No | No | No |
| Policy engine | Yes (per-entity, per-channel) | Partial | No | No |
| Output guard (second-pass) | Yes | No | No | No |
| Entity revocation / GDPR delete | Yes (cascade + bulk) | No | No | No |
| Token repair (LLM damage recovery) | Yes (3-stage) | N/A | N/A | N/A |
| Multi-language NER | Yes (GLiNER + spaCy) | Yes (spaCy) | Limited | Yes |
| Latency (p95) | <10ms built-in, <200ms NER | ~50ms | 100-500ms (API) | 100-500ms (API) |
| Open source | Yes (Apache-2.0) | Yes (MIT) | No | No |
| Vendor lock-in | None | None | AWS | Azure |
Architecture Difference
The fundamental difference is one-way masking versus round-trip protection.
Presidio / AWS Comprehend / Azure AI Language
Input -> Detect -> Mask -> (data destroyed)

The original values are gone. The masked text goes to the LLM, and the LLM response cannot reference any real data. This works for logging and compliance scanning, but it makes AI responses generic and impersonal.
OGuardAI
Input -> Detect -> Tokenize -> Transform -> [LLM] -> Restore

Sensitive values are replaced with typed semantic tokens ({{email:e7a3}}, {{person:a1b2}}). The tokens carry safe metadata (gender, formality, language) so LLMs generate contextually correct output. After the LLM responds, OGuardAI deterministically restores the original values based on policy. No data is destroyed. No data leaves your infrastructure.
Six restore modes control what happens on output:
| Mode | Behavior |
|---|---|
| full | Restore original value |
| partial | Restore with partial masking (e.g., j***@example.com) |
| masked | Replace with consistent mask (e.g., [EMAIL]) |
| formatted | Type-specific display (e.g., *** *** 1234 for SSN) |
| abstract | Category-level reference (e.g., "a financial identifier") |
| none | Strip token entirely |
Different output channels (agent, customer, log, audit) can use different restore modes for the same entity, controlled by policy.
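A minimal sketch of the six modes as a dispatch function, with a per-channel policy on top. The function name, the mode strings, and the masking formats mirror the table above, but the implementation is illustrative, not OGuardAI's actual code:

```python
# Illustrative dispatch of the six restore modes described above.
def apply_restore_mode(mode: str, original: str, entity_type: str) -> str:
    if mode == "full":
        return original
    if mode == "partial":                       # e.g., j***@example.com
        local, _, domain = original.partition("@")
        return f"{local[0]}***@{domain}" if domain else f"{original[0]}***"
    if mode == "masked":                        # e.g., [EMAIL]
        return f"[{entity_type.upper()}]"
    if mode == "formatted":                     # e.g., *** *** 1234
        return f"*** *** {original[-4:]}"
    if mode == "abstract":                      # category-level reference
        return f"a {entity_type} identifier"
    if mode == "none":                          # strip token entirely
        return ""
    raise ValueError(f"unknown restore mode: {mode}")

# Different channels can request different modes for the same entity:
channel_policy = {"agent": "full", "customer": "partial", "log": "masked"}
for channel, mode in channel_policy.items():
    print(channel, "->", apply_restore_mode(mode, "jane@example.com", "email"))
```

The same entity thus renders as the real address for the agent, a partially masked address for the customer, and a type label in the logs.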
When to Use What
Use OGuardAI when:
- You need reversible protection -- LLM output must contain real data (customer names, emails, account numbers) for the end user, while the LLM itself never sees it.
- You run RAG pipelines and need protection at ingest, query, context assembly, and answer generation.
- You need streaming (SSE) with real-time tokenization and restoration.
- You require self-hosted or air-gapped deployment with zero external calls.
- You work with any LLM provider (OpenAI, Anthropic, Mistral, Bedrock, local models) and do not want vendor lock-in.
- You need structured JSON protection that understands JSON paths and preserves structure.
- You need GDPR entity revocation -- delete a person's data from all sessions in one operation.
- You need policy-driven control over what gets masked, passed through, or blocked, with different rules per output channel.
Use Presidio when:
- You only need detection and irreversible masking -- no LLM round-trip, no restoration.
- You are building compliance scanning or log sanitization where destroyed data is acceptable.
- You do not need streaming, RAG protection, or structured JSON handling.
Use AWS Comprehend or Azure AI Language when:
- You are already committed to that cloud provider and only need entity detection as a service.
- You do not need restoration, streaming, or self-hosted deployment.
- API latency (100-500ms per call) is acceptable for your use case.
Key Differentiators
Semantic tokens, not masks. OGuardAI tokens like {{person:a1b2}} carry type
information and metadata. LLMs understand they refer to a person and generate
grammatically correct, gender-aware, formality-appropriate text around them.
A simple [REDACTED] or **** gives the LLM nothing to work with.
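The idea can be sketched as a token parser: the type comes from the token itself, and the id keys into safe metadata. The token grammar and the metadata fields shown here are assumptions for illustration, not OGuardAI's actual schema:

```python
import re

# Parse a semantic token like {{person:a1b2}} into its type and id, then
# attach safe metadata (never the original value) keyed by the id.
TOKEN_RE = re.compile(r"\{\{(?P<type>\w+):(?P<id>[0-9a-f]+)\}\}")

SAFE_METADATA = {"a1b2": {"gender": "female", "formality": "formal", "language": "en"}}

def describe(token: str) -> dict:
    """Return the type and safe metadata a downstream LLM may see."""
    m = TOKEN_RE.fullmatch(token)
    if m is None:
        raise ValueError(f"not a semantic token: {token}")
    return {"type": m.group("type"), **SAFE_METADATA.get(m.group("id"), {})}

print(describe("{{person:a1b2}}"))
```

An LLM prompted with this metadata can choose the right pronouns and register without ever seeing the person's name.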
Token repair. LLMs sometimes modify tokens in their output (extra spaces, changed brackets, partial tokens). OGuardAI's 3-stage repair pipeline (strict match, pattern repair, fuzzy resolution) recovers damaged tokens before restoration. Detection-only tools do not face this problem because they never restore.
Output guard. After the LLM responds, OGuardAI optionally re-scans the output for any new PII the model may have hallucinated or leaked. This second-pass protection catches data that was never in the input.
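A minimal sketch of that second pass: re-scan the final text and block any detected PII that is not on the allow-list of legitimately restored values. The patterns and the `[BLOCKED:*]` replacement format are assumptions for illustration:

```python
import re

# Second-pass output guard: catch PII the model hallucinated or leaked,
# i.e. values that were never tokenized on the way in.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def output_guard(text: str, allowed: set[str]) -> str:
    """Mask detected PII unless it is in the allow-list of restored values."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(
            lambda m: m.group(0) if m.group(0) in allowed else f"[BLOCKED:{label.upper()}]",
            text,
        )
    return text

# "jane@example.com" was legitimately restored; the SSN came from nowhere.
reply = "Reach jane@example.com; her SSN is 123-45-6789."
print(output_guard(reply, allowed={"jane@example.com"}))
```

The allow-list is what distinguishes restored data (expected) from leaked data (blocked).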
Stateless by design. Session state travels as an encrypted blob (AES-GCM) alongside the request. There is no server-side session store required, though Redis and in-memory backends are planned for a future release. This makes horizontal scaling trivial and air-gapped deployment straightforward.
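The stateless pattern can be sketched with the standard library. OGuardAI uses AES-GCM for confidentiality; to stay stdlib-only, this sketch substitutes an HMAC-signed (integrity-only) blob, which demonstrates the round-trip but does not hide the contents:

```python
import base64, hashlib, hmac, json

# Session state rides with the request as an opaque blob and is
# reconstructed on the next hop -- no server-side session store.
SECRET = b"demo-key-not-for-production"

def pack(state: dict) -> str:
    payload = json.dumps(state, sort_keys=True).encode()
    tag = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return base64.urlsafe_b64encode(tag + payload).decode()

def unpack(blob: str) -> dict:
    raw = base64.urlsafe_b64decode(blob)
    tag, payload = raw[:32], raw[32:]
    if not hmac.compare_digest(tag, hmac.new(SECRET, payload, hashlib.sha256).digest()):
        raise ValueError("session blob was tampered with")
    return json.loads(payload)

blob = pack({"{{email:e7a3}}": "jane@example.com"})
assert unpack(blob) == {"{{email:e7a3}}": "jane@example.com"}
```

Because every request carries its own state, any replica can serve any request, which is what makes horizontal scaling trivial.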
Provider-neutral. OGuardAI sits between your application and any LLM. It does not depend on OpenAI, Anthropic, AWS, or Azure. Swap providers without changing your protection layer.
Summary
Presidio, AWS Comprehend, and Azure AI Language are detection tools. They answer the question: "Where is the sensitive data?"
OGuardAI is a protection runtime. It answers a different question: "How do I use AI safely with sensitive data and still get useful results?"
If you only need to find and redact PII, the existing tools work well. If you need to protect data through an entire AI pipeline and restore it on the other side, that is what OGuardAI is built for.