Extending Entities
Add custom entity types to OGuardAI with regex patterns and no fork required
Overview
OGuardAI ships with 15 built-in entity types (18 with NER), including Person, Email, Phone, Company,
CustomerId, Order, Address, Location, Iban, Ssn, Ip, Url, CreditCard,
DateOfBirth, Passport, and HealthId. Custom types can be added via regex patterns
using the EntityType::Custom(String) variant -- no fork required.
This guide walks through each file you need to touch and what works automatically versus what needs explicit support.
Compatibility Matrix
| Feature | Auto-compatible | Needs custom code |
|---|---|---|
| Detection | Via pattern | Pattern definition |
| Tokenization | Yes (automatic) | - |
| Transform | Yes (automatic) | - |
| Rehydrate (full mode) | Yes (automatic) | - |
| Rehydrate (formatted/abstract) | No | Custom restore template |
| Output guard | Yes (uses default_action) | Optional per-entity override |
| Policy rules | Yes (entity_type matching) | - |
| Revocation | Yes (automatic) | - |
| Capabilities endpoint | Automatic if added to EntityType | - |
Step-by-Step: Adding a Custom Entity Type
1. Register the Pattern in crates/detector-builtins/src/patterns.rs
Add a new DetectionPattern entry inside the all_patterns() vector. Use the
EntityType::Custom("your_type_name".to_owned()) variant.
DetectionPattern {
    entity_type: EntityType::Custom("de_tax_id".to_owned()),
    base_confidence: 0.8,
    context_words: &["steuer", "identifikation", "steuernummer", "tax"],
    regex: OnceLock::new(),
    raw_pattern: r"\b\d{2}\s?\d{3}\s?\d{3}\s?\d{3}\b",
    value_group: None,
}
Key fields:
- entity_type -- use Custom("name".to_owned()), lowercase with underscores.
- base_confidence -- 0.0-1.0, boosted by up to 0.2 when context_words appear nearby.
- raw_pattern -- Rust regex, compiled lazily.
- value_group -- Some(N) if only capture group N is the entity value (see DateOfBirth for an example); None when the full match is the entity.
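The interaction between base_confidence and context_words can be pictured with a small sketch. It is not OGuardAI's scoring code: the helper name `boosted_confidence`, the flat boost, and the case-insensitive substring match are all assumptions made for illustration. Only the 0.0-1.0 range and the 0.2 upper bound come from this guide.

```rust
/// Hypothetical helper, not part of OGuardAI.
/// Assumption: any context word found in the surrounding text earns the
/// full boost; the real detector may scale the boost differently.
fn boosted_confidence(base: f32, surrounding_text: &str, context_words: &[&str]) -> f32 {
    // Upper bound on the boost, as stated in this guide.
    const MAX_BOOST: f32 = 0.2;

    // Case-insensitive substring search. Note that this also matches
    // inside longer words (for example "tax" inside "syntax").
    let haystack = surrounding_text.to_lowercase();
    let hit = context_words
        .iter()
        .any(|word| haystack.contains(&word.to_lowercase()));

    let boosted = if hit { base + MAX_BOOST } else { base };

    // Keep the result inside the documented 0.0-1.0 range.
    boosted.clamp(0.0, 1.0)
}
```

Under these assumptions, a match with base_confidence 0.8 next to the word "Steuer" is clamped at 1.0, while the same match without any context word stays at 0.8.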
2. Custom Restore Behavior (Optional)
For full, partial (generic), and masked restore modes, custom entities
work automatically -- no changes needed. The generic fallback in
crates/rehydrate/src/restore.rs handles them:
- Full -- Returns the original value as-is.
- Partial -- Shows first char, masks middle, shows last char.
- Masked -- Replaces all interior characters with *.
- Abstract -- Returns "(custom:de_tax_id on file)".
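To make the generic fallbacks concrete, here is a minimal sketch of the Partial and Abstract behaviors as the list above describes them. The function names `partial_fallback` and `abstract_fallback` are hypothetical, and the handling of very short values is an assumption. The actual logic lives in crates/rehydrate/src/restore.rs and may differ.

```rust
/// Partial fallback as described above: keep the first and last
/// character and mask everything in between.
/// Hypothetical sketch, not the code in crates/rehydrate/src/restore.rs.
fn partial_fallback(original: &str) -> String {
    let chars: Vec<char> = original.chars().collect();
    if chars.len() <= 2 {
        // Assumption: values without a middle section are masked entirely.
        return "*".repeat(chars.len());
    }
    let mut out = String::with_capacity(original.len());
    out.push(chars[0]);
    out.extend(std::iter::repeat('*').take(chars.len() - 2));
    out.push(chars[chars.len() - 1]);
    out
}

/// Abstract fallback: describe the entity without revealing its value.
/// `type_key` is the entity type string, e.g. "custom:de_tax_id".
fn abstract_fallback(type_key: &str) -> String {
    format!("({} on file)", type_key)
}
```

With this sketch, the 11-digit value 12345678901 keeps its leading and trailing 1 and has the nine digits in between masked, and abstract_fallback("custom:de_tax_id") produces the string quoted in the list.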
To add a type-specific formatted or abstract restore, add a match arm in
formatted_restore() or abstract_restore() in crates/rehydrate/src/restore.rs:
EntityType::Custom(ref name) if name == "de_tax_id" => {
    format!("Steuer-ID: {}", token.original_value)
}
3. Output Guard Override (Optional)
Custom entities inherit the default_action from OutputGuardConfig. To set a
specific action, add an entry to the entity_actions map in your policy config:
output_guard:
  enabled: true
  mode: strict
  default_action: mask
  entity_actions:
    "custom:de_tax_id": block
4. Policy Rules (Automatic)
Policy YAML files match on entity type strings. Custom entities are referenced
as custom:your_name:
rules:
  - entity_type: "custom:de_tax_id"
    action: tokenize
    restore_mode: masked
Example: German Tax ID End-to-End
The German Tax ID (Steuer-Identifikationsnummer, 11 digits, format
XX XXX XXX XXX) is already shipped as a built-in custom pattern -- the code
snippet in step 1 above is its exact definition. After adding the pattern:
- Token output: {{custom:x_a1b2}} -- the x_ prefix is shared by all custom types (see EntityType::id_prefix()).
- Protection level: Defaults to level 2 (reversible tokenization). Override to level 1 via policy rules if needed.
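The shared prefix can be illustrated with a simplified stand-in for the enum. This is a sketch under stated assumptions, not the definition in crates/core/src/types.rs: the `token_label` and `format_token` helpers are hypothetical, the Email prefix is a placeholder chosen only for contrast, and the token layout is inferred from the single example above. Only the x_ prefix for Custom is taken from this guide.

```rust
/// Simplified stand-in for the real EntityType enum (two variants only).
#[derive(Debug, Clone, PartialEq)]
enum EntityType {
    Email,
    Custom(String),
}

impl EntityType {
    /// Short prefix placed in front of the token id.
    fn id_prefix(&self) -> &'static str {
        match self {
            // Placeholder value for illustration; the real prefix may differ.
            EntityType::Email => "e_",
            // Stated in this guide: every custom type shares x_.
            EntityType::Custom(_) => "x_",
        }
    }

    /// Hypothetical helper: the label shown before the colon in a token.
    fn token_label(&self) -> &'static str {
        match self {
            EntityType::Email => "email",
            EntityType::Custom(_) => "custom",
        }
    }
}

/// Hypothetical helper: builds a token shaped like {{custom:x_a1b2}}.
fn format_token(entity_type: &EntityType, short_id: &str) -> String {
    // Doubled braces are literal braces in format strings.
    format!(
        "{{{{{}:{}{}}}}}",
        entity_type.token_label(),
        entity_type.id_prefix(),
        short_id
    )
}
```

In this sketch the custom name never reaches the token: Custom("de_tax_id") and any other custom type both produce tokens starting with {{custom:x_, which is why a distinct prefix requires a named variant.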
Testing Custom Entities
Add a test in crates/detector-builtins/src/patterns.rs alongside the existing
pattern tests:
#[test]
fn de_tax_id_matches() {
    let patterns = all_patterns();
    let tax = patterns
        .iter()
        .find(|p| matches!(&p.entity_type, EntityType::Custom(s) if s == "de_tax_id"))
        .expect("de_tax_id pattern exists");
    assert!(tax.regex().is_match("12 345 678 901"));
    assert!(tax.regex().is_match("12345678901"));
}
Validate the full pipeline with cargo test --workspace, then send a live
request to confirm end-to-end tokenization:
curl -X POST http://localhost:3000/v1/transform \
  -H "Content-Type: application/json" \
  -d '{"input": "My tax ID is 12 345 678 901", "input_type": "text"}'
Limitations
- Regex only. The EntityType::Custom path supports regex-based detection. Entities that require NER, ML models, or context-dependent logic (e.g., person names, company names) need the Python detector sidecar (apps/detector-py/). Extend the sidecar's spaCy or GLiNER pipeline for those cases.
- No custom id_prefix. All custom entities share the x_ token prefix. If you need a distinct prefix, add a named variant to the EntityType enum in crates/core/src/types.rs with its own id_prefix() arm.
- Protection level. Custom entities default to level 2. To enforce level 1 (block), configure it via policy rules rather than code changes.
- Formatted/abstract restore. Generic fallbacks work but are not type-aware. Add explicit match arms if you need domain-specific formatting.
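For the last point, a self-contained sketch of how a type-aware arm can sit in front of a generic fallback. The `Token` struct, the one-variant enum, and the function name `formatted_restore_sketch` are hypothetical stand-ins. Only the match guard on the custom name and the `original_value` field follow the snippet in step 2, and the fallback returning the value unchanged is an assumption.

```rust
/// Simplified stand-in for the real enum (custom variant only).
#[derive(Debug, Clone, PartialEq)]
enum EntityType {
    Custom(String),
}

/// Hypothetical token shape; the real struct has more fields.
struct Token {
    entity_type: EntityType,
    original_value: String,
}

/// Type-aware formatting for one custom entity, generic fallback otherwise.
fn formatted_restore_sketch(token: &Token) -> String {
    match &token.entity_type {
        // The guard selects a single custom type by its name.
        EntityType::Custom(name) if name == "de_tax_id" => {
            format!("Steuer-ID: {}", token.original_value)
        }
        // Assumption: every other type gets its value back unchanged.
        _ => token.original_value.clone(),
    }
}
```

A de_tax_id token holding 12 345 678 901 would be restored as "Steuer-ID: 12 345 678 901" here, while a custom type without its own arm falls through to the generic branch.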