
Extending Entities

Add custom entity types to OGuardAI with regex patterns and no fork required

Overview

OGuardAI ships with 15 built-in entity types (18 with NER), including Person, Email, Phone, Company, CustomerId, Order, Address, Location, Iban, Ssn, Ip, Url, CreditCard, DateOfBirth, Passport, and HealthId. Custom types can be added via regex patterns using the EntityType::Custom(String) variant -- no fork required.

This guide walks through each file you need to touch and what works automatically versus what needs explicit support.

Compatibility Matrix

Feature                        | Auto-compatible                  | Needs custom code
-------------------------------|----------------------------------|-----------------------------
Detection                      | Via pattern                      | Pattern definition
Tokenization                   | Yes (automatic)                  | -
Transform                      | Yes (automatic)                  | -
Rehydrate (full mode)          | Yes (automatic)                  | -
Rehydrate (formatted/abstract) | No                               | Custom restore template
Output guard                   | Yes (uses default_action)        | Optional per-entity override
Policy rules                   | Yes (entity_type matching)       | -
Revocation                     | Yes (automatic)                  | -
Capabilities endpoint          | Automatic if added to EntityType | -

Step-by-Step: Adding a Custom Entity Type

1. Register the Pattern in crates/detector-builtins/src/patterns.rs

Add a new DetectionPattern entry inside the all_patterns() vector. Use the EntityType::Custom("your_type_name".to_owned()) variant.

DetectionPattern {
    entity_type: EntityType::Custom("de_tax_id".to_owned()),
    base_confidence: 0.8,
    context_words: &["steuer", "identifikation", "steuernummer", "tax"],
    regex: OnceLock::new(),
    raw_pattern: r"\b\d{2}\s?\d{3}\s?\d{3}\s?\d{3}\b",
    value_group: None,
}

Key fields:

  • entity_type -- use Custom("name".to_owned()), lowercase with underscores.
  • base_confidence -- 0.0-1.0, boosted by up to 0.2 when context_words appear nearby.
  • raw_pattern -- Rust regex, compiled lazily.
  • value_group -- Some(N) if only capture group N is the entity value (see DateOfBirth for an example); None when the full match is the entity.
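The context boost on base_confidence can be pictured with a small, standard-library-only sketch. The function name boosted_confidence, the flat 0.2 boost on any hit, and the whole-text substring check are assumptions made for illustration; the actual scoring in OGuardAI may weigh proximity or the number of hits differently.

```rust
/// Hypothetical helper, not part of OGuardAI's API.
/// Assumption: any context word found anywhere in the text earns the full
/// 0.2 boost, and the result is capped at 1.0.
fn boosted_confidence(base: f32, text: &str, context_words: &[&str]) -> f32 {
    const MAX_BOOST: f32 = 0.2;
    // Compare case-insensitively so "Tax" and "tax" both count.
    let lowered = text.to_lowercase();
    let hit = context_words
        .iter()
        .any(|word| lowered.contains(&word.to_lowercase()));
    let score = if hit { base + MAX_BOOST } else { base };
    score.clamp(0.0, 1.0)
}
```

Under these assumptions, a base of 0.8 rises to 1.0 for "My tax ID is 12 345 678 901" and stays at 0.8 when none of the context words appear.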

2. Custom Restore Behavior (Optional)

For full, partial (generic), and masked restore modes, custom entities work automatically -- no changes needed. The generic fallback in crates/rehydrate/src/restore.rs handles them, and also supplies a default for abstract mode that is not type-aware:

  • Full -- Returns the original value as-is.
  • Partial -- Shows first char, masks middle, shows last char.
  • Masked -- Replaces all interior characters with *.
  • Abstract -- Returns "(custom:de_tax_id on file)".
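The fallback behavior listed above can be approximated as follows. This is a standalone sketch, not the code in restore.rs: RestoreModeSketch and generic_restore are hypothetical names, the Masked mode is left out because its exact output is not spelled out here, and returning values of two characters or fewer unchanged in Partial mode is an assumption.

```rust
/// Stand-in for the restore modes described above (Masked omitted).
enum RestoreModeSketch {
    Full,
    Partial,
    Abstract,
}

/// Hypothetical generic fallback for a custom entity type.
fn generic_restore(mode: RestoreModeSketch, type_name: &str, original: &str) -> String {
    match mode {
        // Full: return the original value as-is.
        RestoreModeSketch::Full => original.to_owned(),
        // Partial: keep the first and last character, mask the middle.
        RestoreModeSketch::Partial => {
            let chars: Vec<char> = original.chars().collect();
            if chars.len() <= 2 {
                // Assumption: nothing to mask in very short values.
                return original.to_owned();
            }
            let mut out = String::with_capacity(original.len());
            out.push(chars[0]);
            out.extend(std::iter::repeat('*').take(chars.len() - 2));
            out.push(chars[chars.len() - 1]);
            out
        }
        // Abstract: a placeholder that reveals only the type name.
        RestoreModeSketch::Abstract => format!("(custom:{} on file)", type_name),
    }
}
```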

To add a type-specific formatted or abstract restore, add a match arm in formatted_restore() or abstract_restore() in crates/rehydrate/src/restore.rs:

EntityType::Custom(ref name) if name == "de_tax_id" => {
    format!("Steuer-ID: {}", token.original_value)
}
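If the stored value may or may not contain spaces, a type-specific arm could normalize it before formatting. The sketch below is standalone: format_de_tax_id is a hypothetical helper, and falling back to the raw value when the digit count is not 11 is an assumption rather than documented behavior.

```rust
/// Hypothetical helper: regroup an 11-digit tax ID as XX XXX XXX XXX.
fn format_de_tax_id(original: &str) -> String {
    // Keep only the digits so "12345678901" and "12 345 678 901" agree.
    let digits: Vec<char> = original.chars().filter(|c| c.is_ascii_digit()).collect();
    if digits.len() != 11 {
        // Assumption: leave unexpected values untouched.
        return format!("Steuer-ID: {}", original);
    }
    let mut grouped = String::with_capacity(14);
    for (i, digit) in digits.iter().enumerate() {
        // A space precedes the digits at index 2, 5 and 8.
        if i == 2 || i == 5 || i == 8 {
            grouped.push(' ');
        }
        grouped.push(*digit);
    }
    format!("Steuer-ID: {}", grouped)
}
```

Inside the match arm, such a helper would be called with token.original_value in place of the plain format! call.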

3. Output Guard Override (Optional)

Custom entities inherit the default_action from OutputGuardConfig. To set a specific action, add an entry to the entity_actions map in your policy config:

output_guard:
  enabled: true
  mode: strict
  default_action: mask
  entity_actions:
    "custom:de_tax_id": block

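The lookup order implied by this config (per-entity override first, default_action otherwise) can be sketched as below. GuardConfigSketch, GuardAction, and action_for are hypothetical stand-ins; the real OutputGuardConfig may have more fields and more actions.

```rust
use std::collections::HashMap;

/// Hypothetical subset of the actions an output guard might take.
#[derive(Debug, Clone, Copy, PartialEq)]
enum GuardAction {
    Mask,
    Block,
}

/// Stand-in for the two config fields discussed above.
struct GuardConfigSketch {
    default_action: GuardAction,
    entity_actions: HashMap<String, GuardAction>,
}

impl GuardConfigSketch {
    /// Use the per-entity override when present, otherwise the default.
    fn action_for(&self, entity_key: &str) -> GuardAction {
        self.entity_actions
            .get(entity_key)
            .copied()
            .unwrap_or(self.default_action)
    }
}
```

With the YAML above, "custom:de_tax_id" would resolve to block, while any other custom type would fall through to mask.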
4. Policy Rules (Automatic)

Policy YAML files match on entity type strings. Custom entities are referenced as custom:your_name:

rules:
  - entity_type: "custom:de_tax_id"
    action: tokenize
    restore_mode: masked
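The custom:your_name convention can be illustrated with a pair of small helpers. Both function names are hypothetical, and the name check (lowercase ASCII letters, digits, underscores) is an assumption based on the naming guidance in step 1, not a rule confirmed by the policy loader.

```rust
/// Hypothetical helper: build the policy key for a custom entity name.
fn custom_key(name: &str) -> String {
    format!("custom:{}", name)
}

/// Hypothetical helper: extract the custom name from a policy key.
/// Returns None for keys without the "custom:" prefix or with a name
/// that breaks the assumed lowercase-with-underscores convention.
fn parse_custom_key(key: &str) -> Option<&str> {
    let name = key.strip_prefix("custom:")?;
    let valid = !name.is_empty()
        && name
            .chars()
            .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '_');
    if valid {
        Some(name)
    } else {
        None
    }
}
```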

Example: German Tax ID End-to-End

The German Tax ID (Steuer-Identifikationsnummer, 11 digits, format XX XXX XXX XXX) already ships as a built-in custom pattern -- the code snippet in step 1 above is its exact definition. With the pattern registered:

  • Token output: {{custom:x_a1b2}} -- the x_ prefix is shared by all custom types (see EntityType::id_prefix()).
  • Protection level: Defaults to level 2 (reversible tokenization). Override to level 1 via policy rules if needed.

Testing Custom Entities

Add a test in crates/detector-builtins/src/patterns.rs alongside the existing pattern tests:

#[test]
fn de_tax_id_matches() {
    let patterns = all_patterns();
    let tax = patterns
        .iter()
        .find(|p| matches!(&p.entity_type, EntityType::Custom(s) if s == "de_tax_id"))
        .expect("de_tax_id pattern exists");
    assert!(tax.regex().is_match("12 345 678 901"));
    assert!(tax.regex().is_match("12345678901"));
}

Validate the full pipeline with cargo test --workspace, then send a live request to confirm end-to-end tokenization:

curl -X POST http://localhost:3000/v1/transform \
  -H "Content-Type: application/json" \
  -d '{"input": "My tax ID is 12 345 678 901", "input_type": "text"}'

Limitations

  • Regex only. The EntityType::Custom path supports regex-based detection. Entities that require NER, ML models, or context-dependent logic (e.g., person names, company names) need the Python detector sidecar (apps/detector-py/). Extend the sidecar's spaCy or GLiNER pipeline for those cases.
  • No custom id_prefix. All custom entities share the x_ token prefix. If you need a distinct prefix, add a named variant to the EntityType enum in crates/core/src/types.rs with its own id_prefix() arm.
  • Protection level. Custom entities default to level 2. To enforce level 1 (block), configure it via policy rules rather than code changes.
  • Formatted/abstract restore. Generic fallbacks work but are not type-aware. Add explicit match arms if you need domain-specific formatting.