# Performance Benchmarks
Throughput, latency, and scaling measurements for OGuardAI.
Measured on a single instance: 1 CPU core (AMD EPYC 7763), 512 MB RAM, Linux 6.1, Rust release build. All numbers collected with wrk2 (constant throughput) over 60-second runs, 100 concurrent connections.
## Throughput (operations/sec)
| Operation | Builtin Only | Builtin + NER | Notes |
|---|---|---|---|
| Transform (single) | 2,800 | 45 | NER bound by sidecar round-trip |
| Rehydrate (single) | 12,000 | 12,000 | No external calls |
| Detect (single) | 3,200 | 50 | Same detection path as transform |
| Batch transform (10 items) | 1,400 batches/s | 30 batches/s | Per-batch, not per-item |
| Batch transform (50 items) | 350 batches/s | 15 batches/s | Per-batch, not per-item |
## Latency by Operation (builtin-only mode)
| Operation | p50 | p95 | p99 |
|---|---|---|---|
| Transform | 0.8 ms | 2.1 ms | 4.2 ms |
| Rehydrate | 0.06 ms | 0.3 ms | 0.7 ms |
| Detect | 0.7 ms | 1.8 ms | 3.9 ms |
| Session seal | 0.05 ms | 0.08 ms | 0.12 ms |
| Session unseal | 0.05 ms | 0.08 ms | 0.12 ms |
## Latency by Payload Size (builtin-only transform)
| Payload | Size | Entities | p50 | p95 | p99 |
|---|---|---|---|---|---|
| Small | Under 500 chars | 1-3 | 0.4 ms | 1.2 ms | 2.8 ms |
| Medium | 1-5 KB | 5-15 | 0.9 ms | 2.5 ms | 5.1 ms |
| Large | 10-100 KB | 20-80 | 3.2 ms | 8.4 ms | 14 ms |
## Latency by Payload Size (builtin + NER transform)
| Payload | Size | Entities | p50 | p95 | p99 |
|---|---|---|---|---|---|
| Small | Under 500 chars | 1-5 | 18 ms | 65 ms | 130 ms |
| Medium | 1-5 KB | 8-25 | 45 ms | 120 ms | 190 ms |
| Large | 10-100 KB | 30-100 | 110 ms | 350 ms | 680 ms |
NER latency is dominated by the Python sidecar (GLiNER model inference + network round-trip). Rehydrate latency is identical in both modes since it performs local string replacement only.
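Rehydrate's speed follows directly from its simplicity: it is a pure in-memory substitution of placeholder tokens back to their original values. A minimal sketch of that idea (the function name, token format, and mapping type here are illustrative, not the actual OGuardAI API):

```rust
use std::collections::HashMap;

/// Illustrative stand-in for a decrypted session blob: maps
/// placeholder tokens back to the original values they replaced.
fn rehydrate(text: &str, mapping: &HashMap<&str, &str>) -> String {
    let mut out = text.to_string();
    // Plain local string replacement, no network or disk I/O --
    // this is why rehydrate latency is identical in both modes.
    for (token, original) in mapping {
        out = out.replace(token, original);
    }
    out
}

fn main() {
    let mut mapping = HashMap::new();
    mapping.insert("<EMAIL_1>", "alice@example.com");
    let restored = rehydrate("Contact <EMAIL_1> for access.", &mapping);
    assert_eq!(restored, "Contact alice@example.com for access.");
    println!("{restored}");
}
```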
## Builtin vs. Builtin + NER: Impact Summary
| Metric | Builtin Only | Builtin + NER | Overhead Factor |
|---|---|---|---|
| p50 transform | 0.8 ms | 45 ms | ~56x |
| p99 transform | 4.2 ms | 190 ms | ~45x |
| Throughput | 2,800 ops/s | 45 ops/s | ~62x lower |
| Entity types | 15 | 18 | +person, company, location |
The overhead comes entirely from the NER sidecar. If NER is configured but the sidecar is unreachable, each request waits out a 5-second timeout before falling back to builtin-only detection.
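That failure mode is a timeout-plus-fallback pattern. A hedged sketch of the shape (the 5-second budget mirrors the text; the type and function names are ours, and the sidecar call is a placeholder, not the actual implementation):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Illustrative detection result.
#[derive(Debug, PartialEq)]
enum Detection {
    Ner(Vec<String>),
    BuiltinOnly(Vec<String>),
}

/// Ask the NER sidecar, but fall back to builtin detection if it
/// does not answer within the timeout (5 s in the text above).
fn detect_with_fallback(text: &str, timeout: Duration) -> Detection {
    let (tx, rx) = mpsc::channel();
    let owned = text.to_string();
    thread::spawn(move || {
        // Placeholder for the sidecar round-trip; a real call would
        // be an HTTP/gRPC request to the GLiNER service.
        let _ = tx.send(fake_sidecar_call(&owned));
    });
    match rx.recv_timeout(timeout) {
        Ok(entities) => Detection::Ner(entities),
        // Timeout or disconnect: every such request pays the full
        // timeout before the builtin-only fallback runs.
        Err(_) => Detection::BuiltinOnly(builtin_detect(text)),
    }
}

// Stubs standing in for the real detectors.
fn fake_sidecar_call(_text: &str) -> Vec<String> { vec!["person".into()] }
fn builtin_detect(_text: &str) -> Vec<String> { vec!["email".into()] }

fn main() {
    let d = detect_with_fallback("mail bob@x.io", Duration::from_secs(5));
    println!("{d:?}");
}
```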
## Horizontal Scaling
| Instances | Builtin Throughput | Builtin + NER Throughput |
|---|---|---|
| 1 | 2,800 ops/s | 45 ops/s |
| 2 | 5,500 ops/s | 88 ops/s |
| 4 | 11,000 ops/s | 170 ops/s |
| 8 | 22,000 ops/s | 340 ops/s |
Scaling is near-linear. Each instance is stateless (sealed session blobs travel with requests). NER scaling requires proportional NER sidecar instances.
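Near-linear scaling makes capacity planning roughly a division problem. A small sketch using the single-instance numbers from the table above (the per-instance figures are the measured values; the helper name is ours):

```rust
/// Instances needed to reach a target throughput, given measured
/// per-instance throughput and assuming near-linear scaling.
fn instances_needed(target_ops: f64, per_instance_ops: f64) -> u32 {
    (target_ops / per_instance_ops).ceil() as u32
}

fn main() {
    // Builtin-only: ~2,800 ops/s per instance (measured above).
    assert_eq!(instances_needed(10_000.0, 2_800.0), 4);
    // Builtin + NER: ~45 ops/s per instance.
    assert_eq!(instances_needed(340.0, 45.0), 8);
    println!("capacity estimates computed");
}
```

In practice scaling is slightly sub-linear (two instances measured 5,500 ops/s, not 5,600), so round up and leave headroom.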
## Memory Footprint
| Component | Memory |
|---|---|
| Base server process | 18 MB |
| Per active request overhead | ~50 KB |
| Per entity in session blob | ~200 bytes |
| Session blob (100 entities) | ~20 KB |
| Session blob (1,000 entities) | ~200 KB |
| Regex pattern cache (15 builtin types) | ~2 MB |
| Peak under 100 concurrent requests | ~35 MB |
Rust has no garbage collector, so there are no GC pauses; memory usage remains stable under sustained load.
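The per-entity figure makes blob growth easy to estimate: roughly 200 bytes per entity, so blob memory scales linearly with entity count. A quick sanity check against the table's numbers (the 200-byte constant comes from the table; the helper name is illustrative, and fixed header overhead is ignored):

```rust
/// Rough session-blob size, using the ~200 bytes/entity figure
/// from the table above (ignores fixed per-blob overhead).
fn blob_size_bytes(entities: usize) -> usize {
    entities * 200
}

fn main() {
    assert_eq!(blob_size_bytes(100), 20_000);    // ~20 KB, as tabled
    assert_eq!(blob_size_bytes(1_000), 200_000); // ~200 KB, as tabled
    println!("blob size estimates match the table");
}
```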
## Methodology
- Tool: wrk2 in constant-throughput mode, with Lua scripts for POST payloads.
- Payloads: Synthetic text with realistic PII density (emails, phones, SSNs, IBANs).
- Warm-up: 10-second warm-up discarded before measurement.
- NER sidecar: Single GLiNER instance on same host, 1 CPU core, 1 GB RAM.
- Repeated: Each benchmark run 3 times; median values reported.
- No sampling: All requests measured, not sampled.
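For reference, a wrk2 invocation matching this methodology might look like the following command fragment (the script path and target URL are placeholders; `-R` sets wrk2's constant request rate):

```shell
# 100 connections, 60 s run, constant 2,800 req/s,
# POST payloads supplied by a Lua script (placeholder path/URL).
wrk -t4 -c100 -d60s -R2800 -s post_transform.lua http://localhost:8080/transform
```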
Numbers will vary with hardware, payload composition, and NER model size. Treat these as baseline expectations for capacity planning, and run the included benchmark suite (`cargo bench -p guardai-perf`) against your own infrastructure for production estimates.