# Incident Triage Playbook

On-call playbook for OGuardAI alerts, with step-by-step response procedures. Identify the alert, then follow the matching section.
## Severity Definitions
| Severity | Response Time | Alerts |
|---|---|---|
| critical | 15 min | PromptSecuritySpike |
| warning | 30 min | HighErrorRate, HighTransformLatency, OutputGuardSpike, RateLimitSaturation |
| info | Next business day | NoTraffic, EntityDetectionSpike |
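The thresholds in this playbook correspond to Prometheus alerting rules. A minimal sketch of the critical rule as a PrometheusRule fragment; the group name and annotation text are assumptions, while the expression and duration come from the trigger definitions below:

```yaml
# Sketch only: group name and annotations are assumptions; expr and "for"
# are taken from the OGuardAIPromptSecuritySpike trigger in this playbook.
groups:
  - name: oguardai-alerts
    rules:
      - alert: OGuardAIPromptSecuritySpike
        expr: rate(guardai_prompt_security_triggers_total[5m]) > 1.0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Prompt security triggers spiking; see the incident triage playbook"
```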
## OGuardAIHighErrorRate

Trigger: `rate(guardai_errors_total[5m]) > 0.1` for 2 min.
- Check logs for the failing endpoint:

  ```bash
  # Kubernetes
  kubectl logs -l app.kubernetes.io/name=guardai -c server --tail=200 | grep ERROR
  # systemd
  journalctl -u oguardai-server --since "10 min ago" --priority=err
  ```

- Check detector sidecar health:

  ```bash
  kubectl exec deploy/guardai -- curl -sf http://localhost:9090/health
  ```

- If the detector is down, restart it:

  ```bash
  kubectl rollout restart deployment/guardai
  ```

- If errors are concentrated on `/v1/rehydrate`, check for expired session keys. See the key-rotation runbook.
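When the error log is noisy, grouping errors by endpoint makes the failing route obvious. A sketch against made-up sample lines; the `path=` field is an assumption about the server's log format, so in practice pipe the `kubectl logs ... | grep ERROR` output into the same aggregation:

```bash
# Sketch: aggregate ERROR lines by endpoint to spot the failing route.
# The sample lines and the path= field are assumptions; adjust the grep
# to match the server's actual log layout.
logs=$(cat <<'EOF'
2024-05-01T10:00:01Z ERROR request failed path=/v1/rehydrate status=500
2024-05-01T10:00:02Z INFO request ok path=/v1/transform status=200
2024-05-01T10:00:03Z ERROR request failed path=/v1/rehydrate status=500
2024-05-01T10:00:04Z ERROR request failed path=/v1/transform status=502
EOF
)
by_endpoint=$(echo "$logs" | grep ERROR | grep -oE 'path=[^ ]+' | sort | uniq -c | sort -rn)
echo "$by_endpoint"
```

The top line of the output is the endpoint producing the most errors.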
## OGuardAIHighTransformLatency

Trigger: `guardai_transform_duration_seconds{quantile="0.95"} > 1.0` for 5 min.
- Check detector sidecar logs:

  ```bash
  kubectl logs -l app.kubernetes.io/name=guardai -c detector --tail=100
  ```

- Check CPU pressure:

  ```bash
  kubectl top pods -l app.kubernetes.io/name=guardai
  ```

- If detector CPU is saturated, scale out:

  ```bash
  kubectl scale deployment/guardai --replicas=4
  ```

- Check Redis latency (if using Redis-backed sessions):

  ```bash
  kubectl exec deploy/guardai-redis -- redis-cli --latency -h localhost
  ```
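Before scaling, it can help to confirm the p95 against raw per-request durations rather than the pre-aggregated quantile metric. A nearest-rank percentile sketch over made-up sample values:

```bash
# Sketch: nearest-rank p95 over per-request durations (seconds).
# The sample values are made up; in practice, feed durations parsed
# from server logs to sanity-check the 1.0 s alert threshold.
durations="0.12 0.30 0.25 0.18 0.22 1.40 0.15 0.28 0.33 0.20"
p95=$(echo "$durations" | tr ' ' '\n' | sort -n | awk '
  { v[NR] = $1 }
  END { idx = int(0.95 * NR); if (0.95 * NR > idx) idx++; print v[idx] }')
echo "p95: ${p95}s"
```

If the recomputed p95 also exceeds 1.0 s, the latency spike is real and not a quantile-reporting artifact.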
## OGuardAIPromptSecuritySpike

Trigger: `rate(guardai_prompt_security_triggers_total[5m]) > 1.0` for 1 min. CRITICAL.
- Review blocked inputs:

  ```bash
  kubectl logs -l app.kubernetes.io/name=guardai -c server --tail=300 | grep 'prompt_security'
  ```

- Find the source tenant:

  ```bash
  kubectl logs -l app.kubernetes.io/name=guardai -c server --tail=500 \
    | grep 'prompt_security' | grep -oP 'tenant=\S+' | sort | uniq -c | sort -rn
  ```

- If this is an active attack: block the source at the ingress/WAF level.
- If it is a false positive from a new integration: adjust the prompt security thresholds.
- Escalate to L3 if the pattern is novel or sustained.
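On sample data, the tenant-count pipeline produces output like this (log lines and tenant names are made up; `grep -oE` is used here instead of `-oP` for portability):

```bash
# Sketch: tenant-count pipeline on made-up prompt_security log lines.
# The tenant= field format is the same one the pipeline above extracts.
logs=$(cat <<'EOF'
WARN prompt_security blocked tenant=acme rule=injection
WARN prompt_security blocked tenant=acme rule=injection
WARN prompt_security blocked tenant=beta rule=jailbreak
EOF
)
by_tenant=$(echo "$logs" | grep 'prompt_security' | grep -oE 'tenant=[^ ]+' | sort | uniq -c | sort -rn)
echo "$by_tenant"
```

A single tenant dominating the counts points at one integration, or an attack from one source.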
## OGuardAIOutputGuardSpike

Trigger: `rate(guardai_output_guard_triggers_total[5m]) > 0.5` for 2 min.
- Check which entity types are being caught:

  ```bash
  kubectl logs -l app.kubernetes.io/name=guardai -c server --tail=200 | grep 'output_guard'
  ```

- Review the active policy:

  ```bash
  kubectl exec deploy/guardai -- cat /app/policies/default/policy.yaml
  ```

- If the LLM is hallucinating PII, check for recent model updates at the provider.
- If a new entity type is not covered, update the policy. See the policy-rollback runbook for safe steps.
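The same count-and-sort pattern works for entity types. A sketch over made-up `output_guard` lines; the `entity=` field name is an assumption about the log format:

```bash
# Sketch: tally which entity types the output guard is catching.
# The entity= log field and sample lines are assumptions; match them
# to the real log format.
logs=$(cat <<'EOF'
WARN output_guard redacted entity=EMAIL tenant=acme
WARN output_guard redacted entity=SSN tenant=acme
WARN output_guard redacted entity=EMAIL tenant=beta
EOF
)
by_entity=$(echo "$logs" | grep 'output_guard' | grep -oE 'entity=[^ ]+' | sort | uniq -c | sort -rn)
echo "$by_entity"
```

Compare the top entity types against the types listed in the active policy before deciding between a provider-side model change and a policy gap.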
## OGuardAIRateLimitSaturation

Trigger: `rate(guardai_rate_limit_rejections_total[5m]) > 10` for 1 min.
- Check which tenants are affected:

  ```bash
  kubectl logs -l app.kubernetes.io/name=guardai -c server --tail=300 | grep 'rate_limit'
  ```

- If one tenant dominates: contact the tenant or raise their limit.
- If all tenants are hitting limits, scale out:

  ```bash
  kubectl scale deployment/guardai --replicas=6
  kubectl patch hpa guardai -p '{"spec":{"maxReplicas":12}}'
  ```
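To decide between the two branches, compute each tenant's share of rejections; one tenant above roughly half the total points to the contact-the-tenant path rather than scaling. The counts here are made up (in practice they come from the `rate_limit` log grep):

```bash
# Sketch: per-tenant share of rate-limit rejections. Counts are made up;
# the "one tenant dominates" cutoff (~50%) is a rule of thumb, not policy.
counts=$(cat <<'EOF'
120 tenant=acme
15 tenant=beta
10 tenant=gamma
EOF
)
shares=$(echo "$counts" | awk '{ n[$2] = $1; total += $1 }
  END { for (t in n) printf "%s %.0f%%\n", t, 100 * n[t] / total }' | sort -k2 -rn)
echo "$shares"
```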
## OGuardAINoTraffic

Trigger: `rate(guardai_transforms_total[5m]) == 0` for 5 min.
- Health check:

  ```bash
  curl -sf http://guardai.internal:3000/v1/health
  ```

- Check pod status:

  ```bash
  kubectl get pods -l app.kubernetes.io/name=guardai
  ```

- Check routing:

  ```bash
  kubectl describe ingress guardai && kubectl get endpoints guardai
  ```

- If the service is healthy but receives no traffic, verify that upstream services have not changed routing.
## OGuardAIEntityDetectionSpike

Trigger: `rate(guardai_entities_detected_total[5m]) > 100` for 2 min.
- Check if this is legitimate high-volume traffic or a scanning attack.
- If suspicious, escalate to L2.
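One rough way to separate legitimate volume from scanning is the ratio of entities detected to transforms performed: bulk business traffic raises both rates together, while PII-probing inputs inflate entities per request. The sample rates and the 5.0 cutoff below are illustrative assumptions, not tuned thresholds:

```bash
# Sketch: entities-per-transform ratio as a rough scanning heuristic.
# Both rates would come from Prometheus; values and cutoff are assumptions.
entity_rate=140    # e.g. rate(guardai_entities_detected_total[5m])
transform_rate=12  # e.g. rate(guardai_transforms_total[5m])
ratio=$(awk -v e="$entity_rate" -v t="$transform_rate" 'BEGIN { printf "%.1f", e / t }')
verdict=$(awk -v r="$ratio" 'BEGIN { print (r > 5.0 ? "suspicious: escalate to L2" : "plausible volume spike") }')
echo "entities per transform: $ratio -> $verdict"
```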
## Escalation Path
| Level | Action | Who |
|---|---|---|
| L1 | Restart pods, check health, follow this runbook | On-call engineer |
| L2 | Root cause analysis, config changes | Senior on-call / team lead |
| L3 | Code-level debugging, hotfix deployment | Engineering team |
Escalate if: the issue persists after L1 remediation, the alert is critical, or the root cause is still unclear after 15 min.