# Incident Triage Playbook

On-call playbook for OGuardAI alerts, with step-by-step response procedures. Identify the alert, then follow the matching section.
## Severity Definitions
| Severity | Response Time | Alerts |
|---|---|---|
| critical | 15 min | PromptSecuritySpike |
| warning | 30 min | HighErrorRate, HighTransformLatency, OutputGuardSpike, RateLimitSaturation |
| info | Next business day | NoTraffic, EntityDetectionSpike |
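The thresholds in this playbook correspond to Prometheus alerting rules. A minimal sketch of the critical rule as a PrometheusRule fragment; the group name and annotation text are assumptions, while the expression and duration come from the trigger definitions below:

```yaml
# Sketch only: group name and annotations are assumptions; expr and "for"
# are taken from the OGuardAIPromptSecuritySpike trigger in this playbook.
groups:
  - name: oguardai-alerts
    rules:
      - alert: OGuardAIPromptSecuritySpike
        expr: rate(guardai_prompt_security_triggers_total[5m]) > 1.0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Prompt security triggers spiking; see the incident triage playbook"
```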
## OGuardAIHighErrorRate

Trigger: `rate(guardai_errors_total[5m]) > 0.1` for 2 min.
- Check logs for the failing endpoint:

  ```bash
  # Kubernetes
  kubectl logs -l app.kubernetes.io/name=guardai -c server --tail=200 | grep ERROR
  # systemd
  journalctl -u oguardai-server --since "10 min ago" --priority=err
  ```

- Check detector sidecar health:

  ```bash
  kubectl exec deploy/guardai -- curl -sf http://localhost:9090/health
  ```

- If the detector is down, restart it:

  ```bash
  kubectl rollout restart deployment/guardai
  ```

- If errors are concentrated on `/v1/rehydrate`, check for expired session keys. See the key-rotation runbook.
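When the error log is noisy, grouping errors by endpoint makes the failing route obvious. A sketch against made-up sample lines; the `path=` field is an assumption about the server's log format, so in practice pipe the `kubectl logs ... | grep ERROR` output into the same aggregation:

```bash
# Sketch: aggregate ERROR lines by endpoint to spot the failing route.
# The sample lines and the path= field are assumptions; adjust the grep
# to match the server's actual log layout.
logs=$(cat <<'EOF'
2024-05-01T10:00:01Z ERROR request failed path=/v1/rehydrate status=500
2024-05-01T10:00:02Z INFO request ok path=/v1/transform status=200
2024-05-01T10:00:03Z ERROR request failed path=/v1/rehydrate status=500
2024-05-01T10:00:04Z ERROR request failed path=/v1/transform status=502
EOF
)
by_endpoint=$(echo "$logs" | grep ERROR | grep -oE 'path=[^ ]+' | sort | uniq -c | sort -rn)
echo "$by_endpoint"
```

The top line of the output is the endpoint producing the most errors.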
## OGuardAIHighTransformLatency

Trigger: `guardai_transform_duration_seconds{quantile="0.95"} > 1.0` for 5 min.
- Check detector sidecar logs:

  ```bash
  kubectl logs -l app.kubernetes.io/name=guardai -c detector --tail=100
  ```

- Check CPU pressure:

  ```bash
  kubectl top pods -l app.kubernetes.io/name=guardai
  ```

- If detector CPU is saturated, scale out:

  ```bash
  kubectl scale deployment/guardai --replicas=4
  ```

- Check Redis latency (if using Redis-backed sessions):

  ```bash
  kubectl exec deploy/guardai-redis -- redis-cli --latency -h localhost
  ```
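Before scaling, it can help to confirm the p95 against raw per-request durations rather than the pre-aggregated quantile metric. A nearest-rank percentile sketch over made-up sample values:

```bash
# Sketch: nearest-rank p95 over per-request durations (seconds).
# The sample values are made up; in practice, feed durations parsed
# from server logs to sanity-check the 1.0 s alert threshold.
durations="0.12 0.30 0.25 0.18 0.22 1.40 0.15 0.28 0.33 0.20"
p95=$(echo "$durations" | tr ' ' '\n' | sort -n | awk '
  { v[NR] = $1 }
  END { idx = int(0.95 * NR); if (0.95 * NR > idx) idx++; print v[idx] }')
echo "p95: ${p95}s"
```

If the recomputed p95 also exceeds 1.0 s, the latency spike is real and not a quantile-reporting artifact.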
## OGuardAIPromptSecuritySpike

Trigger: `rate(guardai_prompt_security_triggers_total[5m]) > 1.0` for 1 min. CRITICAL.
- Review blocked inputs:

  ```bash
  kubectl logs -l app.kubernetes.io/name=guardai -c server --tail=300 | grep 'prompt_security'
  ```

- Find the source tenant:

  ```bash
  kubectl logs -l app.kubernetes.io/name=guardai -c server --tail=500 \
    | grep 'prompt_security' | grep -oP 'tenant=\S+' | sort | uniq -c | sort -rn
  ```

- If this is an active attack: block the source at the ingress/WAF level.
- If it is a false positive from a new integration: adjust the prompt security thresholds.
- Escalate to L3 if the pattern is novel or sustained.
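On sample data, the tenant-count pipeline produces output like this (log lines and tenant names are made up; `grep -oE` is used here instead of `-oP` for portability):

```bash
# Sketch: tenant-count pipeline on made-up prompt_security log lines.
# The tenant= field format is the same one the pipeline above extracts.
logs=$(cat <<'EOF'
WARN prompt_security blocked tenant=acme rule=injection
WARN prompt_security blocked tenant=acme rule=injection
WARN prompt_security blocked tenant=beta rule=jailbreak
EOF
)
by_tenant=$(echo "$logs" | grep 'prompt_security' | grep -oE 'tenant=[^ ]+' | sort | uniq -c | sort -rn)
echo "$by_tenant"
```

A single tenant dominating the counts points at one integration, or an attack from one source.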
## OGuardAIOutputGuardSpike

Trigger: `rate(guardai_output_guard_triggers_total[5m]) > 0.5` for 2 min.
- Check which entity types are being caught:

  ```bash
  kubectl logs -l app.kubernetes.io/name=guardai -c server --tail=200 | grep 'output_guard'
  ```

- Review the active policy:

  ```bash
  kubectl exec deploy/guardai -- cat /app/policies/default/policy.yaml
  ```

- If the LLM is hallucinating PII, check for recent model updates at the provider.
- If a new entity type is not covered, update the policy. See the policy-rollback runbook for safe steps.
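The same count-and-sort pattern works for entity types. A sketch over made-up `output_guard` lines; the `entity=` field name is an assumption about the log format:

```bash
# Sketch: tally which entity types the output guard is catching.
# The entity= log field and sample lines are assumptions; match them
# to the real log format.
logs=$(cat <<'EOF'
WARN output_guard redacted entity=EMAIL tenant=acme
WARN output_guard redacted entity=SSN tenant=acme
WARN output_guard redacted entity=EMAIL tenant=beta
EOF
)
by_entity=$(echo "$logs" | grep 'output_guard' | grep -oE 'entity=[^ ]+' | sort | uniq -c | sort -rn)
echo "$by_entity"
```

Compare the top entity types against the types listed in the active policy before deciding between a provider-side model change and a policy gap.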
## OGuardAIRateLimitSaturation

Trigger: `rate(guardai_rate_limit_rejections_total[5m]) > 10` for 1 min.
- Check which tenants are affected:

  ```bash
  kubectl logs -l app.kubernetes.io/name=guardai -c server --tail=300 | grep 'rate_limit'
  ```

- If one tenant dominates: contact the tenant or raise their limit.
- If all tenants are hitting limits, scale out:

  ```bash
  kubectl scale deployment/guardai --replicas=6
  kubectl patch hpa guardai -p '{"spec":{"maxReplicas":12}}'
  ```
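To decide between the two branches, compute each tenant's share of rejections; one tenant above roughly half the total points to the contact-the-tenant path rather than scaling. The counts here are made up (in practice they come from the `rate_limit` log grep):

```bash
# Sketch: per-tenant share of rate-limit rejections. Counts are made up;
# the "one tenant dominates" cutoff (~50%) is a rule of thumb, not policy.
counts=$(cat <<'EOF'
120 tenant=acme
15 tenant=beta
10 tenant=gamma
EOF
)
shares=$(echo "$counts" | awk '{ n[$2] = $1; total += $1 }
  END { for (t in n) printf "%s %.0f%%\n", t, 100 * n[t] / total }' | sort -k2 -rn)
echo "$shares"
```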
## OGuardAINoTraffic

Trigger: `rate(guardai_transforms_total[5m]) == 0` for 5 min.
- Health check:

  ```bash
  curl -sf http://guardai.internal:3000/v1/health
  ```

- Check pod status:

  ```bash
  kubectl get pods -l app.kubernetes.io/name=guardai
  ```

- Check routing:

  ```bash
  kubectl describe ingress guardai && kubectl get endpoints guardai
  ```

- If the service is healthy but receives no traffic, verify that upstream services have not changed routing.
## OGuardAIEntityDetectionSpike

Trigger: `rate(guardai_entities_detected_total[5m]) > 100` for 2 min.
- Check if this is legitimate high-volume traffic or a scanning attack.
- If suspicious, escalate to L2.
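One rough way to separate legitimate volume from scanning is the ratio of entities detected to transforms performed: bulk business traffic raises both rates together, while PII-probing inputs inflate entities per request. The sample rates and the 5.0 cutoff below are illustrative assumptions, not tuned thresholds:

```bash
# Sketch: entities-per-transform ratio as a rough scanning heuristic.
# Both rates would come from Prometheus; values and cutoff are assumptions.
entity_rate=140    # e.g. rate(guardai_entities_detected_total[5m])
transform_rate=12  # e.g. rate(guardai_transforms_total[5m])
ratio=$(awk -v e="$entity_rate" -v t="$transform_rate" 'BEGIN { printf "%.1f", e / t }')
verdict=$(awk -v r="$ratio" 'BEGIN { print (r > 5.0 ? "suspicious: escalate to L2" : "plausible volume spike") }')
echo "entities per transform: $ratio -> $verdict"
```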
## Escalation Path
| Level | Action | Who |
|---|---|---|
| L1 | Restart pods, check health, follow this runbook | On-call engineer |
| L2 | Root cause analysis, config changes | Senior on-call / team lead |
| L3 | Code-level debugging, hotfix deployment | Engineering team |
Escalate if: the issue persists after L1 remediation, the alert is critical, or the root cause is still unclear after 15 min.