
Incident Triage Playbook

On-call playbook for OGuardAI alerts: identify the firing alert, then follow the matching section below.

Severity Definitions

Severity   Response Time      Alerts
critical   15 min             PromptSecuritySpike
warning    30 min             HighErrorRate, HighTransformLatency, OutputGuardSpike, RateLimitSaturation
info       next business day  NoTraffic, EntityDetectionSpike
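
These severities can be wired into Alertmanager routing so paging matches the response times above. A minimal sketch (the receiver names are assumptions, not the deployed config):

```yaml
route:
  receiver: ticket-queue            # default: info-level, next business day
  routes:
    - matchers: [ severity="critical" ]
      receiver: pagerduty-oncall    # 15-minute response
    - matchers: [ severity="warning" ]
      receiver: slack-oncall        # 30-minute response
```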

OGuardAIHighErrorRate

Trigger: rate(guardai_errors_total[5m]) > 0.1 for 2 min.

  1. Check logs for failing endpoint:
    kubectl logs -l app.kubernetes.io/name=guardai -c server --tail=200 | grep ERROR
    journalctl -u oguardai-server --since "10 min ago" --priority=err   # systemd
  2. Check detector sidecar health:
    kubectl exec deploy/guardai -- curl -sf http://localhost:9090/health
  3. If detector is down, restart: kubectl rollout restart deployment/guardai
  4. If errors are on /v1/rehydrate, check for expired session keys. See key-rotation runbook.
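
The trigger above would correspond to a Prometheus alerting rule along these lines (a sketch; the group name and annotation text are assumptions, not the deployed rule file):

```yaml
groups:
  - name: oguardai.rules            # assumed group name
    rules:
      - alert: OGuardAIHighErrorRate
        expr: rate(guardai_errors_total[5m]) > 0.1
        for: 2m
        labels:
          severity: warning         # matches the severity table above
        annotations:
          summary: "OGuardAI error rate above 0.1/s for 2 minutes"
```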

OGuardAIHighTransformLatency

Trigger: guardai_transform_duration_seconds{quantile="0.95"} > 1.0 for 5 min.

  1. Check detector sidecar logs: kubectl logs -l app.kubernetes.io/name=guardai -c detector --tail=100
  2. Check CPU pressure: kubectl top pods -l app.kubernetes.io/name=guardai
  3. If detector CPU saturated, scale: kubectl scale deployment/guardai --replicas=4
  4. Check Redis latency (if using Redis sessions):
    kubectl exec deploy/guardai-redis -- redis-cli --latency -h localhost

OGuardAIPromptSecuritySpike

Trigger: rate(guardai_prompt_security_triggers_total[5m]) > 1.0 for 1 min. CRITICAL.

  1. Review blocked inputs:
    kubectl logs -l app.kubernetes.io/name=guardai -c server --tail=300 | grep 'prompt_security'
  2. Find the source tenant:
    kubectl logs -l app.kubernetes.io/name=guardai -c server --tail=500 \
      | grep 'prompt_security' | grep -oP 'tenant=\S+' | sort | uniq -c | sort -rn
  3. If active attack: block source at ingress/WAF level.
  4. If false positive from new integration: adjust prompt security thresholds.
  5. Escalate to L3 if pattern is novel or sustained.
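
Step 2's pipeline can be rehearsed offline against sample lines before pointing it at live pods. The log format below is hypothetical, and `grep -oP` requires GNU grep:

```shell
# Hypothetical sample logs in the shape step 2 expects.
logs='level=warn msg=prompt_security_triggered tenant=acme-prod
level=warn msg=prompt_security_triggered tenant=acme-prod
level=warn msg=prompt_security_triggered tenant=beta-dev'

# Same pipeline as step 2: filter events, extract the tenant field,
# then count occurrences per tenant, busiest first.
printf '%s\n' "$logs" \
  | grep 'prompt_security' \
  | grep -oP 'tenant=\S+' \
  | sort | uniq -c | sort -rn
```

The busiest tenant prints first with its count, which is usually enough to decide between the WAF-block path (step 3) and the threshold-tuning path (step 4).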

OGuardAIOutputGuardSpike

Trigger: rate(guardai_output_guard_triggers_total[5m]) > 0.5 for 2 min.

  1. Check which entity types are being caught:
    kubectl logs -l app.kubernetes.io/name=guardai -c server --tail=200 | grep 'output_guard'
  2. Review active policy: kubectl exec deploy/guardai -- cat /app/policies/default/policy.yaml
  3. If LLM is hallucinating PII, check for recent model updates at the provider.
  4. If new entity type not covered, update policy. See policy-rollback runbook for safe steps.

OGuardAIRateLimitSaturation

Trigger: rate(guardai_rate_limit_rejections_total[5m]) > 10 for 1 min.

  1. Check which tenants: kubectl logs -l app.kubernetes.io/name=guardai -c server --tail=300 | grep 'rate_limit'
  2. If one tenant dominates: contact the tenant or raise their limit.
  3. If all tenants hit limits, scale:
    kubectl scale deployment/guardai --replicas=6
    kubectl patch hpa guardai -p '{"spec":{"maxReplicas":12}}'

OGuardAINoTraffic

Trigger: rate(guardai_transforms_total[5m]) == 0 for 5 min.

  1. Health check: curl -sf http://guardai.internal:3000/v1/health
  2. Check pod status: kubectl get pods -l app.kubernetes.io/name=guardai
  3. Check routing: kubectl describe ingress guardai && kubectl get endpoints guardai
  4. If healthy but no traffic, verify upstream services have not changed routing.

OGuardAIEntityDetectionSpike

Trigger: rate(guardai_entities_detected_total[5m]) > 100 for 2 min.

  1. Check if this is legitimate high-volume traffic or a scanning attack.
  2. If suspicious, escalate to L2.
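
A per-tenant breakdown of the spiking metric helps separate the two cases. Assuming the metric carries a `tenant` label (not confirmed by this runbook), a query such as:

```promql
sum by (tenant) (rate(guardai_entities_detected_total[5m]))
```

A single dominant tenant points toward scanning; a broad increase across tenants points toward legitimate volume.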

Escalation Path

Level  Action                                           Who
L1     Restart pods, check health, follow this runbook  On-call engineer
L2     Root cause analysis, config changes              Senior on-call / team lead
L3     Code-level debugging, hotfix deployment          Engineering team

Escalate if: the issue persists after L1 steps, the alert is critical, or the root cause is still unclear after 15 min.