Add wiki page: Operational-Alerts

2026-05-22 12:33:06 +00:00
parent 817679c70b
commit 73be05601e
1 changed files with 227 additions and 0 deletions
--- a/Operational-Alerts.-.md
+++ b/Operational-Alerts.-.md
@@ -0,0 +1,227 @@
 # Operational Alerts
 The operational alerts system monitors the state of your security coverage and
 notifies the team when conditions fall below defined thresholds.
 ---
 ## Alert Rules
 Alert rules define **what to check** and **when to fire**. Each rule has a type,
 severity, configuration thresholds, and notification preferences.
 ### Rule Types
 | Rule Type | What it checks |
 |-----------|---------------|
 | `coverage_drop` | Overall coverage score drops below a threshold |
 | `stale_test` | A test has been in `red_executing` or `blue_evaluating` for too long |
 | `unvalidated_test` | Tests stuck in `in_review` beyond a threshold duration |
 | `high_risk_uncovered` | High-severity techniques have no validated tests |
 | `detection_gap` | Technique has validated attack tests but no detection rule |
 ### Rule Fields
 ```json
 {
  "name": "Coverage below 70%",
  "description": "Alert when overall coverage drops below 70%",
  "rule_type": "coverage_drop",
  "severity": "high",
  "config": {
    "threshold": 70.0,
    "tactic_id": null
  },
  "is_enabled": true,
  "cooldown_hours": 24,
  "notify_in_app": true,
  "notify_webhook": true,
  "webhook_id": "webhook-uuid-or-null"
 }
 ```
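As a rough illustration of the fields above, the following sketch checks a rule payload on the client side before it is submitted. The `validate_rule` helper and its messages are hypothetical; the allowed values come from the Rule Types and Severity Levels tables on this page, and Aegis may enforce constraints that are not described here.

```python
RULE_TYPES = {
    "coverage_drop", "stale_test", "unvalidated_test",
    "high_risk_uncovered", "detection_gap",
}
SEVERITIES = {"info", "low", "medium", "high", "critical"}


def validate_rule(rule: dict) -> list[str]:
    """Return a list of problems found in a rule payload (empty if none).

    Hypothetical client-side check; not part of the Aegis API.
    """
    problems = []
    if rule.get("rule_type") not in RULE_TYPES:
        problems.append(f"unknown rule_type: {rule.get('rule_type')!r}")
    if rule.get("severity") not in SEVERITIES:
        problems.append(f"unknown severity: {rule.get('severity')!r}")
    if not isinstance(rule.get("config"), dict):
        problems.append("config must be an object")
    # Assumption: a rule that notifies via webhook should carry a webhook_id.
    if rule.get("notify_webhook") and not rule.get("webhook_id"):
        problems.append("notify_webhook is true but webhook_id is not set")
    return problems


example_rule = {
    "name": "Coverage below 70%",
    "rule_type": "coverage_drop",
    "severity": "high",
    "config": {"threshold": 70.0, "tactic_id": None},
    "is_enabled": True,
    "cooldown_hours": 24,
    "notify_in_app": True,
    "notify_webhook": True,
    "webhook_id": "webhook-uuid",
}
print(validate_rule(example_rule))  # []
```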
 ### Severity Levels
 | Severity | Use case |
 |----------|---------|
 | `info` | Informational; no immediate action needed |
 | `low` | Worth noting but not urgent |
 | `medium` | Should be addressed in the next sprint |
 | `high` | Requires prompt attention |
 | `critical` | Immediate action required |
 ### Rule Configuration Examples
 **Coverage drop:**
 ```json
 {"threshold": 75.0}
 ```
 Fires when the overall coverage score drops below 75%.
 **Stale test:**
 ```json
 {"stale_days": 7}
 ```
 Fires for any test that has been in `red_executing` or `blue_evaluating` for more than 7 days.
 **High risk uncovered:**
 ```json
 {"min_severity": "high", "max_uncovered": 5}
 ```
 Fires when more than 5 high-severity techniques have no validated test.
 **Detection gap:**
 ```json
 {"require_detection_rule": true}
 ```
 Fires for every validated attack test that has no linked detection rule.
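The threshold semantics described above can be sketched as simple predicates. This is an illustration only: the function names and their inputs (a coverage score, the time a test entered its state, a count of uncovered techniques) are assumptions, not Aegis internals.

```python
from datetime import datetime, timedelta, timezone


def coverage_drop_met(config: dict, current_score: float) -> bool:
    # Fires when the score is strictly below the threshold.
    return current_score < config["threshold"]


def stale_test_met(config: dict, state_entered_at: datetime, now: datetime) -> bool:
    # Fires when the test has been in its state for more than stale_days.
    return now - state_entered_at > timedelta(days=config["stale_days"])


def high_risk_uncovered_met(config: dict, uncovered_count: int) -> bool:
    # Fires when more than max_uncovered techniques lack a validated test.
    return uncovered_count > config["max_uncovered"]


now = datetime(2024, 3, 15, 10, 0, tzinfo=timezone.utc)
print(coverage_drop_met({"threshold": 75.0}, 67.3))                    # True
print(stale_test_met({"stale_days": 7}, now - timedelta(days=8), now))  # True
print(high_risk_uncovered_met({"min_severity": "high", "max_uncovered": 5}, 5))  # False
```

Note that exactly 5 uncovered techniques does not fire the last rule, because the description says "more than 5".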
 ---
 ## Alert Instances
 When a rule's condition is met and the rule is not in cooldown, an alert instance is created.
 ### Instance Lifecycle
 ```
 open ──────────────> acknowledged ──────────────> resolved
  │                                                   │
  └────────────────> dismissed                        │
                         │                            │
                         └── suppressed until         └── final state
                             cooldown resets               (immutable)
 ```
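The transitions drawn in the diagram can be written as a small lookup table. This sketch encodes only the arrows shown; whether Aegis permits other transitions (for example, resolving an `open` alert directly) is not stated on this page, and the `can_transition` helper is hypothetical.

```python
# Transitions taken from the lifecycle diagram; anything else is rejected.
ALLOWED_TRANSITIONS = {
    "open": {"acknowledged", "dismissed"},
    "acknowledged": {"resolved"},
    "resolved": set(),   # final state (immutable)
    "dismissed": set(),  # suppressed until the cooldown resets
}


def can_transition(current: str, target: str) -> bool:
    """Return True if the diagram shows an arrow from current to target."""
    return target in ALLOWED_TRANSITIONS.get(current, set())


print(can_transition("open", "acknowledged"))  # True
print(can_transition("resolved", "open"))      # False
```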
 ### Instance Fields
 ```json
 {
  "id": "uuid",
  "rule_id": "uuid",
  "rule_name": "Coverage below 70%",
  "rule_type": "coverage_drop",
  "severity": "high",
  "status": "open",
  "details": {"current_score": 67.3, "threshold": 70.0},
  "fired_at": "2024-03-15T10:00:00Z",
  "acknowledged_at": null,
  "acknowledged_by": null,
  "resolved_at": null,
  "dismissed_at": null
 }
 ```
 ---
 ## Alert Lifecycle Actions
 ### Acknowledge
 Marks the alert as seen and being investigated. Does NOT suppress re-firing.
 ```http
 POST /api/v1/alerts/{id}/acknowledge
 {"notes": "Investigating coverage drop — two campaigns just completed"}
 ```
 Required role: red_lead, blue_lead, admin
 ### Resolve
 Marks the underlying issue as fixed. Re-evaluation does not create a duplicate
 alert; a new alert fires only after the cooldown expires and the condition is met again.
 ```http
 POST /api/v1/alerts/{id}/resolve
 {"resolution_notes": "Coverage restored to 78% after campaign validation"}
 ```
 Required role: red_lead, blue_lead, admin
 ### Dismiss
 Suppresses the alert for the rule's cooldown period.
 ```http
 POST /api/v1/alerts/{id}/dismiss
 {"reason": "Planned maintenance window — coverage drop expected"}
 ```
 Required role: red_lead, blue_lead, admin
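A minimal sketch of calling the three lifecycle endpoints from a script, using only the Python standard library. The host name, the bearer-token header, and the `build_alert_action` helper are assumptions for illustration; this page does not specify the authentication scheme.

```python
import json
import urllib.request

BASE_URL = "https://aegis.example.com"  # hypothetical host


def build_alert_action(alert_id: str, action: str, body: dict, token: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for an alert lifecycle action."""
    if action not in {"acknowledge", "resolve", "dismiss"}:
        raise ValueError(f"unsupported action: {action}")
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v1/alerts/{alert_id}/{action}",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: bearer-token auth.
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


req = build_alert_action(
    "1234", "dismiss", {"reason": "Planned maintenance window"}, token="TOKEN"
)
print(req.get_method(), req.full_url)
# To send the request: urllib.request.urlopen(req)
```

Building the request separately from sending it keeps the sketch runnable without a live server.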
 ---
 ## Alert Evaluation
 ### Automatic (hourly)
 Aegis runs alert evaluation every hour via APScheduler:
 - Checks all rules with `is_enabled` set to `true`
 - For each rule, evaluates the condition against current data
 - Creates an instance if the condition is met AND the rule is not in cooldown
 - Sends in-app notifications and/or webhook calls per rule configuration
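The evaluation steps above can be sketched as a single pass over the rules. This is not the Aegis implementation: `evaluate_rules`, `in_cooldown`, and the `condition_met` callback are hypothetical, the cooldown is assumed to be measured from the time the rule last fired, and the APScheduler wiring and notification delivery are omitted.

```python
from datetime import datetime, timedelta, timezone


def in_cooldown(rule: dict, last_fired_at, now: datetime) -> bool:
    """True if the rule fired less than cooldown_hours ago."""
    if last_fired_at is None:
        return False
    return now - last_fired_at < timedelta(hours=rule["cooldown_hours"])


def evaluate_rules(rules, condition_met, last_fired, now):
    """Return the names of rules that would create a new alert instance.

    condition_met: callable(rule) -> bool (hypothetical condition check)
    last_fired: dict mapping rule name -> datetime of its last alert, if any
    """
    fired = []
    for rule in rules:
        if not rule["is_enabled"]:
            continue  # disabled rules are never evaluated
        if condition_met(rule) and not in_cooldown(rule, last_fired.get(rule["name"]), now):
            fired.append(rule["name"])
    return fired


now = datetime(2024, 3, 15, 10, 0, tzinfo=timezone.utc)
rules = [
    {"name": "Coverage below 70%", "is_enabled": True, "cooldown_hours": 24},
    {"name": "Stale tests", "is_enabled": True, "cooldown_hours": 24},
    {"name": "Detection gaps", "is_enabled": False, "cooldown_hours": 24},
]
last_fired = {"Stale tests": now - timedelta(hours=3)}
fired = evaluate_rules(rules, lambda rule: True, last_fired, now)
print(fired)  # ['Coverage below 70%']
```

Here the second rule is skipped because it fired 3 hours ago and is still inside its 24-hour cooldown, and the third is skipped because it is disabled.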
 ### Manual trigger
 ```http
 POST /api/v1/alerts/evaluate
 ```
 Required role: red_lead, blue_lead, admin
 Useful when you've made changes and want to check immediately without waiting for the hourly job.
 ---
 ## In-App Notifications
 When `notify_in_app: true` on a rule, an in-app notification is sent to all users
 with role red_lead, blue_lead, or admin.
 View notifications:
 ```http
 GET /api/v1/notifications
 ```
 Mark as read:
 ```http
 PATCH /api/v1/notifications/{id}
 {"is_read": true}
 ```
 ---
 ## Webhook Notifications
 When `notify_webhook: true` and a `webhook_id` is set, Aegis POSTs to the configured
 webhook URL when the alert fires.
 Webhook payload:
 ```json
 {
  "event": "alert.fired",
  "alert_id": "uuid",
  "rule_name": "Coverage below 70%",
  "severity": "high",
  "details": {"current_score": 67.3, "threshold": 70.0},
  "fired_at": "2024-03-15T10:00:00Z"
 }
 ```
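On the receiving side, a webhook consumer might parse and sanity-check the payload before acting on it. The following sketch assumes the payload shape shown above; the `summarize_alert_event` helper and the summary format are hypothetical.

```python
import json

REQUIRED_KEYS = {"event", "alert_id", "rule_name", "severity", "details", "fired_at"}


def summarize_alert_event(raw_body: bytes) -> str:
    """Parse a webhook body and return a one-line summary for a log or chat."""
    payload = json.loads(raw_body)
    missing = REQUIRED_KEYS - set(payload)
    if missing:
        raise ValueError(f"payload is missing keys: {sorted(missing)}")
    if payload["event"] != "alert.fired":
        raise ValueError(f"unexpected event: {payload['event']}")
    return f"[{payload['severity'].upper()}] {payload['rule_name']} fired at {payload['fired_at']}"


body = json.dumps({
    "event": "alert.fired",
    "alert_id": "uuid",
    "rule_name": "Coverage below 70%",
    "severity": "high",
    "details": {"current_score": 67.3, "threshold": 70.0},
    "fired_at": "2024-03-15T10:00:00Z",
}).encode("utf-8")
summary = summarize_alert_event(body)
print(summary)  # [HIGH] Coverage below 70% fired at 2024-03-15T10:00:00Z
```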
 ---
 ## Summary
 ```http
 GET /api/v1/alerts/summary
 ```
 Returns:
 ```json
 {
  "total": 12,
  "by_status": {"open": 5, "acknowledged": 3, "resolved": 3, "dismissed": 1},
  "by_severity": {"critical": 1, "high": 4, "medium": 5, "low": 2, "info": 0},
  "by_type": {
    "coverage_drop": 2,
    "stale_test": 4,
    "unvalidated_test": 3,
    "high_risk_uncovered": 2,
    "detection_gap": 1
  }
 }
 ```
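To show how these counts relate to alert instances, the sketch below aggregates a list of instances into the same shape. It is an illustration rather than the server's logic: the `summarize` helper is hypothetical, and unlike the response above it omits keys whose count is zero.

```python
from collections import Counter


def summarize(instances: list[dict]) -> dict:
    """Count alert instances by status, severity, and rule type."""
    return {
        "total": len(instances),
        "by_status": dict(Counter(i["status"] for i in instances)),
        "by_severity": dict(Counter(i["severity"] for i in instances)),
        "by_type": dict(Counter(i["rule_type"] for i in instances)),
    }


instances = [
    {"status": "open", "severity": "high", "rule_type": "coverage_drop"},
    {"status": "open", "severity": "medium", "rule_type": "stale_test"},
    {"status": "resolved", "severity": "high", "rule_type": "stale_test"},
]
result = summarize(instances)
print(result["total"], result["by_status"])  # 3 {'open': 2, 'resolved': 1}
```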