Add wiki page: Operational-Alerts

2026-05-22 12:33:06 +00:00
parent 817679c70b
commit 73be05601e
1 changed files with 227 additions and 0 deletions
--- a/Operational-Alerts.-.md
+++ b/Operational-Alerts.-.md
@@ -0,0 +1,227 @@
+# Operational Alerts
+
+The operational alerts system monitors the state of your security coverage and
+notifies the team when conditions fall below defined thresholds.
+
+---
+
+## Alert Rules
+
+Alert rules define **what to check** and **when to fire**. Each rule has a type,
+severity, configuration thresholds, and notification preferences.
+
+### Rule Types
+
+| Rule Type | What it checks |
+|-----------|---------------|
+| `coverage_drop` | Overall coverage score drops below a threshold |
+| `stale_test` | A test has been in `red_executing` or `blue_evaluating` for too long |
+| `unvalidated_test` | Tests stuck in `in_review` beyond a threshold duration |
+| `high_risk_uncovered` | High-severity techniques have no validated tests |
+| `detection_gap` | Technique has validated attack tests but no detection rule |
+
+### Rule Fields
+
+```json
+{
+  "name": "Coverage below 70%",
+  "description": "Alert when overall coverage drops below 70%",
+  "rule_type": "coverage_drop",
+  "severity": "high",
+  "config": {
+    "threshold": 70.0,
+    "tactic_id": null
+  },
+  "is_enabled": true,
+  "cooldown_hours": 24,
+  "notify_in_app": true,
+  "notify_webhook": true,
+  "webhook_id": "webhook-uuid-or-null"
+}
+```
+
+### Severity Levels
+
+| Severity | Use case |
+|----------|---------|
+| `info` | Informational; no immediate action needed |
+| `low` | Worth noting but not urgent |
+| `medium` | Should be addressed in the next sprint |
+| `high` | Requires prompt attention |
+| `critical` | Immediate action required |
+
+### Rule Configuration Examples
+
+**Coverage drop:**
+```json
+{"threshold": 75.0}
+```
+Fires when the organization's overall coverage score drops below 75%.
+
+**Stale test:**
+```json
+{"stale_days": 7}
+```
+Fires for any test that has been in `red_executing` or `blue_evaluating` for more than 7 days.
+
+**High risk uncovered:**
+```json
+{"min_severity": "high", "max_uncovered": 5}
+```
+Fires when more than 5 high-severity techniques have no validated test.
+
+**Detection gap:**
+```json
+{"require_detection_rule": true}
+```
+Fires for every validated attack test that has no linked detection rule.
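+
+The configuration examples above can be read as simple predicates over current
+data. The following is a minimal sketch, not Aegis's actual implementation: the
+function names and the input shapes (a score, a number of days in state, a count
+of uncovered techniques) are assumptions made for illustration.
+
+```python
+from typing import Any
+
+
+def coverage_drop_fires(config: dict[str, Any], current_score: float) -> bool:
+    """True when the score is strictly below the configured threshold."""
+    return current_score < config["threshold"]
+
+
+def stale_test_fires(config: dict[str, Any], days_in_state: float) -> bool:
+    """True when a test has been executing/evaluating for more than stale_days."""
+    return days_in_state > config["stale_days"]
+
+
+def high_risk_uncovered_fires(config: dict[str, Any], uncovered_count: int) -> bool:
+    """True when more than max_uncovered techniques lack a validated test."""
+    return uncovered_count > config["max_uncovered"]
+```
+
+Each comparison is strict, matching the wording "below" and "more than" in the
+descriptions above.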
+
+---
+
+## Alert Instances
+
+When a rule's condition is met and the rule is not in cooldown, an alert instance is created.
+
+### Instance Lifecycle
+
+```
+open ──────────────> acknowledged ──────────────> resolved
+  │                                                   │
+  └────────────────> dismissed                        │
+                         │                            │
+                         └── suppressed until         └── final state
+                             cooldown resets               (immutable)
+```
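+
+The diagram can be expressed as a small transition table. This is an
+illustrative sketch only: `ALLOWED_TRANSITIONS` and `transition` are
+hypothetical names, and the table contains just the arrows drawn above. Whether
+an alert may go from `open` directly to `resolved` is not stated here, so the
+sketch leaves that transition out.
+
+```python
+# Only the arrows shown in the lifecycle diagram are listed.
+ALLOWED_TRANSITIONS = {
+    "open": {"acknowledged", "dismissed"},
+    "acknowledged": {"resolved"},
+    "resolved": set(),   # final state (immutable)
+    "dismissed": set(),  # suppressed until cooldown resets
+}
+
+
+def transition(status: str, new_status: str) -> str:
+    """Return new_status if the move is allowed, otherwise raise ValueError."""
+    if new_status not in ALLOWED_TRANSITIONS.get(status, set()):
+        raise ValueError(f"cannot move alert from {status!r} to {new_status!r}")
+    return new_status
+```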
+
+### Instance Fields
+
+```json
+{
+  "id": "uuid",
+  "rule_id": "uuid",
+  "rule_name": "Coverage below 70%",
+  "rule_type": "coverage_drop",
+  "severity": "high",
+  "status": "open",
+  "details": {"current_score": 67.3, "threshold": 70.0},
+  "fired_at": "2024-03-15T10:00:00Z",
+  "acknowledged_at": null,
+  "acknowledged_by": null,
+  "resolved_at": null,
+  "dismissed_at": null
+}
+```
+
+---
+
+## Alert Lifecycle Actions
+
+### Acknowledge
+
+Marks the alert as seen and being investigated. Does NOT suppress re-firing.
+```http
+POST /api/v1/alerts/{id}/acknowledge
+{"notes": "Investigating coverage drop — two campaigns just completed"}
+```
+Required role: `red_lead`, `blue_lead`, or `admin`
+
+### Resolve
+
+Marks the underlying issue as fixed. Re-evaluation does not create a duplicate
+alert until the cooldown has expired and the condition is met again.
+```http
+POST /api/v1/alerts/{id}/resolve
+{"resolution_notes": "Coverage restored to 78% after campaign validation"}
+```
+Required role: `red_lead`, `blue_lead`, or `admin`
+
+### Dismiss
+
+Suppresses the alert for the rule's cooldown period.
+```http
+POST /api/v1/alerts/{id}/dismiss
+{"reason": "Planned maintenance window — coverage drop expected"}
+```
+Required role: `red_lead`, `blue_lead`, or `admin`
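+
+The three lifecycle endpoints share one URL pattern, so a client can build them
+with a single helper. The sketch below uses only the Python standard library.
+The helper name, the host `aegis.example.com` used in the usage note, and
+bearer-token authentication are assumptions; this page does not say how
+requests are authenticated.
+
+```python
+import json
+import urllib.request
+
+ACTIONS = {"acknowledge", "resolve", "dismiss"}
+
+
+def build_alert_action(base_url, alert_id, action, body, token):
+    """Build (but do not send) a POST request for one lifecycle action."""
+    if action not in ACTIONS:
+        raise ValueError(f"unknown action: {action!r}")
+    url = f"{base_url.rstrip('/')}/api/v1/alerts/{alert_id}/{action}"
+    return urllib.request.Request(
+        url,
+        data=json.dumps(body).encode("utf-8"),
+        headers={
+            "Content-Type": "application/json",
+            # Assumption: bearer-token auth; the scheme is not documented here.
+            "Authorization": f"Bearer {token}",
+        },
+        method="POST",
+    )
+```
+
+For example, `build_alert_action("https://aegis.example.com", alert_id, "dismiss",
+{"reason": "Planned maintenance"}, token)` returns a request that can be sent
+with `urllib.request.urlopen`.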
+
+---
+
+## Alert Evaluation
+
+### Automatic (hourly)
+
+Aegis runs alert evaluation every hour via APScheduler:
+- Checks all `is_enabled=true` rules
+- For each rule, evaluates the condition against current data
+- Creates an instance if the condition is met and the rule is not in cooldown
+- Sends in-app notifications and/or webhook calls per rule configuration
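+
+The steps above can be sketched as a single evaluation pass. This is not the
+Aegis code: `evaluate_rules`, `in_cooldown`, the `condition_met` callback, and
+the `last_fired` dictionary are hypothetical, and measuring the cooldown from
+the rule's previous firing time is an assumption. Scheduling and notification
+delivery are left out.
+
+```python
+from datetime import datetime, timedelta, timezone
+from typing import Callable, Optional
+
+
+def in_cooldown(last_fired_at: Optional[datetime], cooldown_hours: float,
+                now: datetime) -> bool:
+    """Assumption: the cooldown window starts when the rule last fired."""
+    if last_fired_at is None:
+        return False
+    return now - last_fired_at < timedelta(hours=cooldown_hours)
+
+
+def evaluate_rules(rules: list[dict], condition_met: Callable[[dict], bool],
+                   last_fired: dict, now: datetime) -> list[dict]:
+    """Return the alert instances created by one evaluation pass."""
+    created = []
+    for rule in rules:
+        if not rule["is_enabled"]:
+            continue  # only enabled rules are checked
+        if not condition_met(rule):
+            continue  # condition not met against current data
+        if in_cooldown(last_fired.get(rule["name"]), rule["cooldown_hours"], now):
+            continue  # condition met, but the rule is still cooling down
+        last_fired[rule["name"]] = now
+        created.append({
+            "rule_name": rule["name"],
+            "severity": rule["severity"],
+            "status": "open",
+            "fired_at": now.isoformat(),
+        })
+    return created
+```
+
+Passing `now` in as an argument (for example `datetime.now(timezone.utc)`)
+keeps the pass deterministic and easy to test.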
+
+### Manual trigger
+
+```http
+POST /api/v1/alerts/evaluate
+```
+Required role: `red_lead`, `blue_lead`, or `admin`
+
+Useful when you've made changes and want to check immediately without waiting for the hourly job.
+
+---
+
+## In-App Notifications
+
+When a rule has `notify_in_app: true`, an in-app notification is sent to all
+users with the `red_lead`, `blue_lead`, or `admin` role.
+
+View notifications:
+```http
+GET /api/v1/notifications
+```
+
+Mark as read:
+```http
+PATCH /api/v1/notifications/{id}
+{"is_read": true}
+```
+
+---
+
+## Webhook Notifications
+
+When `notify_webhook: true` and a `webhook_id` is set, Aegis POSTs to the configured
+webhook URL when the alert fires.
+
+Webhook payload:
+```json
+{
+  "event": "alert.fired",
+  "alert_id": "uuid",
+  "rule_name": "Coverage below 70%",
+  "severity": "high",
+  "details": {"current_score": 67.3, "threshold": 70.0},
+  "fired_at": "2024-03-15T10:00:00Z"
+}
+```
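+
+On the receiving side, a consumer might validate the payload before acting on
+it. The sketch below checks only the keys shown in the example payload.
+`parse_alert_webhook` and `should_page` are hypothetical helpers, and paging on
+`high` and `critical` is an example policy, not something Aegis prescribes.
+
+```python
+import json
+
+REQUIRED_KEYS = {"event", "alert_id", "rule_name", "severity", "details", "fired_at"}
+
+
+def parse_alert_webhook(raw_body: bytes) -> dict:
+    """Decode a webhook body and check it looks like an alert.fired event."""
+    payload = json.loads(raw_body)
+    if not isinstance(payload, dict):
+        raise ValueError("payload must be a JSON object")
+    missing = REQUIRED_KEYS - payload.keys()
+    if missing:
+        raise ValueError(f"missing keys: {sorted(missing)}")
+    if payload["event"] != "alert.fired":
+        raise ValueError(f"unexpected event: {payload['event']!r}")
+    return payload
+
+
+def should_page(payload: dict, page_at=("high", "critical")) -> bool:
+    """Example routing policy: page on-call only for the listed severities."""
+    return payload["severity"] in page_at
+```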
+
+---
+
+## Summary
+
+```http
+GET /api/v1/alerts/summary
+```
+
+Returns alert counts grouped by status, severity, and rule type:
+```json
+{
+  "total": 12,
+  "by_status": {"open": 5, "acknowledged": 3, "resolved": 3, "dismissed": 1},
+  "by_severity": {"critical": 1, "high": 4, "medium": 5, "low": 2, "info": 0},
+  "by_type": {
+    "coverage_drop": 2,
+    "stale_test": 4,
+    "unvalidated_test": 3,
+    "high_risk_uncovered": 2,
+    "detection_gap": 1
+  }
+}
+```
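+
+The response shape can be reproduced from a list of alert instances. This is a
+sketch of the grouping only, not the server's implementation. `summarize` and
+`count_by` are hypothetical names. Pre-seeding every known value with zero is
+an assumption based on the example, where `info` appears with a count of 0.
+
+```python
+from collections import Counter
+
+STATUSES = ("open", "acknowledged", "resolved", "dismissed")
+SEVERITIES = ("critical", "high", "medium", "low", "info")
+RULE_TYPES = ("coverage_drop", "stale_test", "unvalidated_test",
+              "high_risk_uncovered", "detection_gap")
+
+
+def count_by(alerts: list[dict], key: str, known: tuple) -> dict:
+    """Count alerts per value of `key`, keeping zero entries for known values."""
+    counts = {value: 0 for value in known}
+    counts.update(Counter(alert[key] for alert in alerts))
+    return counts
+
+
+def summarize(alerts: list[dict]) -> dict:
+    """Build a summary with the same keys as the example response."""
+    return {
+        "total": len(alerts),
+        "by_status": count_by(alerts, "status", STATUSES),
+        "by_severity": count_by(alerts, "severity", SEVERITIES),
+        "by_type": count_by(alerts, "rule_type", RULE_TYPES),
+    }
+```
+
+Each grouping partitions the same set of alerts, so the counts in every group
+sum to `total`, as they do in the example response (12 in each case).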