# Aegis — Architecture

## High-Level Overview

```
┌────────────────────┐       ┌──────────────────────┐
│   React Frontend   │──────▶│   FastAPI Backend    │
│  (Vite / TS / TW)  │ REST  │    (Python 3.11)     │
└────────────────────┘       └──────┬───────┬───────┘
                                    │       │
                           ┌────────┘       └────────┐
                           ▼                         ▼
                 ┌──────────────────┐       ┌──────────────────┐
                 │    PostgreSQL    │       │      MinIO       │
                 │   (Data Store)   │       │ (Object Storage) │
                 └──────────────────┘       └──────────────────┘
```

- **Frontend** — React 19 + TypeScript + Tailwind CSS v4 + TanStack Query
- **Backend** — FastAPI with SQLAlchemy ORM + Alembic migrations
- **Database** — PostgreSQL 15 with UUID primary keys and JSONB columns
- **Object Storage** — MinIO (S3-compatible) for evidence files
- **Scheduler** — APScheduler (in-process) for background jobs

---

## Database Schema

### Core Tables

| Table | Description |
|-------|-------------|
| `users` | User accounts with role-based access (admin, red_tech, blue_tech, red_lead, blue_lead, viewer) |
| `techniques` | MITRE ATT&CK techniques with coverage status, tactic, and platforms (JSONB) |
| `tests` | Security tests with full Red/Blue workflow fields, dual validation, remediation, and retest chain |
| `test_templates` | Predefined test catalog sourced from Atomic Red Team, Sigma, CALDERA, LOLBAS, and custom entries |
| `evidences` | Evidence files, separated by team (red/blue), with SHA-256 integrity verification |
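
The SHA-256 integrity check on `evidences` can be pictured as re-hashing the stored bytes and comparing the result with a digest recorded at upload time. This is a minimal sketch, assuming the digest is kept as a hex string; `sha256_of_stream` and `verify_evidence` are hypothetical names, not Aegis's actual API.

```python
import hashlib
import hmac
import io
from typing import BinaryIO

CHUNK_SIZE = 64 * 1024  # read in 64 KiB chunks so large evidence files never sit fully in memory


def sha256_of_stream(stream: BinaryIO) -> str:
    """Return the lowercase hex SHA-256 digest of a binary stream."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(CHUNK_SIZE), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify_evidence(stream: BinaryIO, expected_hex: str) -> bool:
    """Hypothetical check: recompute the digest and compare it with the stored one."""
    actual = sha256_of_stream(stream)
    # compare_digest performs a constant-time comparison of the two hex strings
    return hmac.compare_digest(actual, expected_hex.lower())


if __name__ == "__main__":
    data = b"pcap or screenshot bytes"
    stored = hashlib.sha256(data).hexdigest()  # digest recorded at upload (assumed)
    print(verify_evidence(io.BytesIO(data), stored))         # True
    print(verify_evidence(io.BytesIO(data + b"!"), stored))  # False
```

Streaming from a file-like object fits evidence pulled back from MinIO, since the whole object never needs to be buffered.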

### Detection & Defense

| Table | Description |
|-------|-------------|
| `detection_rules` | Imported detection rules (Sigma, Elastic, custom) linked to ATT&CK techniques |
| `test_detection_results` | Per-test detection rule evaluation results (triggered / not triggered) |
| `test_template_detection_rules` | Template ↔ detection rule associations |
| `defensive_techniques` | MITRE D3FEND defensive techniques |
| `defensive_technique_mappings` | ATT&CK technique ↔ D3FEND defensive technique mappings |

### Campaigns & Scheduling

| Table | Description |
|-------|-------------|
| `campaigns` | Test campaign groupings with scheduling, including weekly, monthly, or quarterly recurrence |
| `campaign_tests` | Ordered test assignments within campaigns with dependency support |
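
`campaign_tests` supports dependencies between tests, and `campaign_service` prevents circular ones. One common way to do that, sketched here and not necessarily Aegis's implementation, is to reject a new edge `test → depends_on` whenever `test` is already reachable from `depends_on`. The adjacency-dict representation and the `would_create_cycle` / `add_dependency` names are assumptions.

```python
from collections import deque
from typing import Dict, Hashable, Set

# Hypothetical shape: maps each campaign test ID to the IDs it depends on.
DependencyGraph = Dict[Hashable, Set[Hashable]]


def would_create_cycle(graph: DependencyGraph, test: Hashable, depends_on: Hashable) -> bool:
    """Return True if adding the edge test -> depends_on would close a cycle.

    A cycle appears exactly when `test` is already reachable from `depends_on`
    by following existing dependency edges (or when both are the same node).
    """
    if test == depends_on:
        return True
    seen: Set[Hashable] = {depends_on}
    queue = deque([depends_on])
    while queue:  # breadth-first search over existing dependencies
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == test:
                return True
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def add_dependency(graph: DependencyGraph, test: Hashable, depends_on: Hashable) -> None:
    """Add the edge, or raise ValueError if it would make the campaign order impossible."""
    if would_create_cycle(graph, test, depends_on):
        raise ValueError(f"{test!r} -> {depends_on!r} would create a circular dependency")
    graph.setdefault(test, set()).add(depends_on)
```

Checking before insertion keeps the graph acyclic at all times, so the execution order can always be derived with a topological sort.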

### Intelligence & Actors

| Table | Description |
|-------|-------------|
| `threat_actors` | MITRE CTI intrusion sets with aliases, country, motivation, and targets (JSONB) |
| `threat_actor_techniques` | Threat actor ↔ ATT&CK technique mappings |
| `intel_items` | Threat intelligence items from RSS feeds |

### Compliance

| Table | Description |
|-------|-------------|
| `compliance_frameworks` | Compliance frameworks (e.g., NIST 800-53) |
| `compliance_controls` | Individual controls within a framework |
| `compliance_control_mappings` | Control ↔ ATT&CK technique mappings |

### Operational

| Table | Description |
|-------|-------------|
| `coverage_snapshots` | Point-in-time coverage status captures with aggregate metrics |
| `snapshot_technique_states` | Normalized per-technique state within a snapshot |
| `audit_logs` | System-wide audit trail with JSONB details |
| `notifications` | In-app notifications with read status |
| `data_sources` | External data source configuration and sync status |

### Key Relationships

```
Technique ──1:N── Test ──1:N── Evidence
    │                │
    │                ├── TestDetectionResult ──N:1── DetectionRule
    │                └── CampaignTest ──N:1── Campaign
    │
    ├── ThreatActorTechnique ──N:1── ThreatActor
    ├── DefensiveTechniqueMapping ──N:1── DefensiveTechnique
    ├── ComplianceControlMapping ──N:1── ComplianceControl ──N:1── ComplianceFramework
    └── SnapshotTechniqueState ──N:1── CoverageSnapshot

Test ──retest_of──▶ Test  (self-referential retest chain)
Campaign ──parent_campaign_id──▶ Campaign  (recurring execution history)
```

---

## Backend Architecture

### Layered Structure

```
routers/          ← Thin HTTP adapters (auth, params, response shaping — zero inline ORM)
  ↓
services/         ← Framework-agnostic business logic (46 service modules, ~250 functions)
  ↓
domain/           ← Pure business rules (entities, value objects, ports, errors — zero framework imports)
  ↑
infrastructure/   ← Repository implementations (SQLAlchemy), Redis, mappers
  ↓
models/           ← SQLAlchemy ORM models (persistence mapping only)
  ↓
database.py       ← Engine + session management (lazy initialization)
```

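**Dependency rule:** routers → services → domain ← infrastructure. Every dependency points inward toward `domain/`: infrastructure implements the ports (repository and import protocols) that the domain layer defines, so the domain never imports from outer layers.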

**Transaction management:** Services never call `db.commit()`; routers own the transaction boundary through `UnitOfWork`. Import services and background jobs are the documented exceptions: as self-contained batch operations, they manage their own commits.
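
The transaction rule can be sketched as a context manager that commits on a clean exit and rolls back on any exception. In this minimal sketch the stdlib `sqlite3` connection stands in for the SQLAlchemy session so the example runs anywhere; the real `UnitOfWork` in `domain/unit_of_work.py` may look different, and `rename_technique` is a hypothetical service function that writes but never commits.

```python
import sqlite3
from types import TracebackType
from typing import Optional, Type


class UnitOfWork:
    """Commit on clean exit, roll back if the block raises (illustrative only)."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn

    def __enter__(self) -> sqlite3.Connection:
        return self.conn

    def __exit__(
        self,
        exc_type: Optional[Type[BaseException]],
        exc: Optional[BaseException],
        tb: Optional[TracebackType],
    ) -> bool:
        if exc_type is None:
            self.conn.commit()
        else:
            self.conn.rollback()
        return False  # never swallow the exception; the router maps it to an HTTP error


def rename_technique(conn: sqlite3.Connection, mitre_id: str, name: str) -> None:
    """Hypothetical service function: mutates state but leaves committing to the caller."""
    conn.execute("UPDATE techniques SET name = ? WHERE mitre_id = ?", (name, mitre_id))
```

A router would call `with UnitOfWork(conn) as c: rename_technique(c, "T1059", "...")`, so several service calls inside one block either all persist or none do.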

### Services

#### Business Logic Services

| Service | Responsibility |
|---------|---------------|
| `test_workflow_service` | Test state machine (draft → validated/rejected) with dual validation |
| `test_crud_service` | Test CRUD, query logic, permission validation |
| `scoring_service` | 0–100 scoring for techniques, tactics, actors, organization |
| `scoring_config_service` | DB-persisted scoring weights with validation |
| `score_cache` | In-memory TTL cache (5 min) for expensive score/metric calculations |
| `operational_metrics_service` | MTTD, MTTR, detection efficacy, alert fidelity, coverage velocity |
| `metrics_query_service` | Dashboard aggregation queries |
| `advanced_metrics_service` | Coverage by tactic, never-tested, avg validation time, detection trends |
| `analytics_service` | BI-ready flat datasets (coverage, tests, trends, operators) |
| `snapshot_service` | Coverage snapshot CRUD, temporal comparison, cleanup |
| `campaign_crud_service` | Campaign CRUD, lifecycle, scheduling |
| `campaign_service` | Campaign progress tracking, circular dependency prevention |
| `campaign_scheduler_service` | Recurring campaign execution (clone + schedule next run) |
| `status_service` | Technique status recalculation from test results |
| `coverage_report_service` | Coverage report generation and CSV export |
| `compliance_service` | Compliance framework analysis and gap detection |
| `detection_rule_service` | Detection rule queries, auto-association, evaluation |
| `threat_actor_service` | Threat actor queries, coverage, gap analysis |
| `evidence_service` | Evidence permission validation and queries |
| `heatmap_service` | ATT&CK Navigator layer generation |
| `test_template_service` | Test template CRUD, stats, bulk-activate, filtered queries |
| `auth_service` | Credential validation, password management |
| `user_service` | User CRUD, role validation, password hashing |
| `audit_query_service` | Paginated audit log queries and distinct lookups |
| `audit_service` | Immutable audit trail logging (write-only) |
| `data_source_service` | Data source CRUD, sync dispatch, statistics |
| `notification_service` | In-app notification CRUD, state-change alerts, role-based dispatch |
| `technique_query_service` | Technique detail queries with test/D3FEND aggregation |
| `d3fend_query_service` | D3FEND defensive technique listing and tactic queries |
| `osint_enrichment_service` | OSINT item queries, enrichment, summary statistics |
| `worklog_service` | Worklog CRUD, integrity verification |
| `intel_service` | RSS-based threat intelligence scanning |
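
`score_cache` is described as an in-memory TTL cache with a 5-minute lifetime. A minimal sketch of that idea, assuming entries are keyed by any hashable value and expiry is measured with a monotonic clock; the `TTLCache` class and its methods are hypothetical, and the injectable `clock` exists only to make expiry testable.

```python
import time
from typing import Callable, Dict, Hashable, Tuple, TypeVar, cast

V = TypeVar("V")

DEFAULT_TTL_SECONDS = 300.0  # 5 minutes, matching the documented score_cache lifetime


class TTLCache:
    """Process-local cache whose entries expire a fixed time after they are stored."""

    def __init__(self, ttl: float = DEFAULT_TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl
        self.clock = clock
        self._entries: Dict[Hashable, Tuple[float, object]] = {}  # key -> (expires_at, value)

    def get_or_compute(self, key: Hashable, compute: Callable[[], V]) -> V:
        """Return the cached value if still fresh, otherwise recompute and store it."""
        now = self.clock()
        hit = self._entries.get(key)
        if hit is not None and now < hit[0]:
            return cast(V, hit[1])
        value = compute()
        self._entries[key] = (now + self.ttl, value)
        return value

    def invalidate(self, key: Hashable) -> None:
        """Drop one entry, e.g. after a new test result changes a technique's score."""
        self._entries.pop(key, None)
```

As written this is not thread-safe; if scores are computed from several worker threads, guarding `_entries` with a `threading.Lock` would be the next step. Being process-local, each backend worker also keeps its own copy.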

#### Import Services (all satisfy the `ImportService` protocol)

| Service | Responsibility |
|---------|---------------|
| `mitre_sync_service` | MITRE ATT&CK sync via TAXII 2.0 / GitHub fallback |
| `atomic_import_service` | Atomic Red Team template import from GitHub |
| `sigma_import_service` | SigmaHQ detection rule import |
| `elastic_import_service` | Elastic detection rule import (TOML) |
| `caldera_import_service` | CALDERA ability import |
| `lolbas_import_service` | LOLBAS/GTFOBins template import |
| `d3fend_import_service` | MITRE D3FEND defensive technique import |
| `threat_actor_import_service` | MITRE CTI threat actor import (STIX) |
| `compliance_import_service` | NIST 800-53 ↔ ATT&CK mapping import |
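
Every importer satisfies the `ImportService` protocol, which `domain/ports/import_service.py` pairs with an `IMPORT_REGISTRY`. The sketch below shows how a structural protocol and a registry could fit together. The `source_name` attribute, the `run()` method, the `ImportResult` fields, and the `SigmaImporter` stub are all assumptions, not the actual contract.

```python
from dataclasses import dataclass
from typing import Dict, Protocol, runtime_checkable


@dataclass(frozen=True)
class ImportResult:
    """Hypothetical summary returned by every importer."""
    created: int = 0
    updated: int = 0
    errors: int = 0


@runtime_checkable
class ImportService(Protocol):
    """Structural contract: any class with these members qualifies, no inheritance needed."""
    source_name: str

    def run(self) -> ImportResult: ...


IMPORT_REGISTRY: Dict[str, ImportService] = {}


def register(service: ImportService) -> None:
    """Add an importer to the registry, keyed by its source name."""
    if not isinstance(service, ImportService):  # checks member presence only, not signatures
        raise TypeError(f"{service!r} does not satisfy ImportService")
    IMPORT_REGISTRY[service.source_name] = service


class SigmaImporter:
    """Stand-in for sigma_import_service; note it does not subclass ImportService."""
    source_name = "sigma"

    def run(self) -> ImportResult:
        return ImportResult(created=2)


register(SigmaImporter())
```

A structural protocol lets each import module stay independent of the domain package while a type checker still verifies conformance; the runtime `isinstance` check is only a coarse safety net.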

### Domain Layer

```
domain/
├── entities/              # Rich domain entities with business logic
│   ├── technique.py       # TechniqueEntity with status recalculation
│   ├── campaign.py        # CampaignEntity with lifecycle state machine
│   ├── compliance.py      # ComplianceFrameworkEntity with coverage calculation
│   └── threat_actor.py    # ThreatActorEntity with coverage analysis
├── value_objects/         # Immutable value types
│   ├── mitre_id.py        # MITRE ATT&CK ID validation
│   └── scoring_weights.py # Scoring weights (sum=100, non-negative)
├── ports/                 # Interfaces (Protocol contracts)
│   ├── repositories/      # TechniqueRepository, TestRepository
│   └── import_service.py  # ImportService protocol + IMPORT_REGISTRY
├── errors.py              # Domain exceptions (EntityNotFoundError, etc.)
├── enums.py               # TestState, TechniqueStatus, TestResult
├── test_entity.py         # TestEntity with state machine + domain events
└── unit_of_work.py        # UnitOfWork context manager
```
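
The two value objects can be sketched as frozen dataclasses that validate in `__post_init__`, so an invalid instance can never exist. The regex accepts technique IDs such as `T1059` and sub-technique IDs such as `T1059.001`; whether `mitre_id.py` also accepts tactic IDs such as `TA0002` is not stated, so the pattern is an assumption. Integer weights and the component names in the test data are likewise hypothetical; only the rules (sum = 100, non-negative) come from this document.

```python
import re
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping

# Technique (T1059) or sub-technique (T1059.001) IDs; the exact accepted set is an assumption.
_MITRE_ID_RE = re.compile(r"T\d{4}(?:\.\d{3})?")


@dataclass(frozen=True)
class MitreId:
    value: str

    def __post_init__(self) -> None:
        if not _MITRE_ID_RE.fullmatch(self.value):
            raise ValueError(f"invalid ATT&CK technique ID: {self.value!r}")

    @property
    def parent(self) -> "MitreId":
        """Return the parent technique of a sub-technique (T1059.001 -> T1059)."""
        return MitreId(self.value.split(".", 1)[0])


@dataclass(frozen=True)
class ScoringWeights:
    weights: Mapping[str, int]  # hypothetical component name -> integer weight

    def __post_init__(self) -> None:
        if any(w < 0 for w in self.weights.values()):
            raise ValueError("scoring weights must be non-negative")
        total = sum(self.weights.values())
        if total != 100:
            raise ValueError(f"scoring weights must sum to 100, got {total}")
        # Freeze the mapping too, so callers cannot mutate weights after validation.
        object.__setattr__(self, "weights", MappingProxyType(dict(self.weights)))
```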

### Scheduled Jobs (APScheduler)

| Job | Schedule | Description |
|-----|----------|-------------|
| MITRE Sync | Every 24h | Sync ATT&CK techniques from TAXII/GitHub |
| Intel Scan | Every 7 days | Scan RSS feeds for threat intelligence |
| Notification Cleanup | Every 24h | Remove old read notifications |
| Weekly Snapshot | Sundays 00:00 | Create a coverage snapshot and clean up old ones |
| Recurring Campaigns | Every 24h | Check and execute due recurring campaigns |
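
For the Recurring Campaigns job, the central question is when a campaign is next due after a run. This sketch assumes weekly means +7 days and monthly or quarterly mean +1 or +3 calendar months, with the day clamped to the target month's length; `next_run`, `is_due`, and the recurrence strings are hypothetical, and Aegis may compute this differently.

```python
import calendar
from datetime import datetime, timedelta

MONTHS_PER_STEP = {"monthly": 1, "quarterly": 3}


def add_months(dt: datetime, months: int) -> datetime:
    """Shift by whole calendar months, clamping the day (Jan 31 + 1 month -> Feb 28/29)."""
    month_index = dt.month - 1 + months
    year = dt.year + month_index // 12
    month = month_index % 12 + 1
    day = min(dt.day, calendar.monthrange(year, month)[1])
    return dt.replace(year=year, month=month, day=day)


def next_run(last_run: datetime, recurrence: str) -> datetime:
    """Return the next due time for a recurring campaign (hypothetical helper)."""
    if recurrence == "weekly":
        return last_run + timedelta(days=7)
    if recurrence in MONTHS_PER_STEP:
        return add_months(last_run, MONTHS_PER_STEP[recurrence])
    raise ValueError(f"unknown recurrence: {recurrence!r}")


def is_due(next_run_at: datetime, now: datetime) -> bool:
    """The daily job would clone and execute every campaign for which this is True."""
    return next_run_at <= now
```

Clamping makes dates drift (Jan 31 → Feb 29 → Mar 29); computing each run from the original start date instead of the previous run avoids that.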

---

## Test Lifecycle (State Machine)

```
┌───────┐    ┌───────────────┐    ┌─────────────────┐    ┌───────────┐
│ DRAFT │───▶│ RED_EXECUTING │───▶│ BLUE_EVALUATING │───▶│ IN_REVIEW │
└───────┘    └───────────────┘    └─────────────────┘    └─────┬─────┘
                                                               │
                                          ┌────────────────────┤
                                          ▼                    ▼
                                     ┌──────────┐        ┌───────────┐
                                     │ REJECTED │        │ VALIDATED │
                                     └────┬─────┘        └─────┬─────┘
                                          │                    │
                                          └──▶ Back to DRAFT   └──▶ Remediation ──▶ Auto Re-test
```
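
The diagram translates naturally into a transition table. This is a sketch that assumes the `TestState` enum in `domain/enums.py` uses these member names; the string values, the `ALLOWED` mapping, and the `transition()` helper are illustrative, and the real `TestEntity` also emits domain events, which this omits.

```python
from enum import Enum
from typing import Dict, FrozenSet


class TestState(str, Enum):
    DRAFT = "draft"
    RED_EXECUTING = "red_executing"
    BLUE_EVALUATING = "blue_evaluating"
    IN_REVIEW = "in_review"
    VALIDATED = "validated"
    REJECTED = "rejected"


# Legal moves taken from the diagram; IN_REVIEW exits are decided by the two lead votes.
ALLOWED: Dict[TestState, FrozenSet[TestState]] = {
    TestState.DRAFT: frozenset({TestState.RED_EXECUTING}),
    TestState.RED_EXECUTING: frozenset({TestState.BLUE_EVALUATING}),
    TestState.BLUE_EVALUATING: frozenset({TestState.IN_REVIEW}),
    TestState.IN_REVIEW: frozenset({TestState.VALIDATED, TestState.REJECTED}),
    TestState.REJECTED: frozenset({TestState.DRAFT}),
    TestState.VALIDATED: frozenset(),  # terminal in this sketch; a retest is a new Test linked via retest_of
}


class InvalidTransition(Exception):
    pass


def transition(current: TestState, target: TestState) -> TestState:
    """Return `target` if the move is legal, otherwise raise InvalidTransition."""
    if target not in ALLOWED[current]:
        raise InvalidTransition(f"{current.value} -> {target.value} is not allowed")
    return target
```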

**Dual Validation in IN_REVIEW:**
- Red Lead votes approve/reject
- Blue Lead votes approve/reject
- Both approve → VALIDATED
- Either rejects → REJECTED
- One lead approves while the other is still pending → stays IN_REVIEW
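
The voting rules fit in a small pure function. This is a sketch under the reading that a single rejection is decisive while an approval waits for the other lead; the `Vote` enum and `resolve_review` are hypothetical names, and states are plain strings to keep the block self-contained.

```python
from enum import Enum
from typing import Optional


class Vote(Enum):
    APPROVE = "approve"
    REJECT = "reject"


def resolve_review(red_lead: Optional[Vote], blue_lead: Optional[Vote]) -> str:
    """Map the two lead votes (None = not yet voted) to the resulting test state."""
    if Vote.REJECT in (red_lead, blue_lead):
        return "REJECTED"   # either lead rejecting is enough
    if red_lead is Vote.APPROVE and blue_lead is Vote.APPROVE:
        return "VALIDATED"  # both leads approved
    return "IN_REVIEW"      # at least one vote still pending
```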

**Auto Re-testing:** When remediation is completed on a validated test, the system automatically creates a follow-up retest, linked through the `retest_of` chain (up to `MAX_RETEST_COUNT` = 3).
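
A sketch of the retest limit, assuming each retest stores a `retest_of` reference to the test it re-runs (as in the relationship diagram) and that the limit counts retests along the chain back to the original test. The `SecurityTest` record and the `retest_depth` / `maybe_create_retest` functions are hypothetical, and the caller is assumed to have already checked that the test is VALIDATED.

```python
from dataclasses import dataclass
from typing import Dict, Optional

MAX_RETEST_COUNT = 3


@dataclass
class SecurityTest:
    id: int
    retest_of: Optional[int] = None  # self-referential link to the test being re-run
    remediation_completed: bool = False


def retest_depth(tests: Dict[int, SecurityTest], test_id: int) -> int:
    """Count how many retest_of hops separate a test from the original test."""
    depth = 0
    current = tests[test_id]
    while current.retest_of is not None:
        depth += 1
        current = tests[current.retest_of]
    return depth


def maybe_create_retest(tests: Dict[int, SecurityTest], test_id: int) -> Optional[SecurityTest]:
    """Create a follow-up retest after remediation unless the chain is already at the limit."""
    source = tests[test_id]
    if not source.remediation_completed or retest_depth(tests, test_id) >= MAX_RETEST_COUNT:
        return None
    # Simplistic ID allocation for the sketch; a database would assign the real ID.
    retest = SecurityTest(id=max(tests) + 1, retest_of=test_id)
    tests[retest.id] = retest
    return retest
```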

---

## Frontend Architecture

### Key Technologies

- **React 19** with TypeScript
- **Vite 7** for bundling
- **Tailwind CSS v4** for styling
- **TanStack Query** for server state management
- **TanStack Virtual** for table virtualization
- **React Router v7** for routing
- **Recharts** for charts and visualizations
- **Lucide React** for icons

### Page Lazy Loading

All pages except `LoginPage` and `DashboardPage` are lazy-loaded via `React.lazy()` with `<Suspense>` fallbacks, which keeps the initial bundle small.

### Role-Based Navigation

The sidebar dynamically filters navigation items based on the current user's role:

| Section | Visible to |
|---------|-----------|
| Dashboard | All roles |
| Executive Dashboard | admin, red_lead, blue_lead |
| ATT&CK Matrix | All roles |
| Tests (sub-menu) | All roles |
| Campaigns | All roles |
| Threat Actors | All roles |
| Compliance | All roles |
| Comparison | admin, red_lead, blue_lead |
| Reports | All roles |
| System (admin section) | admin only |

### Performance Optimizations

- **React.memo** on `HeatmapCell` (rendered 3000+ times in the full matrix)
- **useMemo** for expensive calculations and **useCallback** for stable callback references in memoized components
- **useDebounce** hook for search inputs (300ms delay)
- **TanStack Virtual** for large table virtualization (test templates, detection rules, audit logs)
- **Lazy loading** for all non-critical page bundles