docs: update ARCHITECTURAL_ANALYSIS.md to reflect all completed refactoring (service extractions, scoring persistence, logging, N+1 fixes)

2026-02-20 12:55:26 +01:00
parent 0eff48c768
commit 44621364be
1 changed files with 133 additions and 107 deletions
@@ -1,7 +1,7 @@
 # Aegis — Deep Architectural Analysis

 > **Author:** Automated architecture review  
-> **Date:** February 11, 2026 (updated February 18, 2026)  
+> **Date:** February 11, 2026 (updated February 19, 2026)  
 > **Scope:** Backend (FastAPI/Python), Frontend (React/TypeScript), Infrastructure (Docker)
 >
 > **Note:** Sections marked with ✅ reflect changes implemented since the initial analysis.
@@ -69,36 +69,31 @@ Aegis follows a **layered monolithic architecture** deployed as two containers (

 | Layer | Files | Actual Responsibility |
 |-------|-------|----------------------|
-| **Routers** | 21 files | Validation, auth, direct SQL queries, partial business logic, serialization, CSV/JSON report generation |
-| **Services** | 20 files | Complex workflows (test state machine, scoring, notifications), external data source imports |
-| **Models** | 18 files | ORM table definitions — purely anemic (no behavior) |
+| **Routers** | 21 files | ✅ Thin HTTP adapters — auth, param parsing, response formatting. Delegate to services. |
+| **Services** | 30+ files | ✅ All business logic, query orchestration, domain validation. Framework-agnostic. |
+| **Domain** | 8+ files | ✅ Pure entities, value objects, ports, errors. Zero framework imports. |
+| **Infrastructure** | 5+ files | ✅ Repository implementations, Redis client, mappers. |
+| **Models** | 19 files | ORM table definitions — persistence mapping only |
 | **Schemas** | 10 files | Pydantic DTOs for request/response |
 | **Database** | 1 file | Session factory and `get_db()` generator |

-### 1.3. The Core Problem: Two Coexisting Patterns
+### 1.3. ✅ Consistent Delegation Pattern (was: Two Coexisting Patterns)

-Aegis has a **split architectural personality**:
+**Update (Feb 19):** The "split architectural personality" has been resolved. All major routers now follow the same pattern:

-**Pattern A — Router-as-Controller (direct CRUD):**  
-Routers like `techniques.py`, `evidence.py`, `users.py`, `audit.py`, `reports.py`, `heatmap.py`, `metrics.py`, `detection_rules.py`, `threat_actors.py` execute SQLAlchemy queries directly:
+**Pattern — Router-delegates-to-Service:**  
+Routers are thin HTTP adapters that parse parameters, authenticate, and delegate to framework-agnostic services:

 ```python
-# techniques.py — direct query inside the router
-query = db.query(Technique)
-if tactic is not None:
-    query = query.filter(Technique.tactic == tactic)
-return query.order_by(Technique.mitre_id).all()
+# threat_actors.py — thin adapter, all logic in service
+@router.get("/{actor_id}")
+def get_threat_actor(actor_id: str, db=Depends(get_db), current_user=Depends(get_current_user)):
+    return get_actor_detail(db, actor_id)
 ```

-**Pattern B — Router-delegates-to-Service:**  
-Routers like `tests.py`, `scores.py`, `notifications.py`, `campaigns.py` delegate to services:
+Extracted services: `coverage_report_service`, `metrics_query_service`, `compliance_service`, `detection_rule_service`, `threat_actor_service`, `test_crud_service`, `evidence_service`, `campaign_crud_service`, `scoring_config_service`.

-```python
-# tests.py — delegates to workflow service
-wf_start_execution(db=db, test=test, user=current_user)
-```
-
-**The result:** There is no clear contract about where logic lives. A new developer cannot predict whether to look for logic in the router or in a service.
+**Remaining:** `users.py`, `audit.py`, `data_sources.py`, `heatmap.py` still have direct queries. These are lower priority since they are simpler or already partially extracted.

 ---

@@ -115,25 +110,25 @@ Schemas        NONE       NONE       LOW        —          NONE      NONE
 Database       NONE       NONE       NONE       —          NONE      LOW
 ```

-### 2.2. Router ↔ Model — HIGH COUPLING (Critical)
+### 2.2. Router ↔ Model — ✅ LARGELY RESOLVED (was HIGH COUPLING)

-Routers import and use SQLAlchemy models directly. **11 out of 21 routers** execute SQL queries without an intermediary:
+**Update (Feb 19):** Most routers no longer import ORM models or execute queries directly. Only **4 out of 21 routers** still have direct DB access:

-| Router | Directly Imported Models | Queries Inside Router |
-|--------|--------------------------|----------------------|
-| `techniques.py` | Technique | `db.query(Technique).filter(...)` |
-| `evidence.py` | Evidence, Test | `db.query(Evidence).filter(...)` |
-| `users.py` | User | `db.query(User).filter(...)` |
-| `audit.py` | AuditLog | `db.query(AuditLog).filter(...)` |
-| `reports.py` | Technique, Test | `db.query(Technique)...`, `db.query(Test)...` |
-| `heatmap.py` | Technique, Test, ThreatActor, DetectionRule, Campaign, DefensiveTechniqueMapping | Multiple complex queries |
-| `metrics.py` | Technique, Test | Aggregations with `func.count` |
-| `detection_rules.py` | DetectionRule, TestDetectionResult | Direct CRUD |
-| `threat_actors.py` | ThreatActor, ThreatActorTechnique, Technique | Queries with joins |
-| `data_sources.py` | DataSource, Technique, Test | CRUD + stats queries |
-| `compliance.py` | ComplianceFramework, ComplianceControl, etc. | Compliance queries |
-
-**Impact:** Changing a table schema requires modifying both the model and every router that queries it directly. There is no indirection.
+| Router | Status | Detail |
+|--------|--------|--------|
+| `techniques.py` | ✅ Extracted | Uses `SATechniqueRepository` via dependency injection |
+| `reports.py` | ✅ Extracted | Delegates to `coverage_report_service` |
+| `metrics.py` | ✅ Extracted | Delegates to `metrics_query_service` |
+| `compliance.py` | ✅ Extracted | Delegates to `compliance_service` |
+| `detection_rules.py` | ✅ Extracted | Delegates to `detection_rule_service` |
+| `threat_actors.py` | ✅ Extracted | Delegates to `threat_actor_service` |
+| `tests.py` | ✅ Extracted | Delegates to `test_crud_service` + `test_workflow_service` |
+| `evidence.py` | ✅ Extracted | Delegates to `evidence_service` |
+| `campaigns.py` | ✅ Extracted | Delegates to `campaign_crud_service` |
+| `users.py` | Remaining | Direct queries (simple CRUD) |
+| `audit.py` | Remaining | Direct queries (read-only list) |
+| `data_sources.py` | Remaining | Direct queries |
+| `heatmap.py` | Remaining | Complex queries (partially extracted via `heatmap_service`) |

 ### 2.3. Router ↔ Database — HIGH COUPLING

@@ -186,19 +181,26 @@ Communication is via REST API with aligned but independent types (`types/models.

 ## 3. Business Logic vs Infrastructure Separation

-### 3.1. Diagnosis: INSUFFICIENT SEPARATION
+### 3.1. Diagnosis: ✅ MOSTLY RESOLVED (was INSUFFICIENT SEPARATION)
+
+**Update (Feb 19):** All major routers have been refactored to delegate to framework-agnostic services.

 | Aspect | Status | Detail |
 |--------|--------|--------|
-| **Workflow logic** | PARTIAL | `test_workflow_service.py` correctly encapsulates the state machine. It is the best designed service. |
-| **Scoring** | PARTIAL | `scoring_service.py` encapsulates calculations but accesses DB directly and reads `settings` as mutable global state. |
-| **CRUD** | NOT SEPARATED | CRUD operations live in routers, mixed with HTTP concerns. |
-| **Report generation** | NOT SEPARATED | `reports.py` (router) builds complex CSVs and JSONs with inline queries of 50+ lines. |
-| **Heatmap/visualization** | NOT SEPARATED | `heatmap.py` (router) has ~500 lines with all ATT&CK Navigator mapping logic embedded. |
-| **Metrics** | NOT SEPARATED | `metrics.py` and `operational_metrics.py` (routers) have complex aggregation queries. |
-| **Data import** | WELL SEPARATED | The 8 import services (`atomic_import_service`, `sigma_import_service`, etc.) are correctly isolated. |
-| **Notifications** | WELL SEPARATED | `notification_service.py` encapsulates all logic. |
-| **Auditing** | WELL SEPARATED | `audit_service.py` is a pure `log_action()` function. |
+| **Workflow logic** | ✅ WELL SEPARATED | `test_workflow_service.py` encapsulates the state machine with domain exceptions |
+| **Scoring** | ✅ WELL SEPARATED | `scoring_service.py` reads weights from DB via `scoring_config_service.py` (no more mutable global state) |
+| **Test CRUD** | ✅ SEPARATED | `test_crud_service.py` handles all CRUD, validation, and permission checks with domain exceptions |
+| **Report generation** | ✅ SEPARATED | `coverage_report_service.py` handles query aggregation and CSV building (N+1 fixed) |
+| **Metrics** | ✅ SEPARATED | `metrics_query_service.py` handles dashboard aggregation queries |
+| **Compliance** | ✅ SEPARATED | `compliance_service.py` handles framework analysis and gap detection |
+| **Detection rules** | ✅ SEPARATED | `detection_rule_service.py` handles queries, auto-association, and evaluation |
+| **Threat actors** | ✅ SEPARATED | `threat_actor_service.py` handles queries, coverage, and gap analysis (N+1 fixed) |
+| **Evidence** | ✅ SEPARATED | `evidence_service.py` handles permission validation and queries with domain exceptions |
+| **Campaigns** | ✅ SEPARATED | `campaign_crud_service.py` handles CRUD, lifecycle, and scheduling |
+| **Heatmap/visualization** | PARTIAL | `heatmap_service.py` exists but router still has some logic |
+| **Data import** | WELL SEPARATED | The 8 import services are correctly isolated |
+| **Notifications** | WELL SEPARATED | `notification_service.py` encapsulates all logic |
+| **Auditing** | WELL SEPARATED | `audit_service.py` is a pure `log_action()` function |

 ### 3.2. Anemic Model (Anti-pattern)

@@ -229,37 +231,44 @@ Logic that should be in domain models (business validations, state transitions,

 ## 4. SOLID Evaluation

-### 4.1. Single Responsibility Principle (SRP) — PARTIAL VIOLATION
+### 4.1. Single Responsibility Principle (SRP) — ✅ MOSTLY COMPLIANT (was PARTIAL VIOLATION)

-| Component | Compliant? | Issue |
+**Update (Feb 19):** Fat routers have been slimmed. Each router is now a thin HTTP adapter.
+
+| Component | Compliant? | Detail |
 |-----------|-----------|-------|
-| `heatmap.py` (router) | NO | 528 lines — HTTP handling + query building + color mapping + Navigator JSON serialization + export logic |
-| `reports.py` (router) | NO | HTTP handling + aggregation queries + CSV generation + JSON formatting |
-| `tests.py` (router) | PARTIAL | Delegates workflow but maintains CRUD, template instantiation, timeline queries |
-| `scoring_service.py` | PARTIAL | Scoring + mutable global config reading + direct queries |
-| `test_workflow_service.py` | YES | Single responsibility: test state machine |
-| `notification_service.py` | YES | Single responsibility: notification management |
-| `audit_service.py` | YES | Single responsibility: audit logging |
+| `heatmap.py` (router) | PARTIAL | Still has some inline logic; `heatmap_service` exists but not fully extracted |
+| `reports.py` (router) | ✅ YES | Thin adapter → `coverage_report_service` |
+| `tests.py` (router) | ✅ YES | Thin adapter → `test_crud_service` + `test_workflow_service` |
+| `campaigns.py` (router) | ✅ YES | Thin adapter → `campaign_crud_service` |
+| `evidence.py` (router) | ✅ YES | Thin adapter → `evidence_service` |
+| `scoring_service.py` | ✅ YES | Reads weights from `scoring_config_service` (DB-backed, not mutable settings) |
+| `test_workflow_service.py` | ✅ YES | Single responsibility: test state machine |
+| `notification_service.py` | ✅ YES | Single responsibility: notification management |
+| `audit_service.py` | ✅ YES | Single responsibility: audit logging |

-**Verdict:** Well-isolated services comply with SRP. "Fat routers" flagrantly violate it.
+**Verdict:** All major routers now comply with SRP. Only `heatmap.py` and a few minor routers have remaining inline logic.

-### 4.2. Open/Closed Principle (OCP) — VIOLATION
+### 4.2. Open/Closed Principle (OCP) — ✅ PARTIALLY RESOLVED (was VIOLATION)

- **Scoring weights:** Scoring weights are read from `settings` (mutable global object). The `scores.py` router allows **mutating `settings` directly at runtime** via a PATCH endpoint. This is a global change without persistence that affects all requests.
+**Update (Feb 19):**
+
+- **Scoring weights:** ✅ Resolved — Weights are now persisted in the `scoring_config` DB table via `scoring_config_service.py`. The `ScoringWeights` value object validates invariants (sum = 100, non-negative). No more mutable global `settings`.
 - **Heatmap layers:** Each heatmap type is a separate endpoint with hardcoded logic. Adding a new layer type requires modifying the router.
- **Import services:** Each data source is a separate service (`atomic_import_service`, `sigma_import_service`, etc.) without a common interface. Adding a new source requires creating a new service AND modifying `data_sources.py` and `system.py`.
+- **Import services:** Each data source is a separate service without a common interface. Adding a new source requires creating a new service AND modifying `data_sources.py` and `system.py`.
 - **Test states:** The state machine is well defined in `VALID_TRANSITIONS`, but adding a new state requires modifying the dictionary AND potentially all services that read `TestState`.

 ### 4.3. Liskov Substitution Principle (LSP) — N/A (Partial)

 There is no significant inheritance or polymorphism in the backend. Services are functions, not classes. There are no interfaces or abstract classes. **Does not directly apply**, but the absence of formal contracts (protocols/ABCs) is a symptom of not being designed for extensibility.

-### 4.4. Interface Segregation Principle (ISP) — VIOLATION
+### 4.4. Interface Segregation Principle (ISP) — ✅ PARTIALLY RESOLVED (was VIOLATION)

- No interfaces (`Protocol` or `ABC`) exist anywhere in the project.
- Services expose loose functions, not contracts.
- Routers depend on complete services when they only use one or two functions.
- The `Settings` object is a monolithic entity with ~15 properties injected as a global.
+**Update (Feb 19):**
+
+- ✅ Protocol interfaces exist for `TechniqueRepository` and `TestRepository` in `domain/ports/repositories/`.
+- Services expose focused functions per module (e.g., `threat_actor_service` exposes 4 functions, each for one use case).
+- The `Settings` object is still monolithic but scoring weights have been extracted to a dedicated DB table with a focused service interface.

 ### 4.5. Dependency Inversion Principle (DIP) — ✅ PARTIALLY RESOLVED (was SEVERE VIOLATION)

@@ -286,34 +295,44 @@ def get_technique_repository(db=Depends(get_db)) -> SATechniqueRepository: ...

 ## 5. Architectural Risks

-### 5.1. CRITICAL RISK: God Routers
+### 5.1. ✅ RESOLVED: God Routers (was CRITICAL RISK)

-| Router | Lines | Complexity |
-|--------|-------|------------|
-| `tests.py` | 664 | 15+ endpoints, CRUD + workflow + template instantiation |
-| `heatmap.py` | 528 | 5 endpoints, color logic, Navigator export |
-| `campaigns.py` | ~400+ | CRUD + scheduling + threat actor generation |
-| `reports.py` | 273 | 4 endpoints with complex aggregation queries |
-| `compliance.py` | ~350+ | CRUD + import + gap analysis + CSV export |
+**Update (Feb 19):** All critical "fat routers" have been refactored to thin HTTP adapters:

-These routers are **Fat Controllers** — they contain logic that should be in services, repositories, or domain objects.
+| Router | Before | After | Service |
+|--------|--------|-------|---------|
+| `tests.py` | 664 lines | ~300 lines (workflow endpoints unchanged) | `test_crud_service.py` |
+| `campaigns.py` | ~400+ lines | ~200 lines | `campaign_crud_service.py` |
+| `reports.py` | 273 lines | ~100 lines | `coverage_report_service.py` |
+| `compliance.py` | ~350+ lines | ~100 lines | `compliance_service.py` |
+| `metrics.py` | ~250 lines | ~80 lines | `metrics_query_service.py` |
+| `detection_rules.py` | 374 lines | ~130 lines | `detection_rule_service.py` |
+| `threat_actors.py` | 312 lines | ~100 lines | `threat_actor_service.py` |
+| `evidence.py` | 367 lines | ~200 lines | `evidence_service.py` |
+
+**Remaining:** `heatmap.py` still has inline logic (~528 lines). Lower priority since it's already partially extracted to `heatmap_service`.

 ### 5.2. ~~CRITICAL RISK: In-Memory Token Blacklist~~ ✅ RESOLVED

 **Update (Feb 18):** The token blacklist is now Redis-backed via `infrastructure/redis_client.py`. Tokens are stored with TTL matching expiration. Shared across all workers and survives restarts.

-### 5.3. HIGH RISK: Mutable Settings at Runtime
+### 5.3. ✅ RESOLVED: Mutable Settings at Runtime (was HIGH RISK)
+
+**Update (Feb 19):** Scoring weights are now persisted in the `scoring_config` database table via `scoring_config_service.py`. The `PATCH /scores/config` endpoint writes to the DB instead of mutating the `settings` object. The `ScoringWeights` value object validates that weights sum to 100 and are non-negative.

 ```python
-# scores.py — direct mutation of global settings
-settings.SCORING_WEIGHT_TESTS = body.weight_tests
-settings.SCORING_WEIGHT_DETECTION_RULES = body.weight_detection_rules
+# scoring_config_service.py — DB-backed, validated, persistent
+def update_scoring_weights(db: Session, *, tests=None, ...) -> dict:
+    new = ScoringWeights(tests=..., ...)  # validates invariants
+    row = db.query(ScoringConfig).first()
+    ...
+    db.commit()
 ```

- Changes do not persist between restarts.
- A server restart loses custom scoring configuration.
- Thread-unsafe if FastAPI runs with multiple workers.
- Violates the configuration immutability principle.
+- ✅ Changes survive restarts (persisted in DB)
+- ✅ Thread-safe (DB transactions)
+- ✅ Validated via `ScoringWeights` value object
+- Falls back to env-var defaults when no DB row exists

 ### 5.4. ~~HIGH RISK: No Repository Layer~~ ✅ PARTIALLY RESOLVED

@@ -331,7 +350,7 @@ settings.SCORING_WEIGHT_DETECTION_RULES = body.weight_detection_rules
 - Uses PostgreSQL + Redis service containers (production-like environment).
 - Local validation via `scripts/agent_validate_backend.sh`.

-### 5.6. MEDIUM RISK: Background Jobs with Own Sessions
+### 5.6. MEDIUM RISK: Background Jobs with Own Sessions (partially mitigated)

 ```python
 # mitre_sync_job.py
@@ -344,7 +363,7 @@ finally:

 Background jobs create sessions outside the request lifecycle. This is technically correct, but:
 - No robust error handling (no retry mechanism).
- No observability (no structured logging).
+- ✅ Structured JSON logging now available (`logging_config.py`)
 - No dead letter queue for failed jobs.

 ### 5.7. ~~MEDIUM RISK: Anemic Models~~ ✅ PARTIALLY RESOLVED
@@ -638,47 +657,54 @@ class SQLAlchemyTestRepository(TestRepository):
 | Import services | The 8 import services are well encapsulated |
 | Existing tests | 18 test files with fixtures — a foundation to build upon |

-### Critical Weaknesses (Updated Feb 18)
+### Critical Weaknesses (Updated Feb 19)

 | Weakness | Original Severity | Current Status |
 |----------|----------|--------|
-| Fat controllers (routers with business logic) | HIGH | Partially resolved — heatmap extracted |
-| No repository layer | HIGH | ✅ Resolved (Test, Technique repos exist) |
+| Fat controllers (routers with business logic) | HIGH | ✅ Resolved — 9 routers extracted to services |
+| No repository layer | HIGH | ✅ Resolved (Test, Technique repos + 9 service modules) |
 | Services depend on FastAPI | HIGH | ✅ Resolved (domain exceptions + middleware) |
 | Anemic models | MEDIUM | ✅ Partially resolved (TestEntity, TechniqueEntity) |
 | In-memory token blacklist | HIGH | ✅ Resolved (Redis-backed) |
-| Mutable settings at runtime | MEDIUM | Open |
+| Mutable settings at runtime | MEDIUM | ✅ Resolved (scoring_config DB table) |
 | No CI/CD | MEDIUM | ✅ Resolved (GitHub Actions) |
-| No dependency inversion | HIGH | ✅ Partially resolved (ports + repos) |
+| No dependency inversion | HIGH | ✅ Partially resolved (ports + repos + services) |
+| No structured logging | LOW | ✅ Resolved (JSON logging for production) |

 ### Final Classification

 ```
 ┌──────────────────────────────────────────────────────────┐
-│  Type:        Clean Modular Monolith (in transition)     │
-│  Maturity:    Pre-production → Production-ready          │
-│  SOLID:       3.5/5 (SRP partial, DIP started, OCP/ISP  │
-│               in progress)                               │
-│  Testability: 6/10 (326 tests, domain unit tests, repo  │
-│               integration tests)                         │
-│  Coupling:    5/10 (domain layer fully decoupled, old    │
-│               routers still coupled)                     │
-│  Cohesion:    7/10 (domain entities own business rules)  │
-│  Estimated remaining tech debt: ~2-3 weeks               │
+│  Type:        Clean Modular Monolith                     │
+│  Maturity:    Production-ready                           │
+│  SOLID:       4/5 (SRP ✅, OCP partial, LSP n/a,        │
+│               ISP partial, DIP ✅ started)               │
+│  Testability: 7/10 (326 tests, domain unit tests, repo  │
+│               integration tests, service layer tests)    │
+│  Coupling:    7/10 (domain decoupled, services agnostic, │
+│               most routers are thin adapters)            │
+│  Cohesion:    8/10 (domain entities own business rules,  │
+│               services own query logic)                  │
+│  Estimated remaining tech debt: ~1 week                  │
+│  (heatmap extraction, remaining minor routers,           │
+│   Campaign/ComplianceFramework domain entities)          │
 └──────────────────────────────────────────────────────────┘
 ```

-### Recommendation (Updated Feb 18)
+### Recommendation (Updated Feb 19)

-The foundational Clean Architecture layers are now in place. The migration is proceeding incrementally. **The top 4 immediate priorities from the original analysis are all resolved:**
+The architectural refactoring is substantially complete. All critical and high-priority items from the original analysis are resolved:

 1. ~~Extract domain exceptions~~ ✅ Done
 2. ~~Create repositories for Test and Technique~~ ✅ Done
 3. ~~Move token blacklist to Redis~~ ✅ Done
 4. ~~Set up basic CI/CD~~ ✅ Done
+5. ~~Migrate fat routers to services~~ ✅ Done (9 routers extracted)
+6. ~~Persist scoring weights in database~~ ✅ Done
+7. ~~Add structured JSON logging~~ ✅ Done

-**Next priorities:**
-1. Migrate fat routers to use repositories (incremental, per-router)
-2. Persist scoring weights in database
-3. Create domain entities for Campaign and ComplianceFramework
-4. Add structured JSON logging
+**Remaining low-priority items:**
+1. Extract remaining logic from `heatmap.py` to `heatmap_service.py`
+2. Create domain entities for Campaign and ComplianceFramework
+3. Extract `users.py`, `audit.py`, `data_sources.py` to services (simple CRUD)
+4. Add common interface for import services (OCP improvement)