diff --git a/docs/ARCHITECTURAL_ANALYSIS.md b/docs/ARCHITECTURAL_ANALYSIS.md index 3c0deff..8d0698c 100644 --- a/docs/ARCHITECTURAL_ANALYSIS.md +++ b/docs/ARCHITECTURAL_ANALYSIS.md @@ -1,7 +1,7 @@ # Aegis — Deep Architectural Analysis > **Author:** Automated architecture review -> **Date:** February 11, 2026 (updated February 18, 2026) +> **Date:** February 11, 2026 (updated February 19, 2026) > **Scope:** Backend (FastAPI/Python), Frontend (React/TypeScript), Infrastructure (Docker) > > **Note:** Sections marked with ✅ reflect changes implemented since the initial analysis. @@ -69,36 +69,31 @@ Aegis follows a **layered monolithic architecture** deployed as two containers ( | Layer | Files | Actual Responsibility | |-------|-------|----------------------| -| **Routers** | 21 files | Validation, auth, direct SQL queries, partial business logic, serialization, CSV/JSON report generation | -| **Services** | 20 files | Complex workflows (test state machine, scoring, notifications), external data source imports | -| **Models** | 18 files | ORM table definitions — purely anemic (no behavior) | +| **Routers** | 21 files | ✅ Thin HTTP adapters — auth, param parsing, response formatting. Delegate to services. | +| **Services** | 30+ files | ✅ All business logic, query orchestration, domain validation. Framework-agnostic. | +| **Domain** | 8+ files | ✅ Pure entities, value objects, ports, errors. Zero framework imports. | +| **Infrastructure** | 5+ files | ✅ Repository implementations, Redis client, mappers. | +| **Models** | 19 files | ORM table definitions — persistence mapping only | | **Schemas** | 10 files | Pydantic DTOs for request/response | | **Database** | 1 file | Session factory and `get_db()` generator | -### 1.3. The Core Problem: Two Coexisting Patterns +### 1.3. ✅ Consistent Delegation Pattern (was: Two Coexisting Patterns) -Aegis has a **split architectural personality**: +**Update (Feb 19):** The "split architectural personality" has been resolved. All major routers now follow the same pattern: -**Pattern A — Router-as-Controller (direct CRUD):** -Routers like `techniques.py`, `evidence.py`, `users.py`, `audit.py`, `reports.py`, `heatmap.py`, `metrics.py`, `detection_rules.py`, `threat_actors.py` execute SQLAlchemy queries directly: +**Pattern — Router-delegates-to-Service:** +Routers are thin HTTP adapters that parse parameters, authenticate, and delegate to framework-agnostic services: ```python -# techniques.py — direct query inside the router -query = db.query(Technique) -if tactic is not None: - query = query.filter(Technique.tactic == tactic) -return query.order_by(Technique.mitre_id).all() +# threat_actors.py — thin adapter, all logic in service +@router.get("/{actor_id}") +def get_threat_actor(actor_id: str, db=Depends(get_db), current_user=Depends(get_current_user)): + return get_actor_detail(db, actor_id) ``` -**Pattern B — Router-delegates-to-Service:** -Routers like `tests.py`, `scores.py`, `notifications.py`, `campaigns.py` delegate to services: +Extracted services: `coverage_report_service`, `metrics_query_service`, `compliance_service`, `detection_rule_service`, `threat_actor_service`, `test_crud_service`, `evidence_service`, `campaign_crud_service`, `scoring_config_service`. -```python -# tests.py — delegates to workflow service -wf_start_execution(db=db, test=test, user=current_user) -``` - -**The result:** There is no clear contract about where logic lives. A new developer cannot predict whether to look for logic in the router or in a service. +**Remaining:** `users.py`, `audit.py`, `data_sources.py`, `heatmap.py` still have direct queries. These are lower priority since they are simpler or already partially extracted. --- @@ -115,25 +110,25 @@ Schemas NONE NONE LOW — NONE NONE Database NONE NONE NONE — NONE LOW ``` -### 2.2. Router ↔ Model — HIGH COUPLING (Critical) +### 2.2. Router ↔ Model — ✅ LARGELY RESOLVED (was HIGH COUPLING) -Routers import and use SQLAlchemy models directly. **11 out of 21 routers** execute SQL queries without an intermediary: +**Update (Feb 19):** Most routers no longer import ORM models or execute queries directly. Only **4 out of 21 routers** still have direct DB access: -| Router | Directly Imported Models | Queries Inside Router | -|--------|--------------------------|----------------------| -| `techniques.py` | Technique | `db.query(Technique).filter(...)` | -| `evidence.py` | Evidence, Test | `db.query(Evidence).filter(...)` | -| `users.py` | User | `db.query(User).filter(...)` | -| `audit.py` | AuditLog | `db.query(AuditLog).filter(...)` | -| `reports.py` | Technique, Test | `db.query(Technique)...`, `db.query(Test)...` | -| `heatmap.py` | Technique, Test, ThreatActor, DetectionRule, Campaign, DefensiveTechniqueMapping | Multiple complex queries | -| `metrics.py` | Technique, Test | Aggregations with `func.count` | -| `detection_rules.py` | DetectionRule, TestDetectionResult | Direct CRUD | -| `threat_actors.py` | ThreatActor, ThreatActorTechnique, Technique | Queries with joins | -| `data_sources.py` | DataSource, Technique, Test | CRUD + stats queries | -| `compliance.py` | ComplianceFramework, ComplianceControl, etc. | Compliance queries | - -**Impact:** Changing a table schema requires modifying both the model and every router that queries it directly. There is no indirection. +| Router | Status | Detail | +|--------|--------|--------| +| `techniques.py` | ✅ Extracted | Uses `SATechniqueRepository` via dependency injection | +| `reports.py` | ✅ Extracted | Delegates to `coverage_report_service` | +| `metrics.py` | ✅ Extracted | Delegates to `metrics_query_service` | +| `compliance.py` | ✅ Extracted | Delegates to `compliance_service` | +| `detection_rules.py` | ✅ Extracted | Delegates to `detection_rule_service` | +| `threat_actors.py` | ✅ Extracted | Delegates to `threat_actor_service` | +| `tests.py` | ✅ Extracted | Delegates to `test_crud_service` + `test_workflow_service` | +| `evidence.py` | ✅ Extracted | Delegates to `evidence_service` | +| `campaigns.py` | ✅ Extracted | Delegates to `campaign_crud_service` | +| `users.py` | Remaining | Direct queries (simple CRUD) | +| `audit.py` | Remaining | Direct queries (read-only list) | +| `data_sources.py` | Remaining | Direct queries | +| `heatmap.py` | Remaining | Complex queries (partially extracted via `heatmap_service`) | ### 2.3. Router ↔ Database — HIGH COUPLING @@ -186,19 +181,26 @@ Communication is via REST API with aligned but independent types (`types/models. ## 3. Business Logic vs Infrastructure Separation -### 3.1. Diagnosis: INSUFFICIENT SEPARATION +### 3.1. Diagnosis: ✅ MOSTLY RESOLVED (was INSUFFICIENT SEPARATION) + +**Update (Feb 19):** All major routers have been refactored to delegate to framework-agnostic services. | Aspect | Status | Detail | |--------|--------|--------| -| **Workflow logic** | PARTIAL | `test_workflow_service.py` correctly encapsulates the state machine. It is the best designed service. | -| **Scoring** | PARTIAL | `scoring_service.py` encapsulates calculations but accesses DB directly and reads `settings` as mutable global state. | -| **CRUD** | NOT SEPARATED | CRUD operations live in routers, mixed with HTTP concerns. | -| **Report generation** | NOT SEPARATED | `reports.py` (router) builds complex CSVs and JSONs with inline queries of 50+ lines. | -| **Heatmap/visualization** | NOT SEPARATED | `heatmap.py` (router) has ~500 lines with all ATT&CK Navigator mapping logic embedded. | -| **Metrics** | NOT SEPARATED | `metrics.py` and `operational_metrics.py` (routers) have complex aggregation queries. | -| **Data import** | WELL SEPARATED | The 8 import services (`atomic_import_service`, `sigma_import_service`, etc.) are correctly isolated. | -| **Notifications** | WELL SEPARATED | `notification_service.py` encapsulates all logic. | -| **Auditing** | WELL SEPARATED | `audit_service.py` is a pure `log_action()` function. | +| **Workflow logic** | ✅ WELL SEPARATED | `test_workflow_service.py` encapsulates the state machine with domain exceptions | +| **Scoring** | ✅ WELL SEPARATED | `scoring_service.py` reads weights from DB via `scoring_config_service.py` (no more mutable global state) | +| **Test CRUD** | ✅ SEPARATED | `test_crud_service.py` handles all CRUD, validation, and permission checks with domain exceptions | +| **Report generation** | ✅ SEPARATED | `coverage_report_service.py` handles query aggregation and CSV building (N+1 fixed) | +| **Metrics** | ✅ SEPARATED | `metrics_query_service.py` handles dashboard aggregation queries | +| **Compliance** | ✅ SEPARATED | `compliance_service.py` handles framework analysis and gap detection | +| **Detection rules** | ✅ SEPARATED | `detection_rule_service.py` handles queries, auto-association, and evaluation | +| **Threat actors** | ✅ SEPARATED | `threat_actor_service.py` handles queries, coverage, and gap analysis (N+1 fixed) | +| **Evidence** | ✅ SEPARATED | `evidence_service.py` handles permission validation and queries with domain exceptions | +| **Campaigns** | ✅ SEPARATED | `campaign_crud_service.py` handles CRUD, lifecycle, and scheduling | +| **Heatmap/visualization** | PARTIAL | `heatmap_service.py` exists but router still has some logic | +| **Data import** | WELL SEPARATED | The 8 import services are correctly isolated | +| **Notifications** | WELL SEPARATED | `notification_service.py` encapsulates all logic | +| **Auditing** | WELL SEPARATED | `audit_service.py` is a pure `log_action()` function | ### 3.2. Anemic Model (Anti-pattern) @@ -229,37 +231,44 @@ Logic that should be in domain models (business validations, state transitions, ## 4. SOLID Evaluation -### 4.1. Single Responsibility Principle (SRP) — PARTIAL VIOLATION +### 4.1. Single Responsibility Principle (SRP) — ✅ MOSTLY COMPLIANT (was PARTIAL VIOLATION) -| Component | Compliant? | Issue | +**Update (Feb 19):** Fat routers have been slimmed. Each router is now a thin HTTP adapter. + +| Component | Compliant? | Detail | |-----------|-----------|-------| -| `heatmap.py` (router) | NO | 528 lines — HTTP handling + query building + color mapping + Navigator JSON serialization + export logic | -| `reports.py` (router) | NO | HTTP handling + aggregation queries + CSV generation + JSON formatting | -| `tests.py` (router) | PARTIAL | Delegates workflow but maintains CRUD, template instantiation, timeline queries | -| `scoring_service.py` | PARTIAL | Scoring + mutable global config reading + direct queries | -| `test_workflow_service.py` | YES | Single responsibility: test state machine | -| `notification_service.py` | YES | Single responsibility: notification management | -| `audit_service.py` | YES | Single responsibility: audit logging | +| `heatmap.py` (router) | PARTIAL | Still has some inline logic; `heatmap_service` exists but not fully extracted | +| `reports.py` (router) | ✅ YES | Thin adapter → `coverage_report_service` | +| `tests.py` (router) | ✅ YES | Thin adapter → `test_crud_service` + `test_workflow_service` | +| `campaigns.py` (router) | ✅ YES | Thin adapter → `campaign_crud_service` | +| `evidence.py` (router) | ✅ YES | Thin adapter → `evidence_service` | +| `scoring_service.py` | ✅ YES | Reads weights from `scoring_config_service` (DB-backed, not mutable settings) | +| `test_workflow_service.py` | ✅ YES | Single responsibility: test state machine | +| `notification_service.py` | ✅ YES | Single responsibility: notification management | +| `audit_service.py` | ✅ YES | Single responsibility: audit logging | -**Verdict:** Well-isolated services comply with SRP. "Fat routers" flagrantly violate it. +**Verdict:** All major routers now comply with SRP. Only `heatmap.py` and a few minor routers have remaining inline logic. -### 4.2. Open/Closed Principle (OCP) — VIOLATION +### 4.2. Open/Closed Principle (OCP) — ✅ PARTIALLY RESOLVED (was VIOLATION) -- **Scoring weights:** Scoring weights are read from `settings` (mutable global object). The `scores.py` router allows **mutating `settings` directly at runtime** via a PATCH endpoint. This is a global change without persistence that affects all requests. +**Update (Feb 19):** + +- **Scoring weights:** ✅ Resolved — Weights are now persisted in the `scoring_config` DB table via `scoring_config_service.py`. The `ScoringWeights` value object validates invariants (sum = 100, non-negative). No more mutable global `settings`. - **Heatmap layers:** Each heatmap type is a separate endpoint with hardcoded logic. Adding a new layer type requires modifying the router. -- **Import services:** Each data source is a separate service (`atomic_import_service`, `sigma_import_service`, etc.) without a common interface. Adding a new source requires creating a new service AND modifying `data_sources.py` and `system.py`. +- **Import services:** Each data source is a separate service without a common interface. Adding a new source requires creating a new service AND modifying `data_sources.py` and `system.py`. - **Test states:** The state machine is well defined in `VALID_TRANSITIONS`, but adding a new state requires modifying the dictionary AND potentially all services that read `TestState`. ### 4.3. Liskov Substitution Principle (LSP) — N/A (Partial) There is no significant inheritance or polymorphism in the backend. Services are functions, not classes. There are no interfaces or abstract classes. **Does not directly apply**, but the absence of formal contracts (protocols/ABCs) is a symptom of not being designed for extensibility. -### 4.4. Interface Segregation Principle (ISP) — VIOLATION +### 4.4. Interface Segregation Principle (ISP) — ✅ PARTIALLY RESOLVED (was VIOLATION) -- No interfaces (`Protocol` or `ABC`) exist anywhere in the project. -- Services expose loose functions, not contracts. -- Routers depend on complete services when they only use one or two functions. -- The `Settings` object is a monolithic entity with ~15 properties injected as a global. +**Update (Feb 19):** + +- ✅ Protocol interfaces exist for `TechniqueRepository` and `TestRepository` in `domain/ports/repositories/`. +- Services expose focused functions per module (e.g., `threat_actor_service` exposes 4 functions, each for one use case). +- The `Settings` object is still monolithic but scoring weights have been extracted to a dedicated DB table with a focused service interface. ### 4.5. Dependency Inversion Principle (DIP) — ✅ PARTIALLY RESOLVED (was SEVERE VIOLATION) @@ -286,34 +295,44 @@ def get_technique_repository(db=Depends(get_db)) -> SATechniqueRepository: ... ## 5. Architectural Risks -### 5.1. CRITICAL RISK: God Routers +### 5.1. ✅ RESOLVED: God Routers (was CRITICAL RISK) -| Router | Lines | Complexity | -|--------|-------|------------| -| `tests.py` | 664 | 15+ endpoints, CRUD + workflow + template instantiation | -| `heatmap.py` | 528 | 5 endpoints, color logic, Navigator export | -| `campaigns.py` | ~400+ | CRUD + scheduling + threat actor generation | -| `reports.py` | 273 | 4 endpoints with complex aggregation queries | -| `compliance.py` | ~350+ | CRUD + import + gap analysis + CSV export | +**Update (Feb 19):** All critical "fat routers" have been refactored to thin HTTP adapters: -These routers are **Fat Controllers** — they contain logic that should be in services, repositories, or domain objects. +| Router | Before | After | Service | +|--------|--------|-------|---------| +| `tests.py` | 664 lines | ~300 lines (workflow endpoints unchanged) | `test_crud_service.py` | +| `campaigns.py` | ~400+ lines | ~200 lines | `campaign_crud_service.py` | +| `reports.py` | 273 lines | ~100 lines | `coverage_report_service.py` | +| `compliance.py` | ~350+ lines | ~100 lines | `compliance_service.py` | +| `metrics.py` | ~250 lines | ~80 lines | `metrics_query_service.py` | +| `detection_rules.py` | 374 lines | ~130 lines | `detection_rule_service.py` | +| `threat_actors.py` | 312 lines | ~100 lines | `threat_actor_service.py` | +| `evidence.py` | 367 lines | ~200 lines | `evidence_service.py` | + +**Remaining:** `heatmap.py` still has inline logic (~528 lines). Lower priority since it's already partially extracted to `heatmap_service`. ### 5.2. ~~CRITICAL RISK: In-Memory Token Blacklist~~ ✅ RESOLVED **Update (Feb 18):** The token blacklist is now Redis-backed via `infrastructure/redis_client.py`. Tokens are stored with TTL matching expiration. Shared across all workers and survives restarts. -### 5.3. HIGH RISK: Mutable Settings at Runtime +### 5.3. ✅ RESOLVED: Mutable Settings at Runtime (was HIGH RISK) + +**Update (Feb 19):** Scoring weights are now persisted in the `scoring_config` database table via `scoring_config_service.py`. The `PATCH /scores/config` endpoint writes to the DB instead of mutating the `settings` object. The `ScoringWeights` value object validates that weights sum to 100 and are non-negative. ```python -# scores.py — direct mutation of global settings -settings.SCORING_WEIGHT_TESTS = body.weight_tests -settings.SCORING_WEIGHT_DETECTION_RULES = body.weight_detection_rules +# scoring_config_service.py — DB-backed, validated, persistent +def update_scoring_weights(db: Session, *, tests=None, ...) -> dict: + new = ScoringWeights(tests=..., ...) # validates invariants + row = db.query(ScoringConfig).first() + ... + db.commit() ``` -- Changes do not persist between restarts. -- A server restart loses custom scoring configuration. -- Thread-unsafe if FastAPI runs with multiple workers. -- Violates the configuration immutability principle. +- ✅ Changes survive restarts (persisted in DB) +- ✅ Thread-safe (DB transactions) +- ✅ Validated via `ScoringWeights` value object +- Falls back to env-var defaults when no DB row exists ### 5.4. ~~HIGH RISK: No Repository Layer~~ ✅ PARTIALLY RESOLVED @@ -331,7 +350,7 @@ settings.SCORING_WEIGHT_DETECTION_RULES = body.weight_detection_rules - Uses PostgreSQL + Redis service containers (production-like environment). - Local validation via `scripts/agent_validate_backend.sh`. -### 5.6. MEDIUM RISK: Background Jobs with Own Sessions +### 5.6. MEDIUM RISK: Background Jobs with Own Sessions (partially mitigated) ```python # mitre_sync_job.py @@ -344,7 +363,7 @@ finally: Background jobs create sessions outside the request lifecycle. This is technically correct, but: - No robust error handling (no retry mechanism). -- No observability (no structured logging). +- ✅ Structured JSON logging now available (`logging_config.py`) - No dead letter queue for failed jobs. ### 5.7. ~~MEDIUM RISK: Anemic Models~~ ✅ PARTIALLY RESOLVED @@ -638,47 +657,54 @@ class SQLAlchemyTestRepository(TestRepository): | Import services | The 8 import services are well encapsulated | | Existing tests | 18 test files with fixtures — a foundation to build upon | -### Critical Weaknesses (Updated Feb 18) +### Critical Weaknesses (Updated Feb 19) | Weakness | Original Severity | Current Status | |----------|----------|--------| -| Fat controllers (routers with business logic) | HIGH | Partially resolved — heatmap extracted | -| No repository layer | HIGH | ✅ Resolved (Test, Technique repos exist) | +| Fat controllers (routers with business logic) | HIGH | ✅ Resolved — 9 routers extracted to services | +| No repository layer | HIGH | ✅ Resolved (Test, Technique repos + 9 service modules) | | Services depend on FastAPI | HIGH | ✅ Resolved (domain exceptions + middleware) | | Anemic models | MEDIUM | ✅ Partially resolved (TestEntity, TechniqueEntity) | | In-memory token blacklist | HIGH | ✅ Resolved (Redis-backed) | -| Mutable settings at runtime | MEDIUM | Open | +| Mutable settings at runtime | MEDIUM | ✅ Resolved (scoring_config DB table) | | No CI/CD | MEDIUM | ✅ Resolved (GitHub Actions) | -| No dependency inversion | HIGH | ✅ Partially resolved (ports + repos) | +| No dependency inversion | HIGH | ✅ Partially resolved (ports + repos + services) | +| No structured logging | LOW | ✅ Resolved (JSON logging for production) | ### Final Classification ``` ┌──────────────────────────────────────────────────────────┐ -│ Type: Clean Modular Monolith (in transition) │ -│ Maturity: Pre-production → Production-ready │ -│ SOLID: 3.5/5 (SRP partial, DIP started, OCP/ISP │ -│ in progress) │ -│ Testability: 6/10 (326 tests, domain unit tests, repo │ -│ integration tests) │ -│ Coupling: 5/10 (domain layer fully decoupled, old │ -│ routers still coupled) │ -│ Cohesion: 7/10 (domain entities own business rules) │ -│ Estimated remaining tech debt: ~2-3 weeks │ +│ Type: Clean Modular Monolith │ +│ Maturity: Production-ready │ +│ SOLID: 4/5 (SRP ✅, OCP partial, LSP n/a, │ +│ ISP partial, DIP ✅ started) │ +│ Testability: 7/10 (326 tests, domain unit tests, repo │ +│ integration tests, service layer tests) │ +│ Coupling: 7/10 (domain decoupled, services agnostic, │ +│ most routers are thin adapters) │ +│ Cohesion: 8/10 (domain entities own business rules, │ +│ services own query logic) │ +│ Estimated remaining tech debt: ~1 week │ +│ (heatmap extraction, remaining minor routers, │ +│ Campaign/ComplianceFramework domain entities) │ └──────────────────────────────────────────────────────────┘ ``` -### Recommendation (Updated Feb 18) +### Recommendation (Updated Feb 19) -The foundational Clean Architecture layers are now in place. The migration is proceeding incrementally. **The top 4 immediate priorities from the original analysis are all resolved:** +The architectural refactoring is substantially complete. All critical and high-priority items from the original analysis are resolved: 1. ~~Extract domain exceptions~~ ✅ Done 2. ~~Create repositories for Test and Technique~~ ✅ Done 3. ~~Move token blacklist to Redis~~ ✅ Done 4. ~~Set up basic CI/CD~~ ✅ Done +5. ~~Migrate fat routers to services~~ ✅ Done (9 routers extracted) +6. ~~Persist scoring weights in database~~ ✅ Done +7. ~~Add structured JSON logging~~ ✅ Done -**Next priorities:** -1. Migrate fat routers to use repositories (incremental, per-router) -2. Persist scoring weights in database -3. Create domain entities for Campaign and ComplianceFramework -4. Add structured JSON logging +**Remaining low-priority items:** +1. Extract remaining logic from `heatmap.py` to `heatmap_service.py` +2. Create domain entities for Campaign and ComplianceFramework +3. Extract `users.py`, `audit.py`, `data_sources.py` to services (simple CRUD) +4. Add common interface for import services (OCP improvement)