# Aegis — Deep Architectural Analysis
> **Author:** Automated architecture review
> **Date:** February 11, 2026 (updated February 18, 2026)
> **Scope:** Backend (FastAPI/Python), Frontend (React/TypeScript), Infrastructure (Docker)
>
> **Note:** Sections marked with ✅ reflect changes implemented since the initial analysis.
---
## Table of Contents
1. [Current Architecture](#1-current-architecture)
2. [Coupling Analysis](#2-coupling-analysis)
3. [Business Logic vs Infrastructure Separation](#3-business-logic-vs-infrastructure-separation)
4. [SOLID Evaluation](#4-solid-evaluation)
5. [Architectural Risks](#5-architectural-risks)
6. [Refactor Proposal Towards Clean Architecture](#6-refactor-proposal-towards-clean-architecture)
7. [Executive Summary](#7-executive-summary)
---
## 1. Current Architecture
### 1.1. Classification: Layered Monolith with Incomplete Service Layer
Aegis follows a **layered monolithic architecture** deployed as two containers (backend + frontend) with a **partial and inconsistent** level of separation. It is not Clean Architecture, nor Hexagonal, nor microservices.
```
┌─────────────────────────────────────────────────┐
│                    FRONTEND                     │
│          React 19 + TypeScript + Vite           │
│  ┌──────────┐  ┌──────────┐  ┌───────────────┐  │
│  │  Pages   │→ │ API Layer│→ │ Axios Client  │  │
│  │(21 pages)│  │(22 mods) │  │(HttpOnly JWT) │  │
│  └──────────┘  └──────────┘  └───────────────┘  │
└────────────────────────┬────────────────────────┘
                         │ HTTP/REST
┌────────────────────────▼────────────────────────┐
│                     BACKEND                     │
│              FastAPI + SQLAlchemy               │
│                                                 │
│ ┌─────────────────────────────────────────────┐ │
│ │          Router Layer (21 routers)          │ │
│ │   Contains: validation, queries, partial    │ │
│ │   business logic, serialization, auditing   │ │
│ └────────┬─────────────────┬──────────────────┘ │
│          │                 │                    │
│ ┌────────▼───────┐  ┌──────▼──────────────────┐ │
│ │ Service Layer  │  │ Direct DB Access        │ │
│ │ (20 services)  │  │ (SQLAlchemy queries     │ │
│ │ Partial: only  │  │  inside routers)        │ │
│ │ for workflows  │  │                         │ │
│ └────────┬───────┘  └──────┬──────────────────┘ │
│          │                 │                    │
│ ┌────────▼─────────────────▼──────────────────┐ │
│ │           Model Layer (18 models)           │ │
│ │    SQLAlchemy ORM — Anemic Domain Models    │ │
│ └────────────────────┬────────────────────────┘ │
│                      │                          │
│ ┌────────────────────▼────────────────────────┐ │
│ │               Database Layer                │ │
│ │    PostgreSQL + MinIO (evidence storage)    │ │
│ └─────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────┘
```
### 1.2. Actual Distribution of Responsibilities
| Layer | Files | Actual Responsibility |
|-------|-------|----------------------|
| **Routers** | 21 files | Validation, auth, direct SQLAlchemy queries, partial business logic, serialization, CSV/JSON report generation |
| **Services** | 20 files | Complex workflows (test state machine, scoring, notifications), external data source imports |
| **Models** | 18 files | ORM table definitions — purely anemic (no behavior) |
| **Schemas** | 10 files | Pydantic DTOs for request/response |
| **Database** | 1 file | Session factory and `get_db()` generator |
### 1.3. The Core Problem: Two Coexisting Patterns
Aegis has a **split architectural personality**:
**Pattern A — Router-as-Controller (direct CRUD):**
Routers like `techniques.py`, `evidence.py`, `users.py`, `audit.py`, `reports.py`, `heatmap.py`, `metrics.py`, `detection_rules.py`, `threat_actors.py` execute SQLAlchemy queries directly:
```python
# techniques.py — direct query inside the router
query = db.query(Technique)
if tactic is not None:
    query = query.filter(Technique.tactic == tactic)
return query.order_by(Technique.mitre_id).all()
```
**Pattern B — Router-delegates-to-Service:**
Routers like `tests.py`, `scores.py`, `notifications.py`, `campaigns.py` delegate to services:
```python
# tests.py — delegates to workflow service
wf_start_execution(db=db, test=test, user=current_user)
```
**The result:** There is no clear contract about where logic lives. A new developer cannot predict whether to look for logic in the router or in a service.

---
## 2. Coupling Analysis
### 2.1. Coupling Matrix
```
            Routers   Services  Models   Database  Schemas  Config
Routers     —         MEDIUM    HIGH     HIGH      HIGH     LOW
Services    LOW       —         HIGH     HIGH      NONE     MEDIUM
Models      NONE      NONE      —        HIGH      NONE     NONE
Schemas     NONE      NONE      LOW      NONE      —        NONE
Database    NONE      NONE      NONE     —         NONE     LOW
```
### 2.2. Router ↔ Model — HIGH COUPLING (Critical)
Routers import and use SQLAlchemy models directly. **11 out of 21 routers** execute SQLAlchemy queries without an intermediary:
| Router | Directly Imported Models | Queries Inside Router |
|--------|--------------------------|----------------------|
| `techniques.py` | Technique | `db.query(Technique).filter(...)` |
| `evidence.py` | Evidence, Test | `db.query(Evidence).filter(...)` |
| `users.py` | User | `db.query(User).filter(...)` |
| `audit.py` | AuditLog | `db.query(AuditLog).filter(...)` |
| `reports.py` | Technique, Test | `db.query(Technique)...`, `db.query(Test)...` |
| `heatmap.py` | Technique, Test, ThreatActor, DetectionRule, Campaign, DefensiveTechniqueMapping | Multiple complex queries |
| `metrics.py` | Technique, Test | Aggregations with `func.count` |
| `detection_rules.py` | DetectionRule, TestDetectionResult | Direct CRUD |
| `threat_actors.py` | ThreatActor, ThreatActorTechnique, Technique | Queries with joins |
| `data_sources.py` | DataSource, Technique, Test | CRUD + stats queries |
| `compliance.py` | ComplianceFramework, ComplianceControl, etc. | Compliance queries |
**Impact:** Changing a table schema requires modifying both the model and every router that queries it directly. There is no indirection.
### 2.3. Router ↔ Database — HIGH COUPLING
All routers receive `db: Session = Depends(get_db)` and operate with the SQLAlchemy session directly. This means:
- Routers know the ORM (`db.query`, `db.add`, `db.commit`, `joinedload`)
- Routers handle transactions implicitly
- There is no persistence abstraction — migrating from SQLAlchemy to another ORM or raw queries would require rewriting **all** routers
### 2.4. Service ↔ Model/Database — HIGH COUPLING
Services also access SQLAlchemy directly:
```python
# scoring_service.py
all_tests = db.query(Test).filter(Test.technique_id == technique.id).all()
# notification_service.py
notif = db.query(Notification).filter(...).first()
```
Services do not use repositories or abstractions — they are essentially functions that orchestrate queries and logic.
### 2.5. Service ↔ Service — MEDIUM COUPLING
Inter-service coupling exists:
- `test_workflow_service` → `audit_service` + `notification_service`
- `scoring_service` reads from `settings` directly (mutable global config)
- `campaign_scheduler_service` → `campaign_service`
There is no dependency injection between services — everything is direct imports.
### 2.6. Service ↔ Framework — ✅ RESOLVED (was HIGH COUPLING)
~~Domain services import `HTTPException` from FastAPI.~~
**Update (Feb 18):** `test_workflow_service.py` now raises domain exceptions (`InvalidOperationError`, `InvalidStateTransition`) from `app.domain.exceptions`. The `middleware/error_handler.py` maps these to HTTP responses automatically. Services no longer import `HTTPException`.
```python
# Current: domain/exceptions.py exceptions mapped by middleware
raise InvalidStateTransition(current_state=..., target_state=..., entity_type="Test")
# middleware/error_handler.py → 400 Bad Request automatically
```
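The mapping step can be illustrated without any framework. The following is a minimal sketch, not the contents of `middleware/error_handler.py`: the `DomainError` base class, `EntityNotFound`, the 422 fallback, and `to_http_response` are all assumptions made for illustration. Only `InvalidStateTransition` and its 400 mapping come from the text above.

```python
from dataclasses import dataclass


class DomainError(Exception):
    """Hypothetical base class for domain exceptions (no framework imports)."""


class InvalidStateTransition(DomainError):
    def __init__(self, current_state: str, target_state: str, entity_type: str):
        super().__init__(
            f"{entity_type}: cannot go from {current_state} to {target_state}"
        )


class EntityNotFound(DomainError):
    """Hypothetical: raised when a lookup yields nothing."""


# Most specific types first; the first isinstance match wins.
STATUS_BY_EXCEPTION = [
    (EntityNotFound, 404),
    (InvalidStateTransition, 400),
    (DomainError, 422),  # assumed fallback for unmapped domain errors
]


@dataclass(frozen=True)
class HttpError:
    status_code: int
    detail: str


def to_http_response(exc: Exception) -> HttpError:
    """Translate an exception into an HTTP-shaped result."""
    for exc_type, status in STATUS_BY_EXCEPTION:
        if isinstance(exc, exc_type):
            return HttpError(status, str(exc))
    return HttpError(500, "Internal error")  # anything that is not a domain error
```

A real handler would wrap a function like this in the framework's exception hook, so that services only ever raise domain types.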
### 2.7. Frontend ↔ Backend — LOW COUPLING (Correct)
Communication is via REST API with aligned but independent types (`types/models.ts` vs `schemas/*.py`). The frontend uses Axios with interceptors — good decoupling.

---
## 3. Business Logic vs Infrastructure Separation
### 3.1. Diagnosis: INSUFFICIENT SEPARATION
| Aspect | Status | Detail |
|--------|--------|--------|
| **Workflow logic** | PARTIAL | `test_workflow_service.py` correctly encapsulates the state machine. It is the best designed service. |
| **Scoring** | PARTIAL | `scoring_service.py` encapsulates calculations but accesses DB directly and reads `settings` as mutable global state. |
| **CRUD** | NOT SEPARATED | CRUD operations live in routers, mixed with HTTP concerns. |
| **Report generation** | NOT SEPARATED | `reports.py` (router) builds complex CSVs and JSONs with inline queries of 50+ lines. |
| **Heatmap/visualization** | NOT SEPARATED | `heatmap.py` (router) has ~500 lines with all ATT&CK Navigator mapping logic embedded. |
| **Metrics** | NOT SEPARATED | `metrics.py` and `operational_metrics.py` (routers) have complex aggregation queries. |
| **Data import** | WELL SEPARATED | The 8 import services (`atomic_import_service`, `sigma_import_service`, etc.) are correctly isolated. |
| **Notifications** | WELL SEPARATED | `notification_service.py` encapsulates all logic. |
| **Auditing** | WELL SEPARATED | `audit_service.py` is a pure `log_action()` function. |
### 3.2. Anemic Model (Anti-pattern)
SQLAlchemy models are purely declarative — they have no business methods:
```python
# models/test.py — columns only, zero behavior
class Test(Base):
    __tablename__ = "tests"
    id = Column(UUID, primary_key=True)
    state = Column(Enum(TestState))
    # ... more columns
    # Missing: can_transition(), validate(), calculate_score()
```
Logic that should be in domain models (business validations, state transitions, calculations) is scattered across routers and services.
### 3.3. Infrastructure Bleeding Into Logic
| Infrastructure | Where It Appears Inappropriately |
|---------------|--------------------------------|
| `SQLAlchemy Session` | Inside domain services (scoring, workflow, notifications) |
| `FastAPI HTTPException` | Previously inside domain services (`test_workflow_service`); ✅ resolved Feb 18 (see 2.6) |
| `MinIO/boto3` | `storage.py` is well isolated, but called from routers directly |
| `APScheduler` | Directly coupled in `jobs/mitre_sync_job.py` with `SessionLocal()` |
---
## 4. SOLID Evaluation
### 4.1. Single Responsibility Principle (SRP) — PARTIAL VIOLATION
| Component | Compliant? | Issue |
|-----------|-----------|-------|
| `heatmap.py` (router) | NO | 528 lines — HTTP handling + query building + color mapping + Navigator JSON serialization + export logic |
| `reports.py` (router) | NO | HTTP handling + aggregation queries + CSV generation + JSON formatting |
| `tests.py` (router) | PARTIAL | Delegates workflow but maintains CRUD, template instantiation, timeline queries |
| `scoring_service.py` | PARTIAL | Scoring + mutable global config reading + direct queries |
| `test_workflow_service.py` | YES | Single responsibility: test state machine |
| `notification_service.py` | YES | Single responsibility: notification management |
| `audit_service.py` | YES | Single responsibility: audit logging |
**Verdict:** Well-isolated services comply with SRP. "Fat routers" flagrantly violate it.
### 4.2. Open/Closed Principle (OCP) — VIOLATION
- **Scoring weights:** Scoring weights are read from `settings` (mutable global object). The `scores.py` router allows **mutating `settings` directly at runtime** via a PATCH endpoint. This is a global change without persistence that affects all requests.
- **Heatmap layers:** Each heatmap type is a separate endpoint with hardcoded logic. Adding a new layer type requires modifying the router.
- **Import services:** Each data source is a separate service (`atomic_import_service`, `sigma_import_service`, etc.) without a common interface. Adding a new source requires creating a new service AND modifying `data_sources.py` and `system.py`.
- **Test states:** The state machine is well defined in `VALID_TRANSITIONS`, but adding a new state requires modifying the dictionary AND potentially all services that read `TestState`.
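One way to close the import-service gap is a shared contract plus a registry, so that adding a source means registering a new object rather than editing `data_sources.py`. This is a sketch under stated assumptions: `ImportSource`, `register`, `run_import`, and `FakeSigmaImport` are hypothetical names, and the real import services may need a richer signature than `run()`.

```python
from typing import Dict, Protocol


class ImportSource(Protocol):
    """Hypothetical contract every import service would satisfy."""

    name: str

    def run(self) -> int:
        """Import records and return how many were processed."""
        ...


_REGISTRY: Dict[str, ImportSource] = {}


def register(source: ImportSource) -> None:
    """Add a source; callers never need to know the concrete class."""
    if source.name in _REGISTRY:
        raise ValueError(f"duplicate import source: {source.name}")
    _REGISTRY[source.name] = source


def run_import(name: str) -> int:
    """Dispatch by name instead of branching on source type in a router."""
    try:
        source = _REGISTRY[name]
    except KeyError:
        raise LookupError(f"unknown import source: {name}") from None
    return source.run()


class FakeSigmaImport:
    """Stand-in for a real service such as sigma_import_service."""

    name = "sigma"

    def run(self) -> int:
        return 3  # pretend three rules were imported


register(FakeSigmaImport())
```

With this shape, the router only calls `run_import(name)`, and a new data source is closed for modification of existing code.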
### 4.3. Liskov Substitution Principle (LSP) — N/A (Partial)
There is no significant inheritance or polymorphism in the backend. Services are functions, not classes. There are no interfaces or abstract classes. **Does not directly apply**, but the absence of formal contracts (protocols/ABCs) is a symptom of not being designed for extensibility.
### 4.4. Interface Segregation Principle (ISP) — VIOLATION
- At the time of the initial analysis, no interfaces (`Protocol` or `ABC`) existed anywhere in the project; repository `Protocol`s have since been added (see 4.5).
- Services expose loose functions, not contracts.
- Routers depend on complete services when they only use one or two functions.
- The `Settings` object is a monolithic entity with ~15 properties injected as a global.
### 4.5. Dependency Inversion Principle (DIP) — ✅ PARTIALLY RESOLVED (was SEVERE VIOLATION)
**Update (Feb 18):** Protocol interfaces and abstractions now exist:
```python
# domain/ports/repositories/ — Protocol interfaces
class TechniqueRepository(Protocol):
    def find_by_id(self, technique_id: UUID) -> TechniqueEntity | None: ...
    def save(self, technique: TechniqueEntity) -> TechniqueEntity: ...
# dependencies/repositories.py — FastAPI Depends() wiring
def get_technique_repository(db=Depends(get_db)) -> SATechniqueRepository: ...
```
- **Domain layer** has zero framework imports (no FastAPI, no SQLAlchemy).
- **Repository ports** define contracts; infrastructure implements them.
- `test_workflow_service.py` now uses domain exceptions instead of `HTTPException`.
- `UnitOfWork` manages transactions.
**Remaining:** Some services still use direct imports for `audit_service`, `notification_service`. Full DIP adoption is incremental.

---
## 5. Architectural Risks
### 5.1. CRITICAL RISK: God Routers
| Router | Lines | Complexity |
|--------|-------|------------|
| `tests.py` | 664 | 15+ endpoints, CRUD + workflow + template instantiation |
| `heatmap.py` | 528 | 5 endpoints, color logic, Navigator export |
| `campaigns.py` | ~400+ | CRUD + scheduling + threat actor generation |
| `reports.py` | 273 | 4 endpoints with complex aggregation queries |
| `compliance.py` | ~350+ | CRUD + import + gap analysis + CSV export |
These routers are **Fat Controllers** — they contain logic that should be in services, repositories, or domain objects.
### 5.2. ~~CRITICAL RISK: In-Memory Token Blacklist~~ ✅ RESOLVED
**Update (Feb 18):** The token blacklist is now Redis-backed via `infrastructure/redis_client.py`. Tokens are stored with TTL matching expiration. Shared across all workers and survives restarts.
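The "TTL matching expiration" rule can be sketched with an in-memory stand-in. This is not the code in `infrastructure/redis_client.py`: `blacklist_ttl_seconds`, `InMemoryBlacklist`, and the use of a `jti` string as the key are assumptions. A Redis-backed version would hand the computed TTL to the store instead of tracking expiry itself.

```python
import time
from typing import Dict, Optional


def blacklist_ttl_seconds(token_exp: float, now: Optional[float] = None) -> int:
    """Seconds until the token expires; 0 means it is already expired."""
    current = time.time() if now is None else now
    return max(0, int(token_exp - current))


class InMemoryBlacklist:
    """Stand-in with the same expiry semantics as a TTL-keyed store."""

    def __init__(self) -> None:
        self._expires_at: Dict[str, float] = {}

    def revoke(self, jti: str, token_exp: float, now: float) -> None:
        # An expired token is rejected by signature checks anyway, so skip it.
        if blacklist_ttl_seconds(token_exp, now) > 0:
            self._expires_at[jti] = token_exp

    def is_revoked(self, jti: str, now: float) -> bool:
        expires_at = self._expires_at.get(jti)
        if expires_at is None:
            return False
        if now >= expires_at:
            del self._expires_at[jti]  # entry outlived the token; drop it
            return False
        return True
```

Tying the entry lifetime to the token lifetime keeps the blacklist from growing without bound.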
### 5.3. HIGH RISK: Mutable Settings at Runtime
```python
# scores.py — direct mutation of global settings
settings.SCORING_WEIGHT_TESTS = body.weight_tests
settings.SCORING_WEIGHT_DETECTION_RULES = body.weight_detection_rules
```
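- Changes live only in process memory, so a server restart silently discards any custom scoring configuration.
- With multiple worker processes, only the worker that served the PATCH sees the new weights, so scores can differ from one request to the next; within a single process the mutation is also unsynchronized.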
- Violates the configuration immutability principle.
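An immutable value object removes the in-place mutation: an update builds a new validated instance, which can then be persisted. The sketch below is illustrative only. The two fields mirror the `SCORING_WEIGHT_TESTS` and `SCORING_WEIGHT_DETECTION_RULES` settings shown above, while the range check and the sum-to-one rule are assumptions, and the project's actual `ScoringWeights` value object may differ.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ScoringWeights:
    tests: float
    detection_rules: float

    def __post_init__(self) -> None:
        for name in ("tests", "detection_rules"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be within [0, 1], got {value}")
        # Assumed business rule: the weights form a convex combination.
        if abs(self.tests + self.detection_rules - 1.0) > 1e-9:
            raise ValueError("weights must sum to 1.0")


current = ScoringWeights(tests=0.6, detection_rules=0.4)
# replace() builds a new instance and re-runs validation; `current` is untouched.
updated = replace(current, tests=0.7, detection_rules=0.3)
```

A PATCH endpoint would then store `updated` in the database and never assign to attributes of a shared object.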
### 5.4. ~~HIGH RISK: No Repository Layer~~ ✅ PARTIALLY RESOLVED
**Update (Feb 18):** Repository ports and implementations now exist:
- `domain/ports/repositories/` — Protocol interfaces for `TechniqueRepository` and `TestRepository`.
- `infrastructure/persistence/repositories/` — SQLAlchemy implementations (`SATechniqueRepository`, `SATestRepository`) with batch query methods.
- `dependencies/repositories.py` — FastAPI `Depends()` wiring.
**Remaining:** Old routers still use direct `db.query()`. Migration is incremental — new endpoints use repositories, old ones coexist.
### 5.5. ~~HIGH RISK: No CI/CD~~ ✅ RESOLVED
**Update (Feb 18):** GitHub Actions CI pipeline exists at `.github/workflows/ci.yml`:
- Runs `ruff` lint + `pytest` on every push/PR.
- Uses PostgreSQL + Redis service containers (production-like environment).
- Local validation via `scripts/agent_validate_backend.sh`.
### 5.6. MEDIUM RISK: Background Jobs with Own Sessions
```python
# mitre_sync_job.py
db = SessionLocal()
try:
    sync_mitre(db)
finally:
    db.close()
```
Background jobs create sessions outside the request lifecycle. This is technically correct, but:
- No robust error handling (no retry mechanism).
- No observability (no structured logging).
- No dead letter queue for failed jobs.
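The first two gaps can be narrowed with a small wrapper around the job callable. This is a sketch under stated assumptions: `run_with_retry`, its parameters, and the JSON log field names are hypothetical, and a dead letter queue is out of scope here.

```python
import json
import logging
import time
from typing import Callable

logger = logging.getLogger("jobs")


def run_with_retry(
    job: Callable[[], None],
    job_name: str,
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run `job` up to `attempts` times with exponential backoff.

    Returns True on success, False once every attempt has failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            job()
        except Exception as exc:  # a background job may fail in arbitrary ways
            logger.warning(json.dumps({
                "event": "job_failed",
                "job": job_name,
                "attempt": attempt,
                "error": repr(exc),
            }))
            if attempt < attempts:
                sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
        else:
            logger.info(json.dumps({
                "event": "job_succeeded",
                "job": job_name,
                "attempt": attempt,
            }))
            return True
    return False
```

The `sleep` parameter is injectable so the backoff can be verified in tests without waiting.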
### 5.7. ~~MEDIUM RISK: Anemic Models~~ ✅ PARTIALLY RESOLVED
**Update (Feb 18):** Rich domain entities now exist alongside ORM models:
- `domain/test_entity.py` — Full state machine with business logic, domain events, dual validation, timers.
- `domain/entities/technique.py` — Status recalculation, review lifecycle, MITRE ID validation.
- `domain/value_objects/`: `MitreId`, `ScoringWeights` (immutable, validated).
- ORM models remain anemic by design (persistence mapping only). Business logic lives in domain entities.
**Remaining:** Campaign, ComplianceFramework, ThreatActor still lack domain entity counterparts.
### 5.8. ~~MEDIUM RISK: No Explicit Transaction Management~~ ✅ PARTIALLY RESOLVED
**Update (Feb 18):** A `UnitOfWork` context manager exists at `domain/unit_of_work.py` with explicit `commit()`, `rollback()`, and `flush()`. Used by `test_workflow_service.py` which explicitly states "The caller (router) is responsible for committing the session via the Unit of Work pattern."
**Remaining:** Some services like `audit_service.py` still call `db.commit()` directly. Needs incremental migration.
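The commit-or-rollback contract can be sketched as a context manager. This is not the code in `domain/unit_of_work.py`: the `SessionLike` protocol and `RecordingSession` are hypothetical, and `flush()` is omitted. The sketch rolls back unless `commit()` was called explicitly, which is one reasonable reading of "the caller is responsible for committing".

```python
from typing import List, Protocol


class SessionLike(Protocol):
    """The subset of a database session this sketch relies on."""

    def commit(self) -> None: ...
    def rollback(self) -> None: ...


class UnitOfWork:
    """Rolls back on exit unless commit() was called without an error."""

    def __init__(self, session: SessionLike) -> None:
        self._session = session
        self._committed = False

    def __enter__(self) -> "UnitOfWork":
        self._committed = False
        return self

    def commit(self) -> None:
        self._session.commit()
        self._committed = True

    def rollback(self) -> None:
        self._session.rollback()

    def __exit__(self, exc_type, exc, tb) -> None:
        if exc_type is not None or not self._committed:
            self.rollback()
        # Returning None lets any exception propagate to the caller.


class RecordingSession:
    """Test double that records which session methods were called."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def commit(self) -> None:
        self.calls.append("commit")

    def rollback(self) -> None:
        self.calls.append("rollback")
```

Services written against this shape never call `db.commit()` themselves, which is the migration still pending for `audit_service.py`.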
### 5.9. LOW RISK: No Semantic API Versioning
The API is under `/api/v1` but there is no mechanism to support v2 without duplicating entire routers.

---
## 6. Refactor Proposal Towards Clean Architecture
### 6.1. Target Structure
```
backend/
├── app/
│ ├── main.py # FastAPI setup (minimal)
│ ├── config.py # Settings (immutable)
│ │
│ ├── domain/ # ★ DOMAIN LAYER (no external dependencies)
│ │ ├── entities/ # Entities with behavior
│ │ │ ├── test.py # Test entity with can_transition(), validate()
│ │ │ ├── technique.py # Technique with calculate_status()
│ │ │ ├── campaign.py # Campaign with add_test(), activate()
│ │ │ └── ...
│ │ ├── value_objects/ # Immutable value objects
│ │ │ ├── score.py # TechniqueScore, OrganizationScore
│ │ │ ├── test_state.py # TestState with valid transitions
│ │ │ └── mitre_id.py # MitreId with validation
│ │ ├── exceptions.py # Domain exceptions (NOT HTTPException)
│ │ │ # InvalidTransitionError, EntityNotFoundError, etc.
│ │ ├── events.py # Domain events
│ │ │ # TestValidated, TestRejected, CampaignCompleted
│ │ └── ports/ # ★ INTERFACES (ABCs / Protocols)
│ │ ├── repositories/
│ │ │ ├── test_repository.py # ABC: find_by_id(), save(), list_by_technique()
│ │ │ ├── technique_repository.py
│ │ │ ├── campaign_repository.py
│ │ │ └── ...
│ │ ├── services/
│ │ │ ├── storage_port.py # ABC: upload_file(), get_presigned_url()
│ │ │ ├── notification_port.py # ABC: send_notification()
│ │ │ └── event_bus_port.py # ABC: publish(event)
│ │ └── auth/
│ │ └── token_service_port.py
│ │
│ ├── application/ # ★ APPLICATION LAYER (use cases)
│ │ ├── use_cases/
│ │ │ ├── tests/
│ │ │ │ ├── create_test.py # CreateTestUseCase
│ │ │ │ ├── start_execution.py # StartExecutionUseCase
│ │ │ │ ├── submit_red.py
│ │ │ │ ├── validate_test.py
│ │ │ │ └── get_retest_chain.py
│ │ │ ├── scoring/
│ │ │ │ ├── calculate_technique_score.py
│ │ │ │ └── calculate_organization_score.py
│ │ │ ├── campaigns/
│ │ │ │ ├── create_campaign.py
│ │ │ │ └── generate_from_threat_actor.py
│ │ │ ├── heatmap/
│ │ │ │ ├── generate_coverage_layer.py
│ │ │ │ └── export_navigator.py
│ │ │ └── reports/
│ │ │ ├── generate_coverage_report.py
│ │ │ └── export_coverage_csv.py
│ │ ├── dto/ # Input/Output DTOs for use cases
│ │ │ ├── test_dto.py
│ │ │ └── ...
│ │ └── interfaces/ # Application-level ports
│ │ └── unit_of_work.py # ABC: UnitOfWork with commit/rollback
│ │
│ ├── infrastructure/ # ★ INFRASTRUCTURE LAYER (implementations)
│ │ ├── persistence/
│ │ │ ├── orm/ # SQLAlchemy models (mapping only)
│ │ │ │ ├── test_model.py
│ │ │ │ ├── technique_model.py
│ │ │ │ └── ...
│ │ │ ├── repositories/ # Concrete implementations
│ │ │ │ ├── sqlalchemy_test_repository.py
│ │ │ │ ├── sqlalchemy_technique_repository.py
│ │ │ │ └── ...
│ │ │ ├── unit_of_work.py # SQLAlchemy UoW implementation
│ │ │ └── database.py # Engine, session factory
│ │ ├── storage/
│ │ │ └── minio_storage.py # Implements StoragePort
│ │ ├── external/ # Import services
│ │ │ ├── mitre_sync.py
│ │ │ ├── atomic_import.py
│ │ │ ├── sigma_import.py
│ │ │ └── ...
│ │ ├── auth/
│ │ │ ├── jwt_service.py # Implements TokenServicePort
│ │ │ └── token_blacklist.py # Redis-backed blacklist
│ │ ├── notifications/
│ │ │ └── db_notification_service.py
│ │ ├── jobs/
│ │ │ └── scheduler.py # APScheduler setup
│ │ └── cache/
│ │ └── redis_cache.py # Score caching (Redis)
│ │
│ └── presentation/ # ★ PRESENTATION LAYER (HTTP)
│ ├── api/
│ │ ├── v1/
│ │ │ ├── tests.py # Routing + request/response mapping only
│ │ │ ├── techniques.py
│ │ │ ├── heatmap.py
│ │ │ └── ...
│ │ └── dependencies.py # FastAPI Depends() wiring
│ ├── schemas/ # Pydantic schemas (request/response)
│ │ ├── test_schema.py
│ │ └── ...
│ ├── middleware/
│ │ ├── error_handler.py # Domain exceptions → HTTP responses
│ │ └── rate_limiter.py
│ └── mappers/ # Entity ↔ Schema mappers
│ ├── test_mapper.py
│ └── ...
```
### 6.2. Dependency Rules
```
Presentation → Application → Domain ← Infrastructure
     ↓              ↓           ↑            ↑
  FastAPI       Use Cases    Entities    SQLAlchemy
  Pydantic      DTOs         Ports       MinIO
                                         Redis
                                         APScheduler
```
**The golden rule:** Dependencies only point towards the center (Domain). Infrastructure implements the ports defined in Domain.
### 6.3. Key Changes by Layer
#### Domain Layer (New)
```python
# domain/entities/test.py — Rich entity (not anemic)
class TestEntity:
    def __init__(self, id, state, technique_id):  # further fields elided
        self.id = id
        self._state = state
        self.technique_id = technique_id

    def can_transition_to(self, target: TestState) -> bool:
        return target in VALID_TRANSITIONS[self._state]

    def start_execution(self, user: UserEntity) -> list[DomainEvent]:
        if not self.can_transition_to(TestState.red_executing):
            raise InvalidTransitionError(self._state, TestState.red_executing)
        self._state = TestState.red_executing
        return [TestExecutionStarted(test_id=self.id, user_id=user.id)]

# domain/exceptions.py — Domain exceptions, NOT HTTPException
class InvalidTransitionError(DomainException):
    def __init__(self, current: TestState, target: TestState):
        # Pass a message up so str(exc) is usable as an HTTP error detail.
        super().__init__(f"Cannot transition from {current} to {target}")
        self.current = current
        self.target = target

# domain/ports/repositories/test_repository.py — Abstract interface
class TestRepository(Protocol):
    def find_by_id(self, test_id: UUID) -> TestEntity | None: ...
    def save(self, test: TestEntity) -> None: ...
    def list_by_technique(self, technique_id: UUID) -> list[TestEntity]: ...
```
#### Application Layer (Use Cases)
```python
# application/use_cases/tests/start_execution.py
class StartExecutionUseCase:
    # UserRepository is a hypothetical port, added so that `user` below is defined.
    def __init__(self, test_repo: TestRepository, user_repo: UserRepository, uow: UnitOfWork):
        self._test_repo = test_repo
        self._user_repo = user_repo
        self._uow = uow

    def execute(self, test_id: UUID, user_id: UUID) -> TestDTO:
        with self._uow:
            test = self._test_repo.find_by_id(test_id)
            if not test:
                raise EntityNotFoundError("Test", test_id)
            user = self._user_repo.find_by_id(user_id)
            if not user:
                raise EntityNotFoundError("User", user_id)
            events = test.start_execution(user)
            self._test_repo.save(test)
            self._uow.commit()
        # events are published after commit
        return TestDTO.from_entity(test)
```
#### Presentation Layer (Slim Routers)
```python
# presentation/api/v1/tests.py — HTTP concerns only
@router.post("/{test_id}/start-execution")
def start_execution(
    test_id: UUID,
    use_case: StartExecutionUseCase = Depends(get_start_execution_use_case),
    current_user: User = Depends(get_current_user),
):
    try:
        result = use_case.execute(test_id, current_user.id)
        return result
    except EntityNotFoundError:
        raise HTTPException(404)
    except InvalidTransitionError as e:
        raise HTTPException(400, detail=str(e))
```
#### Infrastructure Layer (Implementations)
```python
# infrastructure/persistence/repositories/sqlalchemy_test_repository.py
class SQLAlchemyTestRepository(TestRepository):
    def __init__(self, session: Session):
        self._session = session

    def find_by_id(self, test_id: UUID) -> TestEntity | None:
        model = self._session.query(TestModel).filter(TestModel.id == test_id).first()
        return TestMapper.to_entity(model) if model else None

    def save(self, test: TestEntity) -> None:
        model = TestMapper.to_model(test)
        self._session.merge(model)
```
### 6.4. Incremental Migration Plan (Phases)
**The refactor must be incremental — not big bang.** Each phase delivers value and the system continues working.
#### Phase 1: Foundations (1-2 weeks)
1. Create the directory structure: `domain/`, `application/`, `infrastructure/`, `presentation/`.
2. Create `domain/exceptions.py` with domain exceptions.
3. Create `error_handler.py` middleware that maps domain exceptions → HTTP responses.
4. Create `domain/ports/repositories/` with Protocol interfaces for the 3-4 most used entities (Test, Technique, Campaign).
5. Create SQLAlchemy implementations of these repositories.
6. **Do not move routers yet.**
#### Phase 2: Extract the Test Domain (1-2 weeks)
1. Create `domain/entities/test.py` with the state machine (extract from `test_workflow_service`).
2. Create use cases for each state transition.
3. Migrate the `tests.py` router to use the use cases.
4. Remove `HTTPException` from `test_workflow_service`.
5. **Pure unit tests** for the domain entity (no DB).
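A pure unit test in the sense of step 5 needs nothing but the entity and the transition table. The sketch below is self-contained and illustrative: the three states and their transitions are an assumed subset (only `red_executing` appears in this document), and the entity is a stripped-down stand-in for the real one.

```python
from enum import Enum


class TestState(Enum):
    draft = "draft"                  # assumed state name
    red_executing = "red_executing"
    validated = "validated"          # assumed state name


# Assumed subset of transitions, for illustration only.
VALID_TRANSITIONS = {
    TestState.draft: {TestState.red_executing},
    TestState.red_executing: {TestState.validated},
    TestState.validated: set(),
}


class InvalidTransitionError(Exception):
    pass


class TestEntity:
    def __init__(self, state: TestState) -> None:
        self.state = state

    def transition_to(self, target: TestState) -> None:
        if target not in VALID_TRANSITIONS[self.state]:
            raise InvalidTransitionError(f"{self.state.name} -> {target.name}")
        self.state = target


def test_draft_can_start_execution() -> None:
    entity = TestEntity(TestState.draft)
    entity.transition_to(TestState.red_executing)
    assert entity.state is TestState.red_executing


def test_validated_is_terminal() -> None:
    entity = TestEntity(TestState.validated)
    try:
        entity.transition_to(TestState.draft)
    except InvalidTransitionError:
        return
    raise AssertionError("expected InvalidTransitionError")
```

Neither test opens a session or starts a container, so the whole state machine can be exercised in milliseconds.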
#### Phase 3: Extract Fat Services from Routers (2-3 weeks)
1. Move `heatmap.py` logic to `application/use_cases/heatmap/`.
2. Move `reports.py` logic to `application/use_cases/reports/`.
3. Move `metrics.py` logic to application services.
4. Routers become thin controllers (< 20 lines per endpoint).
#### Phase 4: Complete Repository Pattern (1-2 weeks)
1. Create repositories for all remaining entities.
2. Migrate scattered queries from routers to repositories.
3. Remove `db.query(...)` from any file outside `infrastructure/`.
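Step 3 can be enforced mechanically once the migration is done. The following is a rough text-based check, not an existing project script: `find_violations` and its defaults are hypothetical, and it only catches the literal `db.query(` spelling, so sessions bound to other variable names would slip through.

```python
import re
from pathlib import Path
from typing import List

# Matches the literal call form used in the routers and services shown above.
QUERY_CALL = re.compile(r"\bdb\.query\(")


def find_violations(root: Path, allowed_dir: str = "infrastructure") -> List[str]:
    """Return 'path:line' entries for db.query( calls outside allowed_dir."""
    violations = []
    for path in sorted(root.rglob("*.py")):
        relative = path.relative_to(root)
        if allowed_dir in relative.parts:
            continue  # queries are permitted inside the infrastructure layer
        text = path.read_text(encoding="utf-8")
        for number, line in enumerate(text.splitlines(), start=1):
            if QUERY_CALL.search(line):
                violations.append(f"{relative.as_posix()}:{number}")
    return violations
```

Running it over the backend package in CI and failing on a non-empty result would keep new direct queries from creeping back in.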
#### Phase 5: Robust Infrastructure (1-2 weeks)
1. Move token blacklist to Redis.
2. Implement the Unit of Work pattern.
3. Move scoring config to the database (not mutable `settings`).
4. Add event bus for domain events (notifications, auditing).
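The event bus in step 4 can start as a synchronous, in-process dispatcher. This is a minimal sketch: `InMemoryEventBus` and the `test_id` field are assumptions, while the `TestValidated` event name comes from the target structure above. Auditing and notifications become subscribers instead of direct imports.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, DefaultDict, List


@dataclass(frozen=True)
class TestValidated:
    """Domain event named in the target structure; the field is assumed."""

    test_id: str


class InMemoryEventBus:
    """Synchronous bus: handlers run in subscription order."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[type, List[Callable[[object], None]]] = defaultdict(list)

    def subscribe(self, event_type: type, handler: Callable[[object], None]) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event: object) -> None:
        # .get avoids creating empty entries for events nobody listens to.
        for handler in self._handlers.get(type(event), []):
            handler(event)


audit_log: List[str] = []
notifications: List[str] = []

bus = InMemoryEventBus()
bus.subscribe(TestValidated, lambda e: audit_log.append(f"validated:{e.test_id}"))
bus.subscribe(TestValidated, lambda e: notifications.append(e.test_id))
bus.publish(TestValidated(test_id="t-1"))
```

A use case would call `publish` only after the unit of work has committed, so subscribers never react to a change that was rolled back.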
#### Phase 6: CI/CD and Observability
1. Set up GitHub Actions (lint, type check, tests).
2. Add structured logging.
3. Add improved health checks.
---
## 7. Executive Summary
### Current Strengths
| Strength | Detail |
|----------|--------|
| Well-modeled domain | The data model covers ATT&CK, D3FEND, compliance, threat actors, and campaigns comprehensively |
| Solid test workflow | The state machine in `test_workflow_service` is the best designed component |
| Clean frontend | API/pages/components separation with TanStack Query is correct |
| Secure auth | HttpOnly cookies + RBAC with 6 well-defined roles |
| Import services | The 8 import services are well encapsulated |
| Existing tests | 18 test files with fixtures — a foundation to build upon |
### Critical Weaknesses (Updated Feb 18)
| Weakness | Original Severity | Current Status |
|----------|----------|--------|
| Fat controllers (routers with business logic) | CRITICAL | Partially resolved — heatmap extracted |
| No repository layer | HIGH | ✅ Partially resolved (Test, Technique repos exist; old routers still query directly) |
| Services depend on FastAPI | HIGH | ✅ Resolved (domain exceptions + middleware) |
| Anemic models | MEDIUM | ✅ Partially resolved (TestEntity, TechniqueEntity) |
| In-memory token blacklist | CRITICAL | ✅ Resolved (Redis-backed) |
| Mutable settings at runtime | HIGH | Open |
| No CI/CD | HIGH | ✅ Resolved (GitHub Actions) |
| No dependency inversion | HIGH | ✅ Partially resolved (ports + repos) |
### Final Classification
```
┌──────────────────────────────────────────────────────────┐
│ Type:        Clean Modular Monolith (in transition)      │
│ Maturity:    Pre-production → Production-ready           │
│ SOLID:       3.5/5 (SRP partial, DIP started, OCP/ISP   │
│              in progress)                                │
│ Testability: 6/10 (326 tests, domain unit tests, repo    │
│              integration tests)                          │
│ Coupling:    5/10 (domain layer fully decoupled, old     │
│              routers still coupled)                      │
│ Cohesion:    7/10 (domain entities own business rules)   │
│ Estimated remaining tech debt: ~2-3 weeks                │
└──────────────────────────────────────────────────────────┘
```
### Recommendation (Updated Feb 18)
The foundational Clean Architecture layers are now in place. The migration is proceeding incrementally. **The top 4 immediate priorities from the original analysis are all resolved:**
1. ~~Extract domain exceptions~~ ✅ Done
2. ~~Create repositories for Test and Technique~~ ✅ Done
3. ~~Move token blacklist to Redis~~ ✅ Done
4. ~~Set up basic CI/CD~~ ✅ Done
**Next priorities:**
1. Migrate fat routers to use repositories (incremental, per-router)
2. Persist scoring weights in database
3. Create domain entities for Campaign and ComplianceFramework
4. Add structured JSON logging