# Aegis — Deep Architectural Analysis > **Author:** Automated architecture review > **Date:** February 11, 2026 (updated February 19, 2026) > **Scope:** Backend (FastAPI/Python), Frontend (React/TypeScript), Infrastructure (Docker) > > **Note:** Sections marked with ✅ reflect changes implemented since the initial analysis. --- ## Table of Contents 1. [Current Architecture](#1-current-architecture) 2. [Coupling Analysis](#2-coupling-analysis) 3. [Business Logic vs Infrastructure Separation](#3-business-logic-vs-infrastructure-separation) 4. [SOLID Evaluation](#4-solid-evaluation) 5. [Architectural Risks](#5-architectural-risks) 6. [Refactor Proposal Towards Clean Architecture](#6-refactor-proposal-towards-clean-architecture) 7. [Executive Summary](#7-executive-summary) --- ## 1. Current Architecture ### 1.1. Classification: Layered Monolith with Incomplete Service Layer Aegis follows a **layered monolithic architecture** deployed as two containers (backend + frontend) with a **partial and inconsistent** level of separation. It is not Clean Architecture, nor Hexagonal, nor microservices. ``` ┌─────────────────────────────────────────────────┐ │ FRONTEND │ │ React 19 + TypeScript + Vite │ │ ┌──────────┐ ┌──────────┐ ┌───────────────┐ │ │ │ Pages │→ │ API Layer│→ │ Axios Client │ │ │ │(21 pages)│ │(22 mods) │ │(HttpOnly JWT) │ │ │ └──────────┘ └──────────┘ └───────────────┘ │ └────────────────────────┬────────────────────────┘ │ HTTP/REST ┌────────────────────────▼────────────────────────┐ │ BACKEND │ │ FastAPI + SQLAlchemy │ │ │ │ ┌─────────────────────────────────────────────┐ │ │ │ Router Layer (21 routers) │ │ │ │ Contains: validation, queries, partial │ │ │ │ business logic, serialization, auditing │ │ │ └────────┬──────────────────┬─────────────────┘ │ │ │ │ │ │ ┌────────▼───────┐ ┌──────▼──────────────────┐ │ │ │ Service Layer │ │ Direct DB Access │ │ │ │ (20 services) │ │ (SQLAlchemy queries │ │ │ │ Partial: only │ │ inside routers) │ │ │ │ for workflows │ │ │ │ │ └────────┬───────┘ └──────┬──────────────────┘ │ │ │ │ │ │ ┌────────▼──────────────────▼─────────────────┐ │ │ │ Model Layer (18 models) │ │ │ │ SQLAlchemy ORM — Anemic Domain Models │ │ │ └────────────────────┬────────────────────────┘ │ │ │ │ │ ┌────────────────────▼────────────────────────┐ │ │ │ Database Layer │ │ │ │ PostgreSQL + MinIO (evidence storage) │ │ │ └─────────────────────────────────────────────┘ │ └──────────────────────────────────────────────────┘ ``` ### 1.2. Actual Distribution of Responsibilities | Layer | Files | Actual Responsibility | |-------|-------|----------------------| | **Routers** | 21 files | ✅ Thin HTTP adapters — auth, param parsing, response formatting. Delegate to services. | | **Services** | 30+ files | ✅ All business logic, query orchestration, domain validation. Framework-agnostic. | | **Domain** | 8+ files | ✅ Pure entities, value objects, ports, errors. Zero framework imports. | | **Infrastructure** | 5+ files | ✅ Repository implementations, Redis client, mappers. | | **Models** | 19 files | ORM table definitions — persistence mapping only | | **Schemas** | 10 files | Pydantic DTOs for request/response | | **Database** | 1 file | Session factory and `get_db()` generator | ### 1.3. ✅ Consistent Delegation Pattern (was: Two Coexisting Patterns) **Update (Feb 19):** The "split architectural personality" has been resolved. All major routers now follow the same pattern: **Pattern — Router-delegates-to-Service:** Routers are thin HTTP adapters that parse parameters, authenticate, and delegate to framework-agnostic services: ```python # threat_actors.py — thin adapter, all logic in service @router.get("/{actor_id}") def get_threat_actor(actor_id: str, db=Depends(get_db), current_user=Depends(get_current_user)): return get_actor_detail(db, actor_id) ``` Extracted services: `coverage_report_service`, `metrics_query_service`, `compliance_service`, `detection_rule_service`, `threat_actor_service`, `test_crud_service`, `evidence_service`, `campaign_crud_service`, `scoring_config_service`. **Remaining:** `users.py`, `audit.py`, `data_sources.py`, `heatmap.py` still have direct queries. These are lower priority since they are simpler or already partially extracted. --- ## 2. Coupling Analysis ### 2.1. Coupling Matrix ``` Routers Services Models Database Schemas Config Routers — MEDIUM HIGH HIGH HIGH LOW Services LOW — HIGH HIGH NONE MEDIUM Models NONE NONE — HIGH NONE NONE Schemas NONE NONE LOW — NONE NONE Database NONE NONE NONE — NONE LOW ``` ### 2.2. Router ↔ Model — ✅ LARGELY RESOLVED (was HIGH COUPLING) **Update (Feb 19):** Most routers no longer import ORM models or execute queries directly. Only **4 out of 21 routers** still have direct DB access: | Router | Status | Detail | |--------|--------|--------| | `techniques.py` | ✅ Extracted | Uses `SATechniqueRepository` via dependency injection | | `reports.py` | ✅ Extracted | Delegates to `coverage_report_service` | | `metrics.py` | ✅ Extracted | Delegates to `metrics_query_service` | | `compliance.py` | ✅ Extracted | Delegates to `compliance_service` | | `detection_rules.py` | ✅ Extracted | Delegates to `detection_rule_service` | | `threat_actors.py` | ✅ Extracted | Delegates to `threat_actor_service` | | `tests.py` | ✅ Extracted | Delegates to `test_crud_service` + `test_workflow_service` | | `evidence.py` | ✅ Extracted | Delegates to `evidence_service` | | `campaigns.py` | ✅ Extracted | Delegates to `campaign_crud_service` | | `users.py` | Remaining | Direct queries (simple CRUD) | | `audit.py` | Remaining | Direct queries (read-only list) | | `data_sources.py` | Remaining | Direct queries | | `heatmap.py` | Remaining | Complex queries (partially extracted via `heatmap_service`) | ### 2.3. Router ↔ Database — HIGH COUPLING All routers receive `db: Session = Depends(get_db)` and operate with the SQLAlchemy session directly. This means: - Routers know the ORM (`db.query`, `db.add`, `db.commit`, `joinedload`) - Routers handle transactions implicitly - There is no persistence abstraction — migrating from SQLAlchemy to another ORM or raw queries would require rewriting **all** routers ### 2.4. Service ↔ Model/Database — HIGH COUPLING Services also access SQLAlchemy directly: ```python # scoring_service.py all_tests = db.query(Test).filter(Test.technique_id == technique.id).all() # notification_service.py notif = db.query(Notification).filter(...).first() ``` Services do not use repositories or abstractions — they are essentially functions that orchestrate queries and logic. ### 2.5. Service ↔ Service — MEDIUM COUPLING Inter-service coupling exists: - `test_workflow_service` → `audit_service` + `notification_service` - `scoring_service` reads from `settings` directly (mutable global config) - `campaign_scheduler_service` → `campaign_service` There is no dependency injection between services — everything is direct imports. ### 2.6. Service ↔ Framework — ✅ RESOLVED (was HIGH COUPLING) ~~Domain services import `HTTPException` from FastAPI.~~ **Update (Feb 18):** `test_workflow_service.py` now raises domain exceptions (`InvalidOperationError`, `InvalidStateTransition`) from `app.domain.exceptions`. The `middleware/error_handler.py` maps these to HTTP responses automatically. Services no longer import `HTTPException`. ```python # Current: domain/errors.py exceptions mapped by middleware raise InvalidStateTransition(current_state=..., target_state=..., entity_type="Test") # middleware/error_handler.py → 400 Bad Request automatically ``` ### 2.7. Frontend ↔ Backend — LOW COUPLING (Correct) Communication is via REST API with aligned but independent types (`types/models.ts` vs `schemas/*.py`). The frontend uses Axios with interceptors — good decoupling. --- ## 3. Business Logic vs Infrastructure Separation ### 3.1. Diagnosis: ✅ MOSTLY RESOLVED (was INSUFFICIENT SEPARATION) **Update (Feb 19):** All major routers have been refactored to delegate to framework-agnostic services. | Aspect | Status | Detail | |--------|--------|--------| | **Workflow logic** | ✅ WELL SEPARATED | `test_workflow_service.py` encapsulates the state machine with domain exceptions | | **Scoring** | ✅ WELL SEPARATED | `scoring_service.py` reads weights from DB via `scoring_config_service.py` (no more mutable global state) | | **Test CRUD** | ✅ SEPARATED | `test_crud_service.py` handles all CRUD, validation, and permission checks with domain exceptions | | **Report generation** | ✅ SEPARATED | `coverage_report_service.py` handles query aggregation and CSV building (N+1 fixed) | | **Metrics** | ✅ SEPARATED | `metrics_query_service.py` handles dashboard aggregation queries | | **Compliance** | ✅ SEPARATED | `compliance_service.py` handles framework analysis and gap detection | | **Detection rules** | ✅ SEPARATED | `detection_rule_service.py` handles queries, auto-association, and evaluation | | **Threat actors** | ✅ SEPARATED | `threat_actor_service.py` handles queries, coverage, and gap analysis (N+1 fixed) | | **Evidence** | ✅ SEPARATED | `evidence_service.py` handles permission validation and queries with domain exceptions | | **Campaigns** | ✅ SEPARATED | `campaign_crud_service.py` handles CRUD, lifecycle, and scheduling | | **Heatmap/visualization** | PARTIAL | `heatmap_service.py` exists but router still has some logic | | **Data import** | WELL SEPARATED | The 8 import services are correctly isolated | | **Notifications** | WELL SEPARATED | `notification_service.py` encapsulates all logic | | **Auditing** | WELL SEPARATED | `audit_service.py` is a pure `log_action()` function | ### 3.2. Anemic Model (Anti-pattern) SQLAlchemy models are purely declarative — they have no business methods: ```python # models/test.py — columns only, zero behavior class Test(Base): __tablename__ = "tests" id = Column(UUID, primary_key=True) state = Column(Enum(TestState)) # ... more columns # Missing: can_transition(), validate(), calculate_score() ``` Logic that should be in domain models (business validations, state transitions, calculations) is scattered across routers and services. ### 3.3. Infrastructure Bleeding Into Logic | Infrastructure | Where It Appears Inappropriately | |---------------|--------------------------------| | `SQLAlchemy Session` | Inside domain services (scoring, workflow, notifications) | | `FastAPI HTTPException` | Inside domain services (test_workflow_service) | | `MinIO/boto3` | `storage.py` is well isolated, but called from routers directly | | `APScheduler` | Directly coupled in `jobs/mitre_sync_job.py` with `SessionLocal()` | --- ## 4. SOLID Evaluation ### 4.1. Single Responsibility Principle (SRP) — ✅ MOSTLY COMPLIANT (was PARTIAL VIOLATION) **Update (Feb 19):** Fat routers have been slimmed. Each router is now a thin HTTP adapter. | Component | Compliant? | Detail | |-----------|-----------|-------| | `heatmap.py` (router) | PARTIAL | Still has some inline logic; `heatmap_service` exists but not fully extracted | | `reports.py` (router) | ✅ YES | Thin adapter → `coverage_report_service` | | `tests.py` (router) | ✅ YES | Thin adapter → `test_crud_service` + `test_workflow_service` | | `campaigns.py` (router) | ✅ YES | Thin adapter → `campaign_crud_service` | | `evidence.py` (router) | ✅ YES | Thin adapter → `evidence_service` | | `scoring_service.py` | ✅ YES | Reads weights from `scoring_config_service` (DB-backed, not mutable settings) | | `test_workflow_service.py` | ✅ YES | Single responsibility: test state machine | | `notification_service.py` | ✅ YES | Single responsibility: notification management | | `audit_service.py` | ✅ YES | Single responsibility: audit logging | **Verdict:** All major routers now comply with SRP. Only `heatmap.py` and a few minor routers have remaining inline logic. ### 4.2. Open/Closed Principle (OCP) — ✅ PARTIALLY RESOLVED (was VIOLATION) **Update (Feb 19):** - **Scoring weights:** ✅ Resolved — Weights are now persisted in the `scoring_config` DB table via `scoring_config_service.py`. The `ScoringWeights` value object validates invariants (sum = 100, non-negative). No more mutable global `settings`. - **Heatmap layers:** Each heatmap type is a separate endpoint with hardcoded logic. Adding a new layer type requires modifying the router. - **Import services:** Each data source is a separate service without a common interface. Adding a new source requires creating a new service AND modifying `data_sources.py` and `system.py`. - **Test states:** The state machine is well defined in `VALID_TRANSITIONS`, but adding a new state requires modifying the dictionary AND potentially all services that read `TestState`. ### 4.3. Liskov Substitution Principle (LSP) — N/A (Partial) There is no significant inheritance or polymorphism in the backend. Services are functions, not classes. There are no interfaces or abstract classes. **Does not directly apply**, but the absence of formal contracts (protocols/ABCs) is a symptom of not being designed for extensibility. ### 4.4. Interface Segregation Principle (ISP) — ✅ PARTIALLY RESOLVED (was VIOLATION) **Update (Feb 19):** - ✅ Protocol interfaces exist for `TechniqueRepository` and `TestRepository` in `domain/ports/repositories/`. - Services expose focused functions per module (e.g., `threat_actor_service` exposes 4 functions, each for one use case). - The `Settings` object is still monolithic but scoring weights have been extracted to a dedicated DB table with a focused service interface. ### 4.5. Dependency Inversion Principle (DIP) — ✅ PARTIALLY RESOLVED (was SEVERE VIOLATION) **Update (Feb 18):** Protocol interfaces and abstractions now exist: ```python # domain/ports/repositories/ — Protocol interfaces class TechniqueRepository(Protocol): def find_by_id(self, technique_id: UUID) -> TechniqueEntity | None: ... def save(self, technique: TechniqueEntity) -> TechniqueEntity: ... # dependencies/repositories.py — FastAPI Depends() wiring def get_technique_repository(db=Depends(get_db)) -> SATechniqueRepository: ... ``` - **Domain layer** has zero framework imports (no FastAPI, no SQLAlchemy). - **Repository ports** define contracts; infrastructure implements them. - `test_workflow_service.py` now uses domain exceptions instead of `HTTPException`. - `UnitOfWork` manages transactions. **Remaining:** Some services still use direct imports for `audit_service`, `notification_service`. Full DIP adoption is incremental. --- ## 5. Architectural Risks ### 5.1. ✅ RESOLVED: God Routers (was CRITICAL RISK) **Update (Feb 19):** All critical "fat routers" have been refactored to thin HTTP adapters: | Router | Before | After | Service | |--------|--------|-------|---------| | `tests.py` | 664 lines | ~300 lines (workflow endpoints unchanged) | `test_crud_service.py` | | `campaigns.py` | ~400+ lines | ~200 lines | `campaign_crud_service.py` | | `reports.py` | 273 lines | ~100 lines | `coverage_report_service.py` | | `compliance.py` | ~350+ lines | ~100 lines | `compliance_service.py` | | `metrics.py` | ~250 lines | ~80 lines | `metrics_query_service.py` | | `detection_rules.py` | 374 lines | ~130 lines | `detection_rule_service.py` | | `threat_actors.py` | 312 lines | ~100 lines | `threat_actor_service.py` | | `evidence.py` | 367 lines | ~200 lines | `evidence_service.py` | **Remaining:** `heatmap.py` still has inline logic (~528 lines). Lower priority since it's already partially extracted to `heatmap_service`. ### 5.2. ~~CRITICAL RISK: In-Memory Token Blacklist~~ ✅ RESOLVED **Update (Feb 18):** The token blacklist is now Redis-backed via `infrastructure/redis_client.py`. Tokens are stored with TTL matching expiration. Shared across all workers and survives restarts. ### 5.3. ✅ RESOLVED: Mutable Settings at Runtime (was HIGH RISK) **Update (Feb 19):** Scoring weights are now persisted in the `scoring_config` database table via `scoring_config_service.py`. The `PATCH /scores/config` endpoint writes to the DB instead of mutating the `settings` object. The `ScoringWeights` value object validates that weights sum to 100 and are non-negative. ```python # scoring_config_service.py — DB-backed, validated, persistent def update_scoring_weights(db: Session, *, tests=None, ...) -> dict: new = ScoringWeights(tests=..., ...) # validates invariants row = db.query(ScoringConfig).first() ... db.commit() ``` - ✅ Changes survive restarts (persisted in DB) - ✅ Thread-safe (DB transactions) - ✅ Validated via `ScoringWeights` value object - Falls back to env-var defaults when no DB row exists ### 5.4. ~~HIGH RISK: No Repository Layer~~ ✅ PARTIALLY RESOLVED **Update (Feb 18):** Repository ports and implementations now exist: - `domain/ports/repositories/` — Protocol interfaces for `TechniqueRepository` and `TestRepository`. - `infrastructure/persistence/repositories/` — SQLAlchemy implementations (`SATechniqueRepository`, `SATestRepository`) with batch query methods. - `dependencies/repositories.py` — FastAPI `Depends()` wiring. **Remaining:** Old routers still use direct `db.query()`. Migration is incremental — new endpoints use repositories, old ones coexist. ### 5.5. ~~HIGH RISK: No CI/CD~~ ✅ RESOLVED **Update (Feb 18):** GitHub Actions CI pipeline exists at `.github/workflows/ci.yml`: - Runs `ruff` lint + `pytest` on every push/PR. - Uses PostgreSQL + Redis service containers (production-like environment). - Local validation via `scripts/agent_validate_backend.sh`. ### 5.6. MEDIUM RISK: Background Jobs with Own Sessions (partially mitigated) ```python # mitre_sync_job.py db = SessionLocal() try: sync_mitre(db) finally: db.close() ``` Background jobs create sessions outside the request lifecycle. This is technically correct, but: - No robust error handling (no retry mechanism). - ✅ Structured JSON logging now available (`logging_config.py`) - No dead letter queue for failed jobs. ### 5.7. ~~MEDIUM RISK: Anemic Models~~ ✅ PARTIALLY RESOLVED **Update (Feb 18):** Rich domain entities now exist alongside ORM models: - `domain/test_entity.py` — Full state machine with business logic, domain events, dual validation, timers. - `domain/entities/technique.py` — Status recalculation, review lifecycle, MITRE ID validation. - `domain/value_objects/` — `MitreId`, `ScoringWeights` (immutable, validated). - ORM models remain anemic by design (persistence mapping only). Business logic lives in domain entities. **Remaining:** Campaign, ComplianceFramework, ThreatActor still lack domain entity counterparts. ### 5.8. ~~MEDIUM RISK: No Explicit Transaction Management~~ ✅ PARTIALLY RESOLVED **Update (Feb 18):** A `UnitOfWork` context manager exists at `domain/unit_of_work.py` with explicit `commit()`, `rollback()`, and `flush()`. Used by `test_workflow_service.py` which explicitly states "The caller (router) is responsible for committing the session via the Unit of Work pattern." **Remaining:** Some services like `audit_service.py` still call `db.commit()` directly. Needs incremental migration. ### 5.9. LOW RISK: No Semantic API Versioning The API is under `/api/v1` but there is no mechanism to support v2 without duplicating entire routers. --- ## 6. Refactor Proposal Towards Clean Architecture ### 6.1. Target Structure ``` backend/ ├── app/ │ ├── main.py # FastAPI setup (minimal) │ ├── config.py # Settings (immutable) │ │ │ ├── domain/ # ★ DOMAIN LAYER (no external dependencies) │ │ ├── entities/ # Entities with behavior │ │ │ ├── test.py # Test entity with can_transition(), validate() │ │ │ ├── technique.py # Technique with calculate_status() │ │ │ ├── campaign.py # Campaign with add_test(), activate() │ │ │ └── ... │ │ ├── value_objects/ # Immutable value objects │ │ │ ├── score.py # TechniqueScore, OrganizationScore │ │ │ ├── test_state.py # TestState with valid transitions │ │ │ └── mitre_id.py # MitreId with validation │ │ ├── exceptions.py # Domain exceptions (NOT HTTPException) │ │ │ # InvalidTransitionError, EntityNotFoundError, etc. │ │ ├── events.py # Domain events │ │ │ # TestValidated, TestRejected, CampaignCompleted │ │ └── ports/ # ★ INTERFACES (ABCs / Protocols) │ │ ├── repositories/ │ │ │ ├── test_repository.py # ABC: find_by_id(), save(), list_by_technique() │ │ │ ├── technique_repository.py │ │ │ ├── campaign_repository.py │ │ │ └── ... │ │ ├── services/ │ │ │ ├── storage_port.py # ABC: upload_file(), get_presigned_url() │ │ │ ├── notification_port.py # ABC: send_notification() │ │ │ └── event_bus_port.py # ABC: publish(event) │ │ └── auth/ │ │ └── token_service_port.py │ │ │ ├── application/ # ★ APPLICATION LAYER (use cases) │ │ ├── use_cases/ │ │ │ ├── tests/ │ │ │ │ ├── create_test.py # CreateTestUseCase │ │ │ │ ├── start_execution.py # StartExecutionUseCase │ │ │ │ ├── submit_red.py │ │ │ │ ├── validate_test.py │ │ │ │ └── get_retest_chain.py │ │ │ ├── scoring/ │ │ │ │ ├── calculate_technique_score.py │ │ │ │ └── calculate_organization_score.py │ │ │ ├── campaigns/ │ │ │ │ ├── create_campaign.py │ │ │ │ └── generate_from_threat_actor.py │ │ │ ├── heatmap/ │ │ │ │ ├── generate_coverage_layer.py │ │ │ │ └── export_navigator.py │ │ │ └── reports/ │ │ │ ├── generate_coverage_report.py │ │ │ └── export_coverage_csv.py │ │ ├── dto/ # Input/Output DTOs for use cases │ │ │ ├── test_dto.py │ │ │ └── ... │ │ └── interfaces/ # Application-level ports │ │ └── unit_of_work.py # ABC: UnitOfWork with commit/rollback │ │ │ ├── infrastructure/ # ★ INFRASTRUCTURE LAYER (implementations) │ │ ├── persistence/ │ │ │ ├── orm/ # SQLAlchemy models (mapping only) │ │ │ │ ├── test_model.py │ │ │ │ ├── technique_model.py │ │ │ │ └── ... │ │ │ ├── repositories/ # Concrete implementations │ │ │ │ ├── sqlalchemy_test_repository.py │ │ │ │ ├── sqlalchemy_technique_repository.py │ │ │ │ └── ... │ │ │ ├── unit_of_work.py # SQLAlchemy UoW implementation │ │ │ └── database.py # Engine, session factory │ │ ├── storage/ │ │ │ └── minio_storage.py # Implements StoragePort │ │ ├── external/ # Import services │ │ │ ├── mitre_sync.py │ │ │ ├── atomic_import.py │ │ │ ├── sigma_import.py │ │ │ └── ... │ │ ├── auth/ │ │ │ ├── jwt_service.py # Implements TokenServicePort │ │ │ └── token_blacklist.py # Redis-backed blacklist │ │ ├── notifications/ │ │ │ └── db_notification_service.py │ │ ├── jobs/ │ │ │ └── scheduler.py # APScheduler setup │ │ └── cache/ │ │ └── redis_cache.py # Score caching (Redis) │ │ │ └── presentation/ # ★ PRESENTATION LAYER (HTTP) │ ├── api/ │ │ ├── v1/ │ │ │ ├── tests.py # Routing + request/response mapping only │ │ │ ├── techniques.py │ │ │ ├── heatmap.py │ │ │ └── ... │ │ └── dependencies.py # FastAPI Depends() wiring │ ├── schemas/ # Pydantic schemas (request/response) │ │ ├── test_schema.py │ │ └── ... │ ├── middleware/ │ │ ├── error_handler.py # Domain exceptions → HTTP responses │ │ └── rate_limiter.py │ └── mappers/ # Entity ↔ Schema mappers │ ├── test_mapper.py │ └── ... ``` ### 6.2. Dependency Rules ``` Presentation → Application → Domain ← Infrastructure ↓ ↓ ↑ ↑ FastAPI Use Cases Entities SQLAlchemy Pydantic DTOs Ports MinIO Redis APScheduler ``` **The golden rule:** Dependencies only point towards the center (Domain). Infrastructure implements the ports defined in Domain. ### 6.3. Key Changes by Layer #### Domain Layer (New) ```python # domain/entities/test.py — Rich entity (not anemic) class TestEntity: def __init__(self, id, state, technique_id, ...): self._state = state def can_transition_to(self, target: TestState) -> bool: return target in VALID_TRANSITIONS[self._state] def start_execution(self, user: UserEntity) -> list[DomainEvent]: if not self.can_transition_to(TestState.red_executing): raise InvalidTransitionError(self._state, TestState.red_executing) self._state = TestState.red_executing return [TestExecutionStarted(test_id=self.id, user_id=user.id)] # domain/exceptions.py — Domain exceptions, NOT HTTPException class InvalidTransitionError(DomainException): def __init__(self, current: TestState, target: TestState): self.current = current self.target = target # domain/ports/repositories/test_repository.py — Abstract interface class TestRepository(Protocol): def find_by_id(self, test_id: UUID) -> TestEntity | None: ... def save(self, test: TestEntity) -> None: ... def list_by_technique(self, technique_id: UUID) -> list[TestEntity]: ... ``` #### Application Layer (Use Cases) ```python # application/use_cases/tests/start_execution.py class StartExecutionUseCase: def __init__(self, test_repo: TestRepository, uow: UnitOfWork): self._test_repo = test_repo self._uow = uow def execute(self, test_id: UUID, user_id: UUID) -> TestDTO: with self._uow: test = self._test_repo.find_by_id(test_id) if not test: raise EntityNotFoundError("Test", test_id) events = test.start_execution(user) self._test_repo.save(test) self._uow.commit() # events are published after commit return TestDTO.from_entity(test) ``` #### Presentation Layer (Slim Routers) ```python # presentation/api/v1/tests.py — HTTP concerns only @router.post("/{test_id}/start-execution") def start_execution( test_id: UUID, use_case: StartExecutionUseCase = Depends(get_start_execution_use_case), current_user: User = Depends(get_current_user), ): try: result = use_case.execute(test_id, current_user.id) return result except EntityNotFoundError: raise HTTPException(404) except InvalidTransitionError as e: raise HTTPException(400, detail=str(e)) ``` #### Infrastructure Layer (Implementations) ```python # infrastructure/persistence/repositories/sqlalchemy_test_repository.py class SQLAlchemyTestRepository(TestRepository): def __init__(self, session: Session): self._session = session def find_by_id(self, test_id: UUID) -> TestEntity | None: model = self._session.query(TestModel).filter(TestModel.id == test_id).first() return TestMapper.to_entity(model) if model else None def save(self, test: TestEntity) -> None: model = TestMapper.to_model(test) self._session.merge(model) ``` ### 6.4. Incremental Migration Plan (Phases) **The refactor must be incremental — not big bang.** Each phase delivers value and the system continues working. #### Phase 1: Foundations (1-2 weeks) 1. Create the directory structure: `domain/`, `application/`, `infrastructure/`, `presentation/`. 2. Create `domain/exceptions.py` with domain exceptions. 3. Create `error_handler.py` middleware that maps domain exceptions → HTTP responses. 4. Create `domain/ports/repositories/` with Protocol interfaces for the 3-4 most used entities (Test, Technique, Campaign). 5. Create SQLAlchemy implementations of these repositories. 6. **Do not move routers yet.** #### Phase 2: Extract the Test Domain (1-2 weeks) 1. Create `domain/entities/test.py` with the state machine (extract from `test_workflow_service`). 2. Create use cases for each state transition. 3. Migrate the `tests.py` router to use the use cases. 4. Remove `HTTPException` from `test_workflow_service`. 5. **Pure unit tests** for the domain entity (no DB). #### Phase 3: Extract Fat Services from Routers (2-3 weeks) 1. Move `heatmap.py` logic to `application/use_cases/heatmap/`. 2. Move `reports.py` logic to `application/use_cases/reports/`. 3. Move `metrics.py` logic to application services. 4. Routers become thin controllers (< 20 lines per endpoint). #### Phase 4: Complete Repository Pattern (1-2 weeks) 1. Create repositories for all remaining entities. 2. Migrate scattered queries from routers to repositories. 3. Remove `db.query(...)` from any file outside `infrastructure/`. #### Phase 5: Robust Infrastructure (1-2 weeks) 1. Move token blacklist to Redis. 2. Implement the Unit of Work pattern. 3. Move scoring config to the database (not mutable `settings`). 4. Add event bus for domain events (notifications, auditing). #### Phase 6: CI/CD and Observability 1. Set up GitHub Actions (lint, type check, tests). 2. Add structured logging. 3. Add improved health checks. --- ## 7. Executive Summary ### Current Strengths | Strength | Detail | |----------|--------| | Well-modeled domain | The data model covers ATT&CK, D3FEND, compliance, threat actors, and campaigns comprehensively | | Solid test workflow | The state machine in `test_workflow_service` is the best designed component | | Clean frontend | API/pages/components separation with TanStack Query is correct | | Secure auth | HttpOnly cookies + RBAC with 6 well-defined roles | | Import services | The 8 import services are well encapsulated | | Existing tests | 18 test files with fixtures — a foundation to build upon | ### Critical Weaknesses (Updated Feb 19) | Weakness | Original Severity | Current Status | |----------|----------|--------| | Fat controllers (routers with business logic) | HIGH | ✅ Resolved — 9 routers extracted to services | | No repository layer | HIGH | ✅ Resolved (Test, Technique repos + 9 service modules) | | Services depend on FastAPI | HIGH | ✅ Resolved (domain exceptions + middleware) | | Anemic models | MEDIUM | ✅ Partially resolved (TestEntity, TechniqueEntity) | | In-memory token blacklist | HIGH | ✅ Resolved (Redis-backed) | | Mutable settings at runtime | MEDIUM | ✅ Resolved (scoring_config DB table) | | No CI/CD | MEDIUM | ✅ Resolved (GitHub Actions) | | No dependency inversion | HIGH | ✅ Partially resolved (ports + repos + services) | | No structured logging | LOW | ✅ Resolved (JSON logging for production) | ### Final Classification ``` ┌──────────────────────────────────────────────────────────┐ │ Type: Clean Modular Monolith │ │ Maturity: Production-ready │ │ SOLID: 4/5 (SRP ✅, OCP partial, LSP n/a, │ │ ISP partial, DIP ✅ started) │ │ Testability: 7/10 (326 tests, domain unit tests, repo │ │ integration tests, service layer tests) │ │ Coupling: 7/10 (domain decoupled, services agnostic, │ │ most routers are thin adapters) │ │ Cohesion: 8/10 (domain entities own business rules, │ │ services own query logic) │ │ Estimated remaining tech debt: ~1 week │ │ (heatmap extraction, remaining minor routers, │ │ Campaign/ComplianceFramework domain entities) │ └──────────────────────────────────────────────────────────┘ ``` ### Recommendation (Updated Feb 19) The architectural refactoring is substantially complete. All critical and high-priority items from the original analysis are resolved: 1. ~~Extract domain exceptions~~ ✅ Done 2. ~~Create repositories for Test and Technique~~ ✅ Done 3. ~~Move token blacklist to Redis~~ ✅ Done 4. ~~Set up basic CI/CD~~ ✅ Done 5. ~~Migrate fat routers to services~~ ✅ Done (9 routers extracted) 6. ~~Persist scoring weights in database~~ ✅ Done 7. ~~Add structured JSON logging~~ ✅ Done **Remaining low-priority items:** 1. Extract remaining logic from `heatmap.py` to `heatmap_service.py` 2. Create domain entities for Campaign and ComplianceFramework 3. Extract `users.py`, `audit.py`, `data_sources.py` to services (simple CRUD) 4. Add common interface for import services (OCP improvement)