# Aegis — Target Architecture: Clean Modular Monolith
> **Author:** Architecture review
> **Date:** February 11, 2026 (updated February 18, 2026)
> **Status:** In Progress — foundational layers implemented
> **Depends on:** ARCHITECTURAL_ANALYSIS.md, DEPENDENCY_ANALYSIS.md, TECH_DEBT_AND_RISKS.md
>
> **Implementation Progress (Feb 18, 2026):**
> - ✅ Domain exceptions hierarchy (`domain/errors.py`, `domain/exceptions.py`)
> - ✅ Error handler middleware (`middleware/error_handler.py`)
> - ✅ TestEntity with full state machine (`domain/test_entity.py`)
> - ✅ TechniqueEntity with status recalculation (`domain/entities/technique.py`)
> - ✅ Value objects: MitreId, ScoringWeights (`domain/value_objects/`)
> - ✅ Repository ports/protocols (`domain/ports/repositories/`)
> - ✅ SQLAlchemy repository implementations (`infrastructure/persistence/repositories/`)
> - ✅ ORM-Entity mappers (`infrastructure/persistence/mappers/`)
> - ✅ FastAPI dependency wiring (`dependencies/repositories.py`)
> - ✅ Unit of Work (`domain/unit_of_work.py`)
> - ✅ Redis-backed token blacklist (`infrastructure/redis_client.py`)
> - ✅ CI pipeline (`.github/workflows/ci.yml`)
> - ✅ 326 tests passing (domain unit tests + integration tests + API tests)
> - ✅ Architecture rules file (`.cursor/rules/aegis-architecture.md`)
>
> **Remaining:** Application layer use cases, Campaign/Compliance domain entities, router migration to repositories, scoring config persistence, structured logging.
---
## Table of Contents
1. [Target Architecture Overview](#1-target-architecture-overview)
2. [Layer Definitions and Responsibilities](#2-layer-definitions-and-responsibilities)
3. [Module Boundaries](#3-module-boundaries)
4. [Dependency Rules](#4-dependency-rules)
5. [Top 5 Modules to Refactor First](#5-top-5-modules-to-refactor-first)
6. [Repository Pattern for Technique](#6-repository-pattern-for-technique)
---
## 1. Target Architecture Overview
### Design Philosophy
The target architecture applies Clean Architecture principles to a modular monolith. This is not a microservices migration — it is an internal reorganization of the existing codebase to enforce separation of concerns, dependency inversion, and testability while maintaining a single deployable unit.
### Target Directory Structure
```
backend/
└── app/
    ├── main.py                               # FastAPI app bootstrap (minimal)
    ├── config.py                             # Pydantic Settings (read-only)
    ├── domain/                               # ★ DOMAIN LAYER
    │   ├── __init__.py
    │   │
    │   ├── enums.py                          # TechniqueStatus, TestState, TeamSide, TestResult
    │   │                                     # (moved from models/enums.py — these are domain concepts)
    │   │
    │   ├── exceptions.py                     # Domain exception hierarchy
    │   │                                     #   EntityNotFoundError
    │   │                                     #   DuplicateEntityError
    │   │                                     #   InvalidTransitionError
    │   │                                     #   InvalidOperationError
    │   │                                     #   AuthorizationError
    │   │
    │   ├── events.py                         # Domain event definitions (data classes)
    │   │                                     #   TestStateChanged, TechniqueStatusRecalculated,
    │   │                                     #   CampaignCompleted, EvidenceUploaded
    │   │
    │   ├── entities/                         # Rich domain entities with behavior
    │   │   ├── __init__.py
    │   │   ├── technique.py                  # TechniqueEntity: recalculate_status(), mark_reviewed()
    │   │   ├── test.py                       # TestEntity: can_transition(), start_execution(),
    │   │   │                                 #   submit_red(), submit_blue(), validate(), reopen()
    │   │   ├── campaign.py                   # CampaignEntity: add_test(), remove_test(), activate(),
    │   │   │                                 #   complete(), has_circular_dependency()
    │   │   ├── user.py                       # UserEntity: has_role(), can_access()
    │   │   ├── detection_rule.py             # DetectionRuleEntity
    │   │   ├── threat_actor.py               # ThreatActorEntity
    │   │   └── evidence.py                   # EvidenceEntity: validate_upload_permission()
    │   │
    │   ├── value_objects/                    # Immutable, equality-by-value
    │   │   ├── __init__.py
    │   │   ├── mitre_id.py                   # MitreId: validated format (T1059, T1059.001)
    │   │   ├── score.py                      # TechniqueScore, TacticScore, OrgScore (with breakdown)
    │   │   └── scoring_weights.py            # ScoringWeights: validated weight set (sum == 100)
    │   │
    │   └── ports/                            # ★ INTERFACES — the contracts
    │       ├── __init__.py
    │       ├── repositories/                 # Data access contracts (one per aggregate root)
    │       │   ├── __init__.py
    │       │   ├── technique_repository.py   # TechniqueRepository protocol
    │       │   ├── test_repository.py        # TestRepository protocol
    │       │   ├── campaign_repository.py    # CampaignRepository protocol
    │       │   ├── user_repository.py        # UserRepository protocol
    │       │   ├── detection_rule_repository.py
    │       │   ├── threat_actor_repository.py
    │       │   ├── evidence_repository.py
    │       │   ├── audit_repository.py
    │       │   ├── notification_repository.py
    │       │   └── snapshot_repository.py
    │       │
    │       └── services/                     # External capability contracts
    │           ├── __init__.py
    │           ├── storage_port.py           # StoragePort: upload_file(), get_download_url()
    │           ├── event_publisher_port.py   # EventPublisherPort: publish(DomainEvent)
    │           └── token_blacklist_port.py   # TokenBlacklistPort: revoke(), is_revoked()
    ├── application/                          # ★ APPLICATION LAYER
    │   ├── __init__.py
    │   │
    │   ├── interfaces/                       # Application-level contracts
    │   │   ├── __init__.py
    │   │   └── unit_of_work.py               # UnitOfWork protocol: commit(), rollback(), __enter__/__exit__
    │   │
    │   ├── dto/                              # Input/output data structures for use cases
    │   │   ├── __init__.py                   # Pure data classes — no ORM, no Pydantic
    │   │   ├── technique_dto.py              # TechniqueListFilters, TechniqueResult, TechniqueDetail
    │   │   ├── test_dto.py                   # CreateTestInput, TestResult, TestTimeline
    │   │   ├── scoring_dto.py                # ScoreRequest, ScoreResult, ScoreHistoryResult
    │   │   ├── heatmap_dto.py                # HeatmapFilters, HeatmapLayer, NavigatorExport
    │   │   ├── report_dto.py                 # CoverageReportResult, CsvExportResult
    │   │   └── campaign_dto.py               # CreateCampaignInput, CampaignProgress
    │   │
    │   └── use_cases/                        # Orchestrators — one class per operation
    │       ├── __init__.py
    │       │
    │       ├── techniques/
    │       │   ├── list_techniques.py        # ListTechniquesUseCase
    │       │   ├── get_technique.py          # GetTechniqueUseCase
    │       │   ├── create_technique.py       # CreateTechniqueUseCase
    │       │   ├── update_technique.py       # UpdateTechniqueUseCase
    │       │   └── review_technique.py       # ReviewTechniqueUseCase
    │       │
    │       ├── tests/
    │       │   ├── create_test.py            # CreateTestUseCase
    │       │   ├── create_from_template.py   # CreateFromTemplateUseCase
    │       │   ├── start_execution.py        # StartExecutionUseCase
    │       │   ├── submit_red.py             # SubmitRedUseCase
    │       │   ├── submit_blue.py            # SubmitBlueUseCase
    │       │   ├── validate_test.py          # ValidateTestUseCase
    │       │   ├── reopen_test.py            # ReopenTestUseCase
    │       │   └── get_retest_chain.py       # GetRetestChainUseCase
    │       │
    │       ├── scoring/
    │       │   ├── calculate_technique_score.py
    │       │   ├── calculate_tactic_score.py
    │       │   ├── calculate_org_score.py
    │       │   └── update_scoring_weights.py
    │       │
    │       ├── heatmap/
    │       │   ├── generate_coverage_layer.py
    │       │   ├── generate_actor_layer.py
    │       │   ├── generate_detection_layer.py
    │       │   └── export_navigator.py
    │       │
    │       ├── reports/
    │       │   ├── generate_coverage_report.py
    │       │   ├── generate_test_results_report.py
    │       │   ├── generate_remediation_report.py
    │       │   └── export_coverage_csv.py
    │       │
    │       └── campaigns/
    │           ├── create_campaign.py
    │           ├── manage_campaign_tests.py
    │           ├── activate_campaign.py
    │           ├── generate_from_threat_actor.py
    │           └── schedule_recurring.py
    ├── infrastructure/                       # ★ INFRASTRUCTURE LAYER
    │   ├── __init__.py
    │   │
    │   ├── persistence/
    │   │   ├── __init__.py
    │   │   ├── database.py                   # Engine, SessionLocal, get_db() — unchanged
    │   │   │
    │   │   ├── orm/                          # SQLAlchemy models (table mapping ONLY)
    │   │   │   ├── __init__.py               # Re-export all models for Alembic
    │   │   │   ├── base.py                   # declarative_base()
    │   │   │   ├── technique_model.py        # Current models/technique.py — unchanged
    │   │   │   ├── test_model.py             # Current models/test.py — unchanged
    │   │   │   ├── campaign_model.py
    │   │   │   ├── user_model.py
    │   │   │   └── ...                       # All 18 current models, untouched
    │   │   │
    │   │   ├── repositories/                 # Concrete repository implementations
    │   │   │   ├── __init__.py
    │   │   │   ├── sa_technique_repository.py
    │   │   │   ├── sa_test_repository.py
    │   │   │   ├── sa_campaign_repository.py
    │   │   │   └── ...                       # One per domain port
    │   │   │
    │   │   ├── unit_of_work.py               # SQLAlchemy UoW (wraps Session commit/rollback)
    │   │   │
    │   │   └── mappers/                      # ORM Model ↔ Domain Entity converters
    │   │       ├── __init__.py
    │   │       ├── technique_mapper.py       # to_entity(model) → TechniqueEntity
    │   │       │                             # to_model(entity) → TechniqueORM
    │   │       ├── test_mapper.py
    │   │       └── ...
    │   │
    │   ├── storage/
    │   │   └── minio_storage.py              # Implements StoragePort (current storage.py logic)
    │   │
    │   ├── auth/
    │   │   ├── jwt_service.py                # Token creation and verification
    │   │   └── redis_token_blacklist.py      # Implements TokenBlacklistPort
    │   │
    │   ├── external/                         # External data source adapters
    │   │   ├── mitre_taxii_adapter.py        # Current mitre_sync_service.py
    │   │   ├── atomic_red_team_adapter.py    # Current atomic_import_service.py
    │   │   ├── sigma_adapter.py
    │   │   ├── elastic_adapter.py
    │   │   ├── caldera_adapter.py
    │   │   ├── d3fend_adapter.py
    │   │   ├── lolbas_adapter.py
    │   │   └── threat_actor_adapter.py
    │   │
    │   ├── events/
    │   │   └── sync_event_publisher.py       # Implements EventPublisherPort (in-process dispatch)
    │   │
    │   ├── cache/
    │   │   └── redis_score_cache.py          # Replaces current in-memory score_cache.py
    │   │
    │   └── jobs/
    │       └── scheduler.py                  # APScheduler setup (current mitre_sync_job.py)
    └── presentation/                         # ★ PRESENTATION LAYER
        ├── __init__.py
        ├── api/
        │   └── v1/                           # Thin routers — HTTP mapping only
        │       ├── __init__.py
        │       ├── techniques.py             # Injects use case via Depends(), maps exceptions
        │       ├── tests.py
        │       ├── campaigns.py
        │       ├── heatmap.py
        │       ├── reports.py
        │       ├── scores.py
        │       ├── metrics.py
        │       └── ...                       # All 21 current routers, thinned
        ├── schemas/                          # Pydantic models (request/response shapes)
        │   ├── __init__.py                   # Current schemas/ — unchanged
        │   ├── technique_schema.py
        │   ├── test_schema.py
        │   └── ...
        ├── dependencies/                     # FastAPI Depends() wiring
        │   ├── __init__.py
        │   ├── auth.py                       # Current dependencies/auth.py
        │   ├── repositories.py               # get_technique_repo(), get_test_repo(), ...
        │   └── use_cases.py                  # get_create_technique_use_case(), ...
        ├── middleware/
        │   ├── error_handler.py              # Maps domain exceptions → HTTP responses
        │   └── rate_limiter.py
        └── mappers/                          # Pydantic schema ↔ application DTO converters
            ├── __init__.py
            ├── technique_mapper.py           # TechniqueCreate → CreateTechniqueInput
            │                                 # TechniqueResult → TechniqueOut
            └── ...
```
---
## 2. Layer Definitions and Responsibilities
### Domain Layer — The Core
```
Depends on: NOTHING (zero imports from outside domain/)
```
| Component | Responsibility | What It Must NOT Do |
|-----------|---------------|---------------------|
| **Entities** | Encapsulate business rules, invariants, and state transitions. A `TestEntity` knows which transitions are valid. A `TechniqueEntity` can recalculate its own status from a list of test results. | Import SQLAlchemy, FastAPI, Pydantic, or any framework. Access the database. Make HTTP calls. |
| **Value Objects** | Represent domain concepts with value equality. `MitreId("T1059.001")` validates format on construction. `ScoringWeights` ensures the 5 weights sum to 100. | Be mutable. Have identity (no primary key). |
| **Enums** | Define domain vocabularies: `TechniqueStatus`, `TestState`, `TeamSide`, `TestResult`. | Change based on infrastructure (these are the same enums currently in `models/enums.py`). |
| **Exceptions** | Domain-specific error conditions. `InvalidTransitionError(current=draft, target=validated)`. | Reference HTTP status codes. Know about FastAPI. |
| **Events** | Facts about things that happened. `TestStateChanged(test_id, old_state, new_state, user_id, timestamp)`. | Carry behavior. Know how they will be handled. |
| **Ports** | Interfaces (Protocol) defining what the domain needs from the outside world. `TechniqueRepository`, `StoragePort`, `EventPublisherPort`. | Contain implementations. Reference concrete classes. |
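
To make the value-object rules in the table concrete, here is a minimal sketch of `MitreId`. The accepted pattern (`T` plus four digits, optionally followed by a three-digit sub-technique suffix) is an assumption inferred from the `T1059` and `T1059.001` examples; the real validation in `domain/value_objects/mitre_id.py` may differ.

```python
import re
from dataclasses import dataclass

# Assumed pattern: "T" + 4 digits, optional "." + 3 digits (sub-technique).
_MITRE_ID_PATTERN = re.compile(r"T\d{4}(\.\d{3})?")


@dataclass(frozen=True)
class MitreId:
    """Immutable value object: no identity, equality is by value."""

    value: str

    def __post_init__(self) -> None:
        # Validate on construction so an invalid MitreId can never exist.
        if not _MITRE_ID_PATTERN.fullmatch(self.value):
            raise ValueError(f"Invalid MITRE ATT&CK ID: {self.value!r}")

    @property
    def is_subtechnique(self) -> bool:
        return "." in self.value
```

Because the dataclass is frozen, two instances built from the same string compare equal and attribute assignment raises an error, which covers both constraints in the "What It Must NOT Do" column.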
### Application Layer — The Orchestrators
```
Depends on: domain/ only
```
| Component | Responsibility | What It Must NOT Do |
|-----------|---------------|---------------------|
| **Use Cases** | Orchestrate a single business operation by calling domain entities and ports. `CreateTechniqueUseCase` validates uniqueness via `TechniqueRepository`, constructs a `TechniqueEntity`, saves it, and publishes an event. | Know about HTTP, Pydantic, SQLAlchemy, or FastAPI. Contain business rules (those belong in entities). Contain queries (those belong in repositories). |
| **DTOs** | Plain data containers for use case input/output. No validation logic, no ORM awareness. | Inherit from Pydantic `BaseModel`. Reference ORM models. |
| **Unit of Work** | Interface for transaction boundaries. Use cases call `uow.commit()` or `uow.rollback()`. | Know about SQLAlchemy sessions. |
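
The use case described in the table could look roughly like the sketch below. It is self-contained for illustration only: `TechniqueEntity`, `CreateTechniqueInput`, and the two protocols are simplified stand-ins rather than the real Aegis classes, and event publishing is left out to keep the example short.

```python
from dataclasses import dataclass
from typing import Protocol


class DuplicateEntityError(Exception):
    """Stand-in for the domain exception listed in domain/exceptions.py."""


@dataclass
class TechniqueEntity:  # simplified stand-in for the real entity
    mitre_id: str
    name: str


@dataclass(frozen=True)
class CreateTechniqueInput:  # plain DTO: no Pydantic, no ORM
    mitre_id: str
    name: str


class TechniqueRepository(Protocol):
    def exists_by_mitre_id(self, mitre_id: str) -> bool: ...
    def save(self, technique: TechniqueEntity) -> TechniqueEntity: ...


class UnitOfWork(Protocol):
    def commit(self) -> None: ...
    def rollback(self) -> None: ...


class CreateTechniqueUseCase:
    """Orchestrates one operation; holds no SQL and no HTTP knowledge."""

    def __init__(self, repo: TechniqueRepository, uow: UnitOfWork) -> None:
        self._repo = repo
        self._uow = uow

    def execute(self, data: CreateTechniqueInput) -> TechniqueEntity:
        # Uniqueness is checked through the port, not through a query.
        if self._repo.exists_by_mitre_id(data.mitre_id):
            raise DuplicateEntityError(data.mitre_id)
        try:
            saved = self._repo.save(TechniqueEntity(data.mitre_id, data.name))
            self._uow.commit()
        except Exception:
            self._uow.rollback()
            raise
        return saved
```

Since the use case depends only on protocols, a unit test can pass in-memory fakes for both collaborators and never touch a database session.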
### Infrastructure Layer — The Implementations
```
Depends on: domain/ (implements ports), application/ (implements UoW)
```
| Component | Responsibility | What It Must NOT Do |
|-----------|---------------|---------------------|
| **ORM Models** | Map Python classes to database tables. Unchanged from current `models/`. | Contain business logic. Be passed outside the infrastructure layer (use mappers to convert to domain entities). |
| **Repositories** | Implement port interfaces using SQLAlchemy. `SATechniqueRepository.find_by_mitre_id()` translates to `db.query(Technique).filter(...)`. | Be called by anything outside the application layer. Contain business decisions. |
| **Mappers** | Convert between ORM models and domain entities. `TechniqueMapper.to_entity(orm_model) → TechniqueEntity`. | Contain business logic. Be a 1:1 field copy (they handle relationship loading and value object construction). |
| **External Adapters** | Implement data source integrations. Download ZIPs, parse YAML/TOML/STIX, return domain-compatible data. | Be called from routers directly. Know about HTTP responses. |
| **Storage, Cache, Auth** | Implement service ports. `MinioStorage` implements `StoragePort`. `RedisTokenBlacklist` implements `TokenBlacklistPort`. | Leak implementation details (Redis keys, S3 bucket names) outside the infrastructure layer. |
### Presentation Layer — The HTTP Boundary
```
Depends on: application/ (calls use cases), domain/ (reads exceptions)
```
| Component | Responsibility | What It Must NOT Do |
|-----------|---------------|---------------------|
| **Routers** | Map HTTP requests to use case calls. Parse path/query/body parameters, call the use case, return the response. 10-20 lines per endpoint maximum. | Contain business logic. Execute database queries. Build complex data structures. |
| **Schemas** | Pydantic models for HTTP request/response validation. Unchanged from current `schemas/`. | Be used inside use cases or domain entities. |
| **Dependencies** | Wire use cases via FastAPI `Depends()`. Construct repositories, inject into use cases, return. | Contain logic beyond wiring. |
| **Error Handler** | Map domain exceptions to HTTP responses. `EntityNotFoundError → 404`, `InvalidTransitionError → 400`, `AuthorizationError → 403`. | Know about business rules. |
| **Mappers** | Convert between Pydantic schemas and application DTOs. | Contain business logic. |
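
The exception-to-status mapping in the Error Handler row can be kept as a plain lookup table. The sketch below is framework-free on purpose: `DomainError` is a hypothetical common base class, `to_http_status` is an illustrative helper name, and the FastAPI registration that would call it is omitted. Only the three mappings named in the table are included.

```python
from __future__ import annotations


class DomainError(Exception):
    """Hypothetical common base for the domain exception hierarchy."""


class EntityNotFoundError(DomainError): ...


class InvalidTransitionError(DomainError): ...


class AuthorizationError(DomainError): ...


# Only the mappings stated in the table; anything else falls through to 500.
_STATUS_BY_EXCEPTION: dict[type, int] = {
    EntityNotFoundError: 404,
    InvalidTransitionError: 400,
    AuthorizationError: 403,
}


def to_http_status(exc: Exception) -> int:
    """Walk the exception's MRO so subclasses inherit their parent's mapping."""
    for cls in type(exc).__mro__:
        if cls in _STATUS_BY_EXCEPTION:
            return _STATUS_BY_EXCEPTION[cls]
    return 500
```

Keeping the table in the presentation layer means the domain exceptions themselves never need to reference HTTP status codes.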
---
## 3. Module Boundaries
The monolith is organized into domain modules. Each module owns its entities, repositories, and use cases. Cross-module communication goes through application-layer use cases or domain events — never through direct repository access.
```
┌─────────────────────────────────────────────────────────────────┐
│                         Domain Modules                          │
│                                                                 │
│  ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌─────────────┐   │
│  │ Technique │  │   Test    │  │ Campaign  │  │   Scoring   │   │
│  │           │  │           │  │           │  │             │   │
│  │ entity    │  │ entity    │  │ entity    │  │ value objs  │   │
│  │ repo port │  │ repo port │  │ repo port │  │ use cases   │   │
│  │ use cases │  │ use cases │  │ use cases │  │ (reads from │   │
│  │           │  │           │  │           │  │ other repos)│   │
│  └─────┬─────┘  └─────┬─────┘  └─────┬─────┘  └──────┬──────┘   │
│        │              │              │               │          │
│  ┌─────┴──────────────┴──────────────┴───────────────┴──────┐   │
│  │         Shared Domain: enums, exceptions, events         │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                 │
│  ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌─────────────┐   │
│  │  Heatmap  │  │  Reports  │  │Compliance │  │ Threat Intel│   │
│  │           │  │           │  │           │  │             │   │
│  │ use cases │  │ use cases │  │ use cases │  │ adapters    │   │
│  │ (reads    │  │ (reads    │  │ (reads    │  │ use cases   │   │
│  │  repos)   │  │  repos)   │  │  repos)   │  │             │   │
│  └───────────┘  └───────────┘  └───────────┘  └─────────────┘   │
└─────────────────────────────────────────────────────────────────┘
```
**Cross-module rule:** A use case in the Scoring module may read from `TechniqueRepository` and `TestRepository` (both defined as ports in the domain layer). It must NOT import the SQLAlchemy model directly.
---
## 4. Dependency Rules
```
┌──────────────────┐
│   Presentation   │  Knows: FastAPI, Pydantic, HTTP
│   (routers,      │  Depends on: Application, Domain
│    schemas)      │
└────────┬─────────┘
         │ calls use cases
┌────────▼─────────┐
│   Application    │  Knows: Domain entities, ports, DTOs
│   (use cases)    │  Depends on: Domain ONLY
└────────┬─────────┘
         │ uses entities + ports
┌────────▼─────────┐
│     Domain       │  Knows: NOTHING external
│   (entities,     │  Depends on: NOTHING
│    ports, enums) │  (this is the core)
└────────▲─────────┘
         │ implements ports
┌────────┴─────────┐
│  Infrastructure  │  Knows: SQLAlchemy, boto3, Redis, requests
│  (repositories,  │  Depends on: Domain (ports), Application (UoW)
│   adapters)      │
└──────────────────┘
```
### Import Rules (Enforceable by Linting)
| From \ To | domain/ | application/ | infrastructure/ | presentation/ |
|-----------|---------|-------------|----------------|--------------|
| **domain/** | Self only | FORBIDDEN | FORBIDDEN | FORBIDDEN |
| **application/** | ALLOWED | Self only | FORBIDDEN | FORBIDDEN |
| **infrastructure/** | ALLOWED (ports) | ALLOWED (UoW) | Self only | FORBIDDEN |
| **presentation/** | ALLOWED (exceptions) | ALLOWED (use cases, DTOs) | ALLOWED (wiring only, in dependencies/) | Self only |
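
One way the import rules could be enforced is a small AST-based check run in CI. The following is a sketch under stated assumptions, not the project's actual linter: the `FORBIDDEN` table is transcribed from the matrix above, `find_violations` is a hypothetical helper, the top-level package is assumed to be `app`, relative imports are ignored, and the "wiring only, in dependencies/" exception for the presentation layer is not modeled.

```python
from __future__ import annotations

import ast

# Layers that each layer may NOT import from (see the matrix above).
FORBIDDEN: dict[str, set[str]] = {
    "domain": {"application", "infrastructure", "presentation"},
    "application": {"infrastructure", "presentation"},
    "infrastructure": {"presentation"},
    "presentation": set(),
}


def find_violations(layer: str, source: str, package: str = "app") -> list[str]:
    """Return the forbidden modules imported by `source`, a file in `layer`."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and node.module:
            modules = [node.module]
        elif isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        else:
            continue
        for module in modules:
            parts = module.split(".")
            # Only absolute imports of the form "app.<layer>..." are checked.
            if len(parts) > 1 and parts[0] == package and parts[1] in FORBIDDEN[layer]:
                violations.append(module)
    return violations
```

A CI step could walk `backend/app/<layer>/` and fail the build if any file returns a non-empty list.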
---
## 5. Top 5 Modules to Refactor First
### Selection Criteria
Each module is scored on three axes from the DEPENDENCY_ANALYSIS.md findings:
| Axis | Weight | Measurement |
|------|--------|-------------|
| **Complexity** | 35% | Lines of code, number of DB operations, number of models imported, number of concerns mixed |
| **Technical Risk** | 35% | N+1 queries, security issues, silent exception swallowing, framework coupling, scalability bottleneck |
| **Business Impact** | 30% | Centrality to the domain (how many other modules depend on it), user-facing frequency, correctness criticality |
---
### #1: Test Workflow Module
**Refactor scope:** `routers/tests.py` (664 lines, 30 db ops) + `services/test_workflow_service.py` (456 lines, 13 db ops) + `services/status_service.py` (47 lines)
| Axis | Score | Evidence |
|------|-------|----------|
| Complexity | **10/10** | 664-line router with 15+ endpoints. Mixes CRUD, template instantiation, timeline queries, and workflow delegation. The workflow service itself is 456 lines with a state machine, notifications, and audit logging. |
| Technical Risk | **10/10** | `test_workflow_service` imports `FastAPI.HTTPException` — the most severe framework coupling in the codebase. 4 `except Exception: pass` blocks silently swallow notification failures. No way to unit test the state machine without a database session. |
| Business Impact | **10/10** | The Red/Blue validation workflow IS the core product. Every user role interacts with tests daily. A state transition bug could invalidate an entire assessment. 5 other modules depend on test data (scoring, heatmap, reports, metrics, campaigns). |
**Why first:** This module contains the most important business logic in Aegis (the test state machine), yet it has the most severe coupling problems (`HTTPException` in domain logic, swallowed exceptions). Extracting a `TestEntity` that owns the state machine unlocks pure unit testing of the most critical business rules.
**What to extract:**
- `TestEntity` with `can_transition()`, `start_execution()`, `submit_red()`, `submit_blue()`, `validate()`, `reopen()` → `domain/entities/test.py`
- `InvalidTransitionError`, `EntityNotFoundError` → `domain/exceptions.py`
- `TestRepository` protocol → `domain/ports/repositories/test_repository.py`
- One use case per state transition → `application/use_cases/tests/`
- Remove all `HTTPException` from services
- Replace `except Exception: pass` with event-based notification dispatch
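
To illustrate what "state machine as a domain object" means in practice, here is a minimal sketch of a `TestEntity`. The method names come from the list above; the intermediate state names and the transition table are assumptions made for this example (only `draft` and `validated` are named elsewhere in this document), so the real workflow may allow different transitions.

```python
from dataclasses import dataclass
from enum import Enum


class TestState(str, Enum):
    # Only DRAFT and VALIDATED appear in this document; the intermediate
    # states are hypothetical.
    DRAFT = "draft"
    IN_EXECUTION = "in_execution"
    RED_SUBMITTED = "red_submitted"
    BLUE_SUBMITTED = "blue_submitted"
    VALIDATED = "validated"


class InvalidTransitionError(Exception):
    def __init__(self, current: TestState, target: TestState) -> None:
        super().__init__(f"Cannot transition from {current.value} to {target.value}")
        self.current = current
        self.target = target


# Assumed transition table: a linear Red -> Blue flow plus reopen.
_ALLOWED = {
    TestState.DRAFT: {TestState.IN_EXECUTION},
    TestState.IN_EXECUTION: {TestState.RED_SUBMITTED},
    TestState.RED_SUBMITTED: {TestState.BLUE_SUBMITTED},
    TestState.BLUE_SUBMITTED: {TestState.VALIDATED},
    TestState.VALIDATED: {TestState.IN_EXECUTION},
}


@dataclass
class TestEntity:
    state: TestState = TestState.DRAFT

    def can_transition(self, target: TestState) -> bool:
        return target in _ALLOWED[self.state]

    def _transition(self, target: TestState) -> None:
        if not self.can_transition(target):
            raise InvalidTransitionError(self.state, target)
        self.state = target

    def start_execution(self) -> None:
        self._transition(TestState.IN_EXECUTION)

    def submit_red(self) -> None:
        self._transition(TestState.RED_SUBMITTED)

    def submit_blue(self) -> None:
        self._transition(TestState.BLUE_SUBMITTED)

    def validate(self) -> None:
        self._transition(TestState.VALIDATED)

    def reopen(self) -> None:
        self._transition(TestState.IN_EXECUTION)
```

Nothing here imports FastAPI or SQLAlchemy, so every transition rule can be exercised in a plain unit test; mapping `InvalidTransitionError` to an HTTP response is left to the presentation layer.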
---
### #2: Scoring Module
**Refactor scope:** `services/scoring_service.py` (468 lines, 17 db ops) + `services/score_cache.py` + `routers/scores.py` (2 db ops) + `services/operational_metrics_service.py` (21 db ops)
| Axis | Score | Evidence |
|------|-------|----------|
| Complexity | **9/10** | Multi-dimensional scoring algorithm reading from 7 different models. 5 configurable weights. Tactic, actor, and org scores compound technique scores. Operational metrics add MTTD/MTTR calculations with audit log queries. |
| Technical Risk | **9/10** | **SR-001 from risk registry:** Org score generates ~3,500 DB queries (N+1 pattern). Settings mutated at runtime (thread-unsafe). In-memory cache does not scale across workers. Operational metrics N+1 on audit logs adds ~1,000 more queries. |
| Business Impact | **9/10** | Scores drive executive dashboards, compliance reports, and snapshot history. Incorrect scores misrepresent organizational security posture. Because scoring weights are mutable at runtime but never persisted, any weight change is lost on restart. |
**Why second:** Scoring is the second most critical domain concept and the most severe scalability bottleneck. Refactoring it introduces the repository pattern for batch queries and moves scoring weights to a persistent, immutable configuration.
**What to extract:**
- `TechniqueScore`, `TacticScore`, `OrgScore` value objects → `domain/value_objects/score.py`
- `ScoringWeights` value object with validation → `domain/value_objects/scoring_weights.py`
- Scoring algorithm as pure functions operating on domain objects → `application/use_cases/scoring/`
- Batch query methods in repositories → `TechniqueRepository.find_all_with_test_counts()`
- Redis-backed cache → `infrastructure/cache/`
- Persist weights in DB → `ScoringConfigRepository`
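
As a sketch of the first three items, the block below pairs a validated `ScoringWeights` value object with a pure scoring function. The five field names are placeholders: this document states only that there are five weights and that they must sum to 100. `weighted_score` is likewise a hypothetical helper, not the actual Aegis algorithm.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ScoringWeights:
    # Placeholder field names; only "five weights summing to 100" is given.
    tests: int
    detections: int
    validation: int
    recency: int
    criticality: int

    def __post_init__(self) -> None:
        values = [getattr(self, f.name) for f in fields(self)]
        if any(v < 0 for v in values):
            raise ValueError("weights must be non-negative")
        if sum(values) != 100:
            raise ValueError(f"weights must sum to 100, got {sum(values)}")


def weighted_score(components: dict[str, float], weights: ScoringWeights) -> float:
    """Pure function: combine per-dimension scores (0-100) into one score."""
    return sum(
        components[f.name] * getattr(weights, f.name) for f in fields(weights)
    ) / 100
```

Because the weights object is immutable and validated on construction, changing the configuration means building a new instance, which is what makes persisting it through a repository straightforward.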
---
### #3: Heatmap Module
**Refactor scope:** `routers/heatmap.py` (528 lines, 13 db ops, 0 service delegation)
| Axis | Score | Evidence |
|------|-------|----------|
| Complexity | **9/10** | 528 lines in a single router file. Imports 10 models from 6 different domains. Mixes HTTP handling, complex multi-table queries, color mapping algorithms, ATT&CK Navigator JSON serialization, and streaming export — all in one file with zero delegation. |
| Technical Risk | **8/10** | **SR-003 from risk registry:** 1,400+ queries per request (2 per technique × 700). No caching. Full table scan. Every heatmap page load hammers the database. Most-visited view in the platform. |
| Business Impact | **8/10** | The ATT&CK heatmap is the primary visualization — it is the first thing executives see. Navigator export is used for external reporting and audit evidence. Incorrect heatmap data directly impacts security decision-making. |
**Why third:** This is the purest "fat controller" in the codebase — 528 lines of business logic, queries, and serialization with zero abstraction. It is also the most-visited page and the second-worst scalability bottleneck. Extracting it demonstrates the pattern for all other fat routers.
**What to extract:**
- Layer generation logic → `application/use_cases/heatmap/generate_coverage_layer.py` etc.
- Navigator export format → `application/use_cases/heatmap/export_navigator.py`
- Color mapping → `domain/value_objects/` or utility in application layer
- Batch metadata queries → `TechniqueRepository.find_all_with_coverage_metadata()`
- Router reduced from 528 lines to ~80 (5 endpoints × ~15 lines each)
---
### #4: Campaign Module
**Refactor scope:** `routers/campaigns.py` (36 db ops) + `services/campaign_service.py` (10 db ops, imports HTTPException) + `services/campaign_scheduler_service.py` (8 db ops)
| Axis | Score | Evidence |
|------|-------|----------|
| Complexity | **8/10** | Router has 36 db operations — the highest count of any router. Campaign lifecycle spans creation, test management, activation, completion, scheduling, and threat actor generation. Three files with partially overlapping responsibilities. |
| Technical Risk | **7/10** | `campaign_service.py` imports `HTTPException` (framework coupling). Scheduler creates campaigns in background jobs with its own session. Circular dependency detection logic is complex and untested (no campaign router tests exist). |
| Business Impact | **8/10** | Campaigns organize test execution for entire threat actor profiles. A bug in campaign scheduling or circular dependency detection could spawn infinite campaigns or skip critical test coverage. Campaigns drive the operational workflow for Red/Blue leads. |
**Why fourth:** The campaign module has the most scattered responsibilities (36 db ops in router + service + scheduler) and the second instance of HTTPException in a service. It is a natural candidate after tests, scoring, and heatmap because it depends on both test and technique entities, testing the cross-module communication pattern.
**What to extract:**
- `CampaignEntity` with `add_test()`, `activate()`, `complete()`, `has_circular_dependency()` → `domain/entities/campaign.py`
- `CampaignRepository` protocol → `domain/ports/repositories/`
- Use cases for lifecycle operations → `application/use_cases/campaigns/`
- Remove `HTTPException` from `campaign_service.py`
- Campaign scheduling as infrastructure concern → `infrastructure/jobs/`
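
Once the circular dependency check lives in the domain layer it can be tested without a database. The sketch below assumes dependencies are available as a mapping from an item ID to the IDs it depends on; what the nodes represent in Aegis, and how the existing check is written, are not specified here. It uses a standard iterative depth-first search with three node colors.

```python
from __future__ import annotations


def has_circular_dependency(dependencies: dict[str, set[str]]) -> bool:
    """Return True if the dependency graph contains a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {node: WHITE for node in dependencies}

    for root in dependencies:
        if color[root] != WHITE:
            continue
        color[root] = GREY
        stack = [(root, iter(dependencies.get(root, ())))]
        while stack:
            node, children = stack[-1]
            for child in children:
                state = color.get(child, WHITE)
                if state == GREY:
                    return True  # back edge: child is already on the path
                if state == WHITE:
                    color[child] = GREY
                    stack.append((child, iter(dependencies.get(child, ()))))
                    break
            else:
                # All children explored: the node is finished.
                color[node] = BLACK
                stack.pop()
    return False
```

The iterative form avoids Python's recursion limit on long dependency chains, and a diamond-shaped graph (two paths to the same node) is correctly reported as acyclic.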
---
### #5: Reports & Metrics Module
**Refactor scope:** `routers/reports.py` (273 lines, 6 db ops) + `routers/metrics.py` (316 lines, 12 db ops) + `routers/compliance.py` (~350 lines, 13 db ops)
| Axis | Score | Evidence |
|------|-------|----------|
| Complexity | **8/10** | Three routers totaling ~940 lines with zero service delegation. Complex aggregation queries, CSV generation, in-memory data transformation, and compliance gap analysis — all inline in route handlers. |
| Technical Risk | **7/10** | **SR-004 from risk registry:** Reports load unbounded result sets (all techniques, all tests). N+1 per-technique test counts in reports. In-memory aggregation instead of SQL GROUP BY. No streaming for CSV export. Compliance calls `calculate_technique_score()` per technique per control — multiplicative N+1. |
| Business Impact | **7/10** | Reports and metrics are consumed by leads and executives for decision-making. Compliance reports map to regulatory requirements (NIST 800-53, CIS Controls). Incorrect metrics erode trust in the platform. |
**Why fifth:** These three routers share the same anti-pattern (fat controller with inline queries and aggregations) and the same fix (extract to application-layer use cases with repository-backed batch queries). Refactoring them as a group establishes the pattern for the remaining 8 routers that still have direct DB access.
**What to extract:**
- Report generation → `application/use_cases/reports/`
- Metrics calculation → `application/use_cases/metrics/` (or merge with scoring)
- Compliance gap analysis → `application/use_cases/compliance/`
- SQL-level aggregation in repositories → `TechniqueRepository.get_coverage_summary()`
- CSV streaming as infrastructure concern → `infrastructure/export/csv_writer.py`
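
For the CSV item, a generator can emit one encoded row at a time instead of building the whole file in memory. This is a minimal sketch using only the standard library; `stream_csv` is a hypothetical helper name, and wrapping it in a streaming HTTP response is left out.

```python
from __future__ import annotations

import csv
import io
import itertools
from collections.abc import Iterable, Iterator


def stream_csv(header: list[str], rows: Iterable[Iterable[object]]) -> Iterator[str]:
    """Yield CSV text row by row; `rows` is consumed lazily."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    for row in itertools.chain([header], rows):
        writer.writerow(row)
        yield buffer.getvalue()
        # Reset the buffer so memory use stays constant per row.
        buffer.seek(0)
        buffer.truncate(0)
```

Paired with a repository method that yields rows from a server-side cursor, this keeps memory usage flat regardless of how many techniques or tests are exported.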
---
### Refactor Priority Summary
```
Module Complexity Risk Impact Weighted Order
─────────────────────────────────────────────────────────
Test Workflow 10 10 10 10.0 #1
Scoring 9 9 9 9.0 #2
Heatmap 9 8 8 8.4 #3
Campaigns 8 7 8 7.7 #4
Reports & Metrics 8 7 7 7.4 #5
```
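
The Weighted column follows directly from the 35/35/30 axis weights in the selection criteria. The short sketch below recomputes it from the per-axis scores given in the tables above; `Decimal` with half-up rounding is used because binary floats can round a value such as 8.35 downward.

```python
from __future__ import annotations

from decimal import ROUND_HALF_UP, Decimal

# Axis weights in percent, from the Selection Criteria table.
WEIGHTS = {"complexity": 35, "risk": 35, "impact": 30}

SCORES = {
    "Test Workflow": {"complexity": 10, "risk": 10, "impact": 10},
    "Scoring": {"complexity": 9, "risk": 9, "impact": 9},
    "Heatmap": {"complexity": 9, "risk": 8, "impact": 8},
    "Campaigns": {"complexity": 8, "risk": 7, "impact": 8},
    "Reports & Metrics": {"complexity": 8, "risk": 7, "impact": 7},
}


def weighted(scores: dict[str, int]) -> Decimal:
    """Weighted average on the 0-10 scale, rounded half-up to one decimal."""
    total = sum(scores[axis] * weight for axis, weight in WEIGHTS.items())
    return (Decimal(total) / 100).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


for module, scores in SCORES.items():
    print(f"{module:<18} {weighted(scores)}")
```

For example, Heatmap works out to (9 × 35 + 8 × 35 + 8 × 30) / 100 = 8.35, which rounds to the 8.4 shown in the summary.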
---
## 6. Repository Pattern for Technique
This section designs a concrete repository pattern for `Technique` that can be introduced **without breaking existing code**. The strategy is additive: new code uses the repository, old code continues working until incrementally migrated.
### 6.1. Domain Port — The Interface
```python
# domain/ports/repositories/technique_repository.py
from __future__ import annotations

import uuid
from typing import Protocol, runtime_checkable

from app.domain.entities.technique import TechniqueEntity
from app.domain.enums import TechniqueStatus

# NOTE: TechniqueWithCounts is referenced by name only (annotations are not
# evaluated at runtime because of the __future__ import). Importing it from
# infrastructure/ would break the "domain imports nothing" rule, so its
# definition needs to live in the domain layer.


@runtime_checkable
class TechniqueRepository(Protocol):
    """Port defining how the application accesses technique data.

    This is a domain contract — implementations live in infrastructure/.
    The domain layer NEVER imports the implementation.
    """

    # ── Single-entity access ─────────────────────────────────────

    def find_by_id(self, technique_id: uuid.UUID) -> TechniqueEntity | None:
        """Return a technique by primary key, or None."""
        ...

    def find_by_mitre_id(self, mitre_id: str) -> TechniqueEntity | None:
        """Return a technique by its MITRE ATT&CK identifier (e.g. 'T1059.001')."""
        ...

    def find_by_mitre_id_with_tests(self, mitre_id: str) -> TechniqueEntity | None:
        """Return a technique with its tests eagerly loaded."""
        ...

    # ── List access ──────────────────────────────────────────────

    def list_all(
        self,
        *,
        tactic: str | None = None,
        status: TechniqueStatus | None = None,
        review_required: bool | None = None,
    ) -> list[TechniqueEntity]:
        """Return techniques matching the given filters, ordered by mitre_id."""
        ...

    def list_by_tactic(self, tactic: str) -> list[TechniqueEntity]:
        """Return all techniques for a given tactic."""
        ...

    def list_by_ids(self, ids: list[uuid.UUID]) -> list[TechniqueEntity]:
        """Return techniques matching a list of primary keys."""
        ...

    # ── Batch queries (for scoring/heatmap performance) ──────────

    def count_by_status(self) -> dict[TechniqueStatus, int]:
        """Return technique counts grouped by status_global.

        Single SQL query — replaces the per-technique counting pattern.
        """
        ...

    def find_all_with_test_counts(self) -> list[TechniqueWithCounts]:
        """Return all techniques with pre-aggregated test counts and
        detection rule counts. Single query with subqueries — eliminates
        the N+1 pattern in heatmap and scoring.
        """
        ...

    # ── Mutations ────────────────────────────────────────────────

    def save(self, technique: TechniqueEntity) -> TechniqueEntity:
        """Persist a new or updated technique. Returns the saved entity."""
        ...

    def exists_by_mitre_id(self, mitre_id: str) -> bool:
        """Check existence without loading the full entity."""
        ...
```
**Key design decisions:**
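- Uses `typing.Protocol` (structural subtyping) rather than `ABC` — no need for the implementation to explicitly inherit. This is idiomatic Python and works with `isinstance()` checks via `@runtime_checkable`. Note that such checks only verify that the methods exist, not that their signatures match; signature conformance is left to the static type checker.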
- Methods return domain entities (`TechniqueEntity`), never ORM models.
- Batch methods (`count_by_status`, `find_all_with_test_counts`) are designed to eliminate the N+1 patterns identified in SR-001 and SR-003.
- No `Session` parameter — the session is an implementation detail of the SQLAlchemy repository.
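
The structural-subtyping point can be seen in a few lines. The sketch below uses a trimmed-down, hypothetical port (`TechniqueLookup`) and an in-memory fake rather than the full `TechniqueRepository`, to show that a class satisfies the protocol without inheriting from it.

```python
from __future__ import annotations

from typing import Protocol, runtime_checkable


@runtime_checkable
class TechniqueLookup(Protocol):  # hypothetical one-method port
    def exists_by_mitre_id(self, mitre_id: str) -> bool: ...


class InMemoryTechniqueLookup:  # note: does NOT inherit from the protocol
    def __init__(self, known: set[str]) -> None:
        self._known = known

    def exists_by_mitre_id(self, mitre_id: str) -> bool:
        return mitre_id in self._known


class NotARepository:
    pass


fake = InMemoryTechniqueLookup({"T1059"})
print(isinstance(fake, TechniqueLookup))              # True
print(isinstance(NotARepository(), TechniqueLookup))  # False
```

This is what allows use cases to be unit tested against in-memory fakes while production code wires in the SQLAlchemy implementation.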
### 6.2. Infrastructure Implementation — SQLAlchemy
```python
# infrastructure/persistence/repositories/sa_technique_repository.py
import uuid
from typing import NamedTuple

from sqlalchemy import func
from sqlalchemy.orm import Session, joinedload

from app.domain.enums import TechniqueStatus
from app.domain.entities.technique import TechniqueEntity
from app.domain.ports.repositories.technique_repository import TechniqueRepository
from app.infrastructure.persistence.orm.technique_model import Technique
from app.infrastructure.persistence.orm.test_model import Test
from app.infrastructure.persistence.orm.detection_rule_model import DetectionRule
from app.infrastructure.persistence.mappers.technique_mapper import TechniqueMapper


class TechniqueWithCounts(NamedTuple):
    """Pre-aggregated technique data for heatmap/scoring."""
    entity: TechniqueEntity
    test_count: int
    validated_test_count: int
    detection_rule_count: int


class SATechniqueRepository:
    """SQLAlchemy implementation of TechniqueRepository.

    Receives a Session from the Unit of Work; it does NOT create its own.
    Does NOT call commit(); that is the Unit of Work's responsibility.
    """

    def __init__(self, session: Session) -> None:
        self._session = session

    # ── Single-entity access ─────────────────────────────────────

    def find_by_id(self, technique_id: uuid.UUID) -> TechniqueEntity | None:
        model = self._session.query(Technique).filter(
            Technique.id == technique_id
        ).first()
        return TechniqueMapper.to_entity(model) if model else None

    def find_by_mitre_id(self, mitre_id: str) -> TechniqueEntity | None:
        model = self._session.query(Technique).filter(
            Technique.mitre_id == mitre_id
        ).first()
        return TechniqueMapper.to_entity(model) if model else None

    def find_by_mitre_id_with_tests(self, mitre_id: str) -> TechniqueEntity | None:
        model = (
            self._session.query(Technique)
            .options(joinedload(Technique.tests))
            .filter(Technique.mitre_id == mitre_id)
            .first()
        )
        return TechniqueMapper.to_entity_with_tests(model) if model else None

    # ── List access ──────────────────────────────────────────────

    def list_all(
        self,
        *,
        tactic: str | None = None,
        status: TechniqueStatus | None = None,
        review_required: bool | None = None,
    ) -> list[TechniqueEntity]:
        query = self._session.query(Technique)
        if tactic is not None:
            query = query.filter(Technique.tactic == tactic)
        if status is not None:
            query = query.filter(Technique.status_global == status)
        if review_required is not None:
            query = query.filter(Technique.review_required == review_required)
        models = query.order_by(Technique.mitre_id).all()
        return [TechniqueMapper.to_entity(m) for m in models]

    def list_by_tactic(self, tactic: str) -> list[TechniqueEntity]:
        models = (
            self._session.query(Technique)
            .filter(Technique.tactic == tactic)
            .order_by(Technique.mitre_id)
            .all()
        )
        return [TechniqueMapper.to_entity(m) for m in models]

    def list_by_ids(self, ids: list[uuid.UUID]) -> list[TechniqueEntity]:
        models = (
            self._session.query(Technique)
            .filter(Technique.id.in_(ids))
            .all()
        )
        return [TechniqueMapper.to_entity(m) for m in models]

    # ── Batch queries ────────────────────────────────────────────

    def count_by_status(self) -> dict[TechniqueStatus, int]:
        rows = (
            self._session.query(
                Technique.status_global,
                func.count(Technique.id),
            )
            .group_by(Technique.status_global)
            .all()
        )
        # Start every status at 0 so statuses with no rows still appear.
        result = {s: 0 for s in TechniqueStatus}
        for status_val, count in rows:
            result[status_val] = count
        return result

    def find_all_with_test_counts(self) -> list[TechniqueWithCounts]:
        """Single query that replaces the N+1 pattern.

        Instead of: for each technique → query tests → query rules
        This does: one query with subqueries for counts.
        """
        test_count_sq = (
            self._session.query(
                Test.technique_id,
                func.count(Test.id).label("test_count"),
                # .filter() on an aggregate renders SQL's FILTER (WHERE ...) clause.
                func.count(Test.id).filter(Test.state == "validated").label("validated_count"),
            )
            .group_by(Test.technique_id)
            .subquery()
        )
        rule_count_sq = (
            self._session.query(
                DetectionRule.mitre_technique_id,
                func.count(DetectionRule.id).label("rule_count"),
            )
            .group_by(DetectionRule.mitre_technique_id)
            .subquery()
        )
        rows = (
            self._session.query(
                Technique,
                func.coalesce(test_count_sq.c.test_count, 0),
                func.coalesce(test_count_sq.c.validated_count, 0),
                func.coalesce(rule_count_sq.c.rule_count, 0),
            )
            .outerjoin(test_count_sq, Technique.id == test_count_sq.c.technique_id)
            .outerjoin(rule_count_sq, Technique.mitre_id == rule_count_sq.c.mitre_technique_id)
            .order_by(Technique.mitre_id)
            .all()
        )
        return [
            TechniqueWithCounts(
                entity=TechniqueMapper.to_entity(tech),
                test_count=tc,
                validated_test_count=vtc,
                detection_rule_count=rc,
            )
            for tech, tc, vtc, rc in rows
        ]

    # ── Mutations ────────────────────────────────────────────────

    def save(self, technique: TechniqueEntity) -> TechniqueEntity:
        model = TechniqueMapper.to_model(technique)
        merged = self._session.merge(model)
        self._session.flush()  # flush to get generated values, but do NOT commit
        return TechniqueMapper.to_entity(merged)

    def exists_by_mitre_id(self, mitre_id: str) -> bool:
        return (
            self._session.query(Technique.id)
            .filter(Technique.mitre_id == mitre_id)
            .first()
        ) is not None
```
**Key design decisions:**
- **No `commit()`**: The repository flushes but never commits. Transaction control belongs to the Unit of Work, which the use case manages.
- **Returns domain entities**: The mapper converts ORM models to domain entities at the repository boundary. No ORM model ever crosses into the application or domain layers.
- **Batch method**: `find_all_with_test_counts()` replaces the N+1 pattern with subqueries — reducing 1,400+ queries to 1 for the heatmap.
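
The shape of the aggregated query can be tried out without SQLAlchemy. The sketch below uses the standard-library `sqlite3` module with a hypothetical three-table schema (table and column names are assumptions that only loosely mirror the ORM models above). It uses `SUM(CASE ...)` as a portable equivalent of the `FILTER` clause, and the sample data is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE techniques (id INTEGER PRIMARY KEY, mitre_id TEXT UNIQUE);
    CREATE TABLE tests (id INTEGER PRIMARY KEY, technique_id INTEGER, state TEXT);
    CREATE TABLE detection_rules (id INTEGER PRIMARY KEY, mitre_technique_id TEXT);

    INSERT INTO techniques (id, mitre_id)
        VALUES (1, 'T1003'), (2, 'T1059'), (3, 'T1566');
    INSERT INTO tests (technique_id, state)
        VALUES (1, 'validated'), (1, 'draft'), (2, 'validated');
    INSERT INTO detection_rules (mitre_technique_id)
        VALUES ('T1003'), ('T1003');
    """
)

# Each child table is aggregated in its own subquery BEFORE joining.
# Joining tests and rules directly would multiply rows (2 tests x 2 rules)
# and inflate both counts for T1003.
AGGREGATED_SQL = """
SELECT t.mitre_id,
       COALESCE(tc.test_count, 0),
       COALESCE(tc.validated_count, 0),
       COALESCE(rc.rule_count, 0)
FROM techniques AS t
LEFT JOIN (
    SELECT technique_id,
           COUNT(id) AS test_count,
           SUM(CASE WHEN state = 'validated' THEN 1 ELSE 0 END) AS validated_count
    FROM tests
    GROUP BY technique_id
) AS tc ON tc.technique_id = t.id
LEFT JOIN (
    SELECT mitre_technique_id, COUNT(id) AS rule_count
    FROM detection_rules
    GROUP BY mitre_technique_id
) AS rc ON rc.mitre_technique_id = t.mitre_id
ORDER BY t.mitre_id
"""

# One statement returns every technique with its counts, including
# techniques that have no tests or rules (COALESCE turns NULL into 0).
rows = conn.execute(AGGREGATED_SQL).fetchall()
```

With this sample data, `rows` is `[('T1003', 2, 1, 2), ('T1059', 1, 1, 0), ('T1566', 0, 0, 0)]`: one round trip regardless of how many techniques exist.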
### 6.3. Injection into a Use Case
```python
# presentation/dependencies/repositories.py
from fastapi import Depends
from sqlalchemy.orm import Session

from app.domain.ports.repositories.technique_repository import TechniqueRepository
from app.infrastructure.persistence.database import get_db
from app.infrastructure.persistence.repositories.sa_technique_repository import (
    SATechniqueRepository,
)


def get_technique_repository(
    db: Session = Depends(get_db),
) -> TechniqueRepository:
    """FastAPI dependency that provides a TechniqueRepository.

    Wiring lives ONLY in the presentation layer; the use case
    never knows it's getting a SQLAlchemy implementation.
    """
    return SATechniqueRepository(db)
```
```python
# presentation/dependencies/use_cases.py
from fastapi import Depends

from app.application.use_cases.techniques.create_technique import CreateTechniqueUseCase
from app.domain.ports.repositories.technique_repository import TechniqueRepository
from app.presentation.dependencies.repositories import get_technique_repository


def get_create_technique_use_case(
    technique_repo: TechniqueRepository = Depends(get_technique_repository),
) -> CreateTechniqueUseCase:
    return CreateTechniqueUseCase(technique_repo=technique_repo)
```
```python
# application/use_cases/techniques/create_technique.py
import uuid

from app.domain.entities.technique import TechniqueEntity
from app.domain.exceptions import DuplicateEntityError
from app.domain.ports.repositories.technique_repository import TechniqueRepository
from app.application.dto.technique_dto import CreateTechniqueInput, TechniqueResult


class CreateTechniqueUseCase:
    """Application use case: create a new MITRE ATT&CK technique.

    This class knows NOTHING about:
    - FastAPI, HTTP, Pydantic
    - SQLAlchemy, PostgreSQL
    - How the repository is implemented
    """

    def __init__(self, technique_repo: TechniqueRepository) -> None:
        self._repo = technique_repo

    def execute(self, input: CreateTechniqueInput, user_id: uuid.UUID) -> TechniqueResult:
        # Business rule: mitre_id must be unique
        if self._repo.exists_by_mitre_id(input.mitre_id):
            raise DuplicateEntityError("Technique", "mitre_id", input.mitre_id)

        # Create domain entity
        technique = TechniqueEntity.create(
            mitre_id=input.mitre_id,
            name=input.name,
            description=input.description,
            tactic=input.tactic,
            platforms=input.platforms,
        )

        # Persist through repository
        saved = self._repo.save(technique)

        # Return application DTO
        return TechniqueResult.from_entity(saved)
```
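
One consequence of depending only on the port is that the use case can be exercised with an in-memory fake, with no database or web framework involved. The following is a simplified sketch under that assumption: `Technique`, `CreateTechnique` and `InMemoryTechniqueRepository` are hypothetical, cut-down stand-ins for the classes above, and `DuplicateEntityError` is redefined locally so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Protocol


class DuplicateEntityError(Exception):
    """Hypothetical stand-in for app.domain.exceptions.DuplicateEntityError."""

    def __init__(self, entity: str, field: str, value: str) -> None:
        super().__init__(f"{entity} with {field}={value!r} already exists")


@dataclass(frozen=True)
class Technique:
    """Hypothetical, reduced stand-in for TechniqueEntity."""
    mitre_id: str
    name: str


class TechniqueRepository(Protocol):
    """Only the two port methods the use case needs."""

    def exists_by_mitre_id(self, mitre_id: str) -> bool: ...
    def save(self, technique: Technique) -> Technique: ...


class CreateTechnique:
    """Reduced version of the use case: uniqueness check, then save."""

    def __init__(self, technique_repo: TechniqueRepository) -> None:
        self._repo = technique_repo

    def execute(self, mitre_id: str, name: str) -> Technique:
        if self._repo.exists_by_mitre_id(mitre_id):
            raise DuplicateEntityError("Technique", "mitre_id", mitre_id)
        return self._repo.save(Technique(mitre_id=mitre_id, name=name))


class InMemoryTechniqueRepository:
    """Test double: a dict instead of a database session."""

    def __init__(self) -> None:
        self.saved = {}

    def exists_by_mitre_id(self, mitre_id: str) -> bool:
        return mitre_id in self.saved

    def save(self, technique: Technique) -> Technique:
        self.saved[technique.mitre_id] = technique
        return technique


repo = InMemoryTechniqueRepository()
use_case = CreateTechnique(technique_repo=repo)
created = use_case.execute("T1059", "Command and Scripting Interpreter")

# A second call with the same mitre_id should hit the business rule.
try:
    use_case.execute("T1059", "Duplicate attempt")
    duplicate_rejected = False
except DuplicateEntityError:
    duplicate_rejected = True
```

The fake never inherits from the port; it only has to expose the same two methods, which is what keeps domain unit tests fast.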
```python
# presentation/api/v1/techniques.py (refactored: thin router)
from fastapi import APIRouter, Depends, status

from app.application.dto.technique_dto import CreateTechniqueInput
from app.application.use_cases.techniques.create_technique import CreateTechniqueUseCase
from app.domain.exceptions import DuplicateEntityError, EntityNotFoundError
from app.presentation.dependencies.auth import get_current_user, require_role
from app.presentation.dependencies.use_cases import get_create_technique_use_case
from app.presentation.schemas.technique_schema import TechniqueCreate, TechniqueOut

router = APIRouter(prefix="/techniques", tags=["techniques"])


@router.post("", response_model=TechniqueOut, status_code=status.HTTP_201_CREATED)
def create_technique(
    payload: TechniqueCreate,
    use_case: CreateTechniqueUseCase = Depends(get_create_technique_use_case),
    current_user = Depends(require_role("admin")),
):
    """Create a new technique.

    This router:
    - Receives the HTTP request (Pydantic validates it)
    - Calls the use case
    - The error handler middleware maps domain exceptions to HTTP responses
    - Returns the result

    Total: 5 lines of actual logic.
    """
    # Translate the HTTP schema into the application-layer input DTO.
    result = use_case.execute(
        input=CreateTechniqueInput(
            mitre_id=payload.mitre_id,
            name=payload.name,
            description=payload.description,
            tactic=payload.tactic,
            platforms=payload.platforms,
        ),
        user_id=current_user.id,
    )
    return result
```
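
The router stays thin because exception-to-status translation happens elsewhere. As a rough illustration of that idea only, the sketch below maps exception types to status codes in plain Python. The specific status codes, the `DomainError` base class and the `to_http_status` helper are assumptions made for this example; they are not taken from `middleware/error_handler.py`.

```python
from http import HTTPStatus


class DomainError(Exception):
    """Hypothetical base class for domain exceptions."""


class EntityNotFoundError(DomainError):
    """Hypothetical stand-in for the domain exception of the same name."""


class DuplicateEntityError(DomainError):
    """Hypothetical stand-in for the domain exception of the same name."""


# Assumed mapping; the real handler may choose different codes.
STATUS_BY_ERROR = {
    EntityNotFoundError: HTTPStatus.NOT_FOUND,
    DuplicateEntityError: HTTPStatus.CONFLICT,
}


def to_http_status(exc: Exception) -> HTTPStatus:
    """Pick a status code for an exception raised by a use case."""
    for error_type, status in STATUS_BY_ERROR.items():
        if isinstance(exc, error_type):
            return status
    # Unmapped domain errors are treated as client errors in this sketch.
    if isinstance(exc, DomainError):
        return HTTPStatus.BAD_REQUEST
    # Anything else is an unexpected server-side failure.
    return HTTPStatus.INTERNAL_SERVER_ERROR


duplicate_status = to_http_status(DuplicateEntityError("mitre_id already exists"))
```

Keeping this table in one place means routers never need `try`/`except` blocks for domain errors.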
### 6.4. Coexistence Strategy — No Big Bang
The repository can be introduced **alongside existing code** without breaking anything:
```
Phase 1: Create the repository interface and SQLAlchemy implementation.
         Both old (direct db.query) and new (repository) code coexist.
         New endpoints use the repository. Old endpoints are unchanged.

Phase 2: Migrate routers one endpoint at a time.
         Replace db.query(Technique).filter(...) with repo.find_by_mitre_id().
         Each migration is a small, reviewable PR.

Phase 3: When all consumers are migrated, the ORM model is no longer
         imported outside infrastructure/. Enforce via linting rule.
At no point does existing functionality break. Both patterns access the same database, the same tables, the same session. The repository is an additive abstraction — it wraps what already exists.
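
The Phase 3 linting rule could take many forms. One possible sketch, using only the standard-library `ast` module, flags imports of ORM modules from files outside `infrastructure/`. The `app/infrastructure` path layout and the `forbidden_imports` helper are assumptions inferred from the import paths used in this section, not an existing Aegis tool.

```python
import ast

# Assumed module prefix and directory layout, inferred from the imports above.
FORBIDDEN_PREFIX = "app.infrastructure.persistence.orm"
ALLOWED_ROOT = "app/infrastructure/"


def forbidden_imports(source: str, path: str) -> list:
    """Return ORM module names imported by a file outside infrastructure/."""
    if path.startswith(ALLOWED_ROOT):
        return []  # infrastructure code may import ORM models freely

    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        elif isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        else:
            continue
        for name in names:
            if name == FORBIDDEN_PREFIX or name.startswith(FORBIDDEN_PREFIX + "."):
                found.append(name)
    return found


router_source = (
    "from app.infrastructure.persistence.orm.technique_model import Technique\n"
)
violations = forbidden_imports(router_source, "app/presentation/api/v1/techniques.py")
```

A check like this can run in CI over every `.py` file and fail the build when `violations` is non-empty, which makes the end state of the migration enforceable rather than aspirational.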