
Aegis — Deep Architectural Analysis

Author: Automated architecture review
Date: February 11, 2026 (updated February 18, 2026)
Scope: Backend (FastAPI/Python), Frontend (React/TypeScript), Infrastructure (Docker)

Note: Sections marked with ✅ reflect changes implemented since the initial analysis.

1. Current Architecture
2. Coupling Analysis
3. Business Logic vs Infrastructure Separation
4. SOLID Evaluation
5. Architectural Risks
6. Refactor Proposal Towards Clean Architecture
7. Executive Summary

1. Current Architecture

1.1. Classification: Layered Monolith with Incomplete Service Layer

Aegis follows a layered monolithic architecture deployed as two containers (backend + frontend) with a partial and inconsistent level of separation. It is not Clean Architecture, nor Hexagonal, nor microservices.

┌─────────────────────────────────────────────────┐
│                   FRONTEND                       │
│         React 19 + TypeScript + Vite             │
│  ┌──────────┐  ┌──────────┐  ┌───────────────┐  │
│  │  Pages   │→ │ API Layer│→ │ Axios Client  │  │
│  │(21 pages)│  │(22 mods) │  │(HttpOnly JWT) │  │
│  └──────────┘  └──────────┘  └───────────────┘  │
└────────────────────────┬────────────────────────┘
                         │ HTTP/REST
┌────────────────────────▼────────────────────────┐
│                   BACKEND                        │
│              FastAPI + SQLAlchemy                 │
│                                                  │
│  ┌─────────────────────────────────────────────┐ │
│  │              Router Layer (21 routers)       │ │
│  │  Contains: validation, queries, partial     │ │
│  │  business logic, serialization, auditing    │ │
│  └────────┬──────────────────┬─────────────────┘ │
│           │                  │                    │
│  ┌────────▼───────┐  ┌──────▼──────────────────┐ │
│  │ Service Layer  │  │   Direct DB Access       │ │
│  │ (20 services)  │  │   (SQLAlchemy queries    │ │
│  │ Partial: only  │  │    inside routers)       │ │
│  │ for workflows  │  │                          │ │
│  └────────┬───────┘  └──────┬──────────────────┘ │
│           │                  │                    │
│  ┌────────▼──────────────────▼─────────────────┐ │
│  │          Model Layer (18 models)             │ │
│  │     SQLAlchemy ORM — Anemic Domain Models    │ │
│  └────────────────────┬────────────────────────┘ │
│                       │                          │
│  ┌────────────────────▼────────────────────────┐ │
│  │          Database Layer                      │ │
│  │  PostgreSQL + MinIO (evidence storage)       │ │
│  └─────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────┘

1.2. Actual Distribution of Responsibilities

Layer	Files	Actual Responsibility
Routers	21 files	Validation, auth, direct SQL queries, partial business logic, serialization, CSV/JSON report generation
Services	20 files	Complex workflows (test state machine, scoring, notifications), external data source imports
Models	18 files	ORM table definitions — purely anemic (no behavior)
Schemas	10 files	Pydantic DTOs for request/response
Database	1 file	Session factory and `get_db()` generator

1.3. The Core Problem: Two Coexisting Patterns

Aegis has a split architectural personality:

Pattern A — Router-as-Controller (direct CRUD):
Routers like techniques.py, evidence.py, users.py, audit.py, reports.py, heatmap.py, metrics.py, detection_rules.py, threat_actors.py execute SQLAlchemy queries directly:

# techniques.py — direct query inside the router
query = db.query(Technique)
if tactic is not None:
    query = query.filter(Technique.tactic == tactic)
return query.order_by(Technique.mitre_id).all()

Pattern B — Router-delegates-to-Service:
Routers like tests.py, scores.py, notifications.py, campaigns.py delegate to services:

# tests.py — delegates to workflow service
wf_start_execution(db=db, test=test, user=current_user)

The result: There is no clear contract about where logic lives. A new developer cannot predict whether to look for logic in the router or in a service.

2. Coupling Analysis

2.1. Coupling Matrix

              Routers    Services    Models    Database    Schemas    Config
Routers         —         MEDIUM     HIGH      HIGH        HIGH      LOW
Services       LOW          —        HIGH      HIGH        NONE      MEDIUM
Models         NONE       NONE        —        HIGH        NONE      NONE
Schemas        NONE       NONE       LOW       NONE         —        NONE
Database       NONE       NONE       NONE       —          NONE      LOW

2.2. Router ↔ Model — HIGH COUPLING (Critical)

Routers import and use SQLAlchemy models directly. 11 out of 21 routers execute SQLAlchemy queries with no intermediary layer:

Router	Directly Imported Models	Queries Inside Router
`techniques.py`	Technique	`db.query(Technique).filter(...)`
`evidence.py`	Evidence, Test	`db.query(Evidence).filter(...)`
`users.py`	User	`db.query(User).filter(...)`
`audit.py`	AuditLog	`db.query(AuditLog).filter(...)`
`reports.py`	Technique, Test	`db.query(Technique)...`, `db.query(Test)...`
`heatmap.py`	Technique, Test, ThreatActor, DetectionRule, Campaign, DefensiveTechniqueMapping	Multiple complex queries
`metrics.py`	Technique, Test	Aggregations with `func.count`
`detection_rules.py`	DetectionRule, TestDetectionResult	Direct CRUD
`threat_actors.py`	ThreatActor, ThreatActorTechnique, Technique	Queries with joins
`data_sources.py`	DataSource, Technique, Test	CRUD + stats queries
`compliance.py`	ComplianceFramework, ComplianceControl, etc.	Compliance queries

Impact: Changing a table schema requires modifying both the model and every router that queries it directly. There is no indirection.

2.3. Router ↔ Database — HIGH COUPLING

All routers receive db: Session = Depends(get_db) and operate with the SQLAlchemy session directly. This means:

Routers know the ORM (db.query, db.add, db.commit, joinedload)
Routers handle transactions implicitly
There is no persistence abstraction — migrating from SQLAlchemy to another ORM or raw queries would require rewriting all routers
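
One possible shape for such a persistence abstraction, sketched against the techniques.py query shown in 1.3. TechniqueRecord, TechniqueReader, InMemoryTechniqueReader and list_techniques are hypothetical names, not part of the Aegis codebase; a real implementation would wrap the SQLAlchemy session instead of a list.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class TechniqueRecord:
    """Hypothetical read model; the real Technique has more fields."""

    mitre_id: str
    tactic: str


class TechniqueReader(Protocol):
    """Hypothetical port: the only persistence call this endpoint needs."""

    def list_by_tactic(self, tactic: str | None) -> list[TechniqueRecord]: ...


class InMemoryTechniqueReader:
    """Stand-in for a SQLAlchemy-backed implementation, handy in unit tests."""

    def __init__(self, rows: list[TechniqueRecord]) -> None:
        self._rows = list(rows)

    def list_by_tactic(self, tactic: str | None) -> list[TechniqueRecord]:
        rows = [r for r in self._rows if tactic is None or r.tactic == tactic]
        return sorted(rows, key=lambda r: r.mitre_id)


def list_techniques(reader: TechniqueReader, tactic: str | None = None) -> list[str]:
    """Endpoint body reduced to one call on the port; no db.query() here."""
    return [r.mitre_id for r in reader.list_by_tactic(tactic)]
```

With this split, the router only knows the port, so swapping the ORM touches the implementation class and nothing else.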

2.4. Service ↔ Model/Database — HIGH COUPLING

Services also access SQLAlchemy directly:

# scoring_service.py
all_tests = db.query(Test).filter(Test.technique_id == technique.id).all()

# notification_service.py
notif = db.query(Notification).filter(...).first()

Services do not use repositories or abstractions — they are essentially functions that orchestrate queries and logic.

2.5. Service ↔ Service — MEDIUM COUPLING

Inter-service coupling exists:

test_workflow_service → audit_service + notification_service
scoring_service reads from settings directly (mutable global config)
campaign_scheduler_service → campaign_service

There is no dependency injection between services — everything is direct imports.
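
A minimal sketch of constructor injection between services, as an alternative to direct imports. AuditLog, Notifier, WorkflowService and the two recording doubles are hypothetical, and the log_action() signature used here is an assumption rather than the one in audit_service.py.

```python
from __future__ import annotations

from typing import Protocol


class AuditLog(Protocol):
    def log_action(self, action: str, entity_id: str) -> None: ...


class Notifier(Protocol):
    def notify(self, message: str) -> None: ...


class WorkflowService:
    """Receives its collaborators instead of importing them."""

    def __init__(self, audit: AuditLog, notifier: Notifier) -> None:
        self._audit = audit
        self._notifier = notifier

    def start_execution(self, test_id: str) -> None:
        self._audit.log_action("start_execution", test_id)
        self._notifier.notify(f"Test {test_id} entered execution")


class RecordingAudit:
    """Test double that records calls instead of writing to the database."""

    def __init__(self) -> None:
        self.entries: list[tuple[str, str]] = []

    def log_action(self, action: str, entity_id: str) -> None:
        self.entries.append((action, entity_id))


class RecordingNotifier:
    """Test double that collects messages in memory."""

    def __init__(self) -> None:
        self.messages: list[str] = []

    def notify(self, message: str) -> None:
        self.messages.append(message)
```

The workflow can then be unit-tested with the recording doubles, without a database or notification backend.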

2.6. Service ↔ Framework — ✅ RESOLVED (was HIGH COUPLING)

~~Domain services import HTTPException from FastAPI.~~

Update (Feb 18): test_workflow_service.py now raises domain exceptions (InvalidOperationError, InvalidStateTransition) from app.domain.exceptions. The middleware/error_handler.py maps these to HTTP responses automatically. Services no longer import HTTPException.

# Current: domain/exceptions.py exceptions, mapped by middleware
raise InvalidStateTransition(current_state=..., target_state=..., entity_type="Test")
# middleware/error_handler.py → 400 Bad Request automatically
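
The mapping step can be sketched without any framework import. Only the InvalidStateTransition name, its keyword arguments, and the 400 status come from this document; DomainError, EntityNotFound, the 404/422 codes and to_http() are assumptions for illustration. In FastAPI, a function like this would typically be called from a handler registered with app.add_exception_handler.

```python
from __future__ import annotations


class DomainError(Exception):
    """Hypothetical base class for all domain exceptions."""


class EntityNotFound(DomainError):
    """Hypothetical: raised when a lookup by ID finds nothing."""


class InvalidStateTransition(DomainError):
    def __init__(self, current_state: str, target_state: str, entity_type: str) -> None:
        super().__init__(
            f"{entity_type}: cannot go from {current_state} to {target_state}"
        )
        self.current_state = current_state
        self.target_state = target_state
        self.entity_type = entity_type


# Most specific classes first; the first isinstance() match wins.
STATUS_BY_EXCEPTION: list[tuple[type[DomainError], int]] = [
    (EntityNotFound, 404),          # assumed status
    (InvalidStateTransition, 400),  # status stated in this document
    (DomainError, 422),             # assumed fallback for other domain errors
]


def to_http(exc: DomainError) -> tuple[int, dict[str, str]]:
    """Translate a domain exception into (status_code, JSON body)."""
    for exc_type, status in STATUS_BY_EXCEPTION:
        if isinstance(exc, exc_type):
            return status, {"detail": str(exc)}
    return 500, {"detail": "Internal error"}
```

Keeping the table in one place means services never need to know which status code a failure maps to.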

2.7. Frontend ↔ Backend — LOW COUPLING (Correct)

Communication is via REST API with aligned but independent types (types/models.ts vs schemas/*.py). The frontend uses Axios with interceptors — good decoupling.

3. Business Logic vs Infrastructure Separation

3.1. Diagnosis: INSUFFICIENT SEPARATION

Aspect	Status	Detail
Workflow logic	PARTIAL	`test_workflow_service.py` correctly encapsulates the state machine. It is the best designed service.
Scoring	PARTIAL	`scoring_service.py` encapsulates calculations but accesses DB directly and reads `settings` as mutable global state.
CRUD	NOT SEPARATED	CRUD operations live in routers, mixed with HTTP concerns.
Report generation	NOT SEPARATED	`reports.py` (router) builds complex CSVs and JSONs with inline queries of 50+ lines.
Heatmap/visualization	NOT SEPARATED	`heatmap.py` (router) has ~500 lines with all ATT&CK Navigator mapping logic embedded.
Metrics	NOT SEPARATED	`metrics.py` and `operational_metrics.py` (routers) have complex aggregation queries.
Data import	WELL SEPARATED	The 8 import services (`atomic_import_service`, `sigma_import_service`, etc.) are correctly isolated.
Notifications	WELL SEPARATED	`notification_service.py` encapsulates all logic.
Auditing	WELL SEPARATED	`audit_service.py` is a pure `log_action()` function.

3.2. Anemic Model (Anti-pattern)

SQLAlchemy models are purely declarative — they have no business methods:

# models/test.py — columns only, zero behavior
class Test(Base):
    __tablename__ = "tests"
    id = Column(UUID, primary_key=True)
    state = Column(Enum(TestState))
    # ... more columns
    # Missing: can_transition(), validate(), calculate_score()

Logic that should be in domain models (business validations, state transitions, calculations) is scattered across routers and services.

3.3. Infrastructure Bleeding Into Logic

Infrastructure	Where It Appears Inappropriately
`SQLAlchemy Session`	Inside domain services (scoring, workflow, notifications)
`FastAPI HTTPException`	✅ Resolved (Feb 18): previously raised inside test_workflow_service, which now raises domain exceptions
`MinIO/boto3`	`storage.py` is well isolated, but called from routers directly
`APScheduler`	Directly coupled in `jobs/mitre_sync_job.py` with `SessionLocal()`

4. SOLID Evaluation

4.1. Single Responsibility Principle (SRP) — PARTIAL VIOLATION

Component	Compliant?	Issue
`heatmap.py` (router)	NO	528 lines — HTTP handling + query building + color mapping + Navigator JSON serialization + export logic
`reports.py` (router)	NO	HTTP handling + aggregation queries + CSV generation + JSON formatting
`tests.py` (router)	PARTIAL	Delegates workflow but maintains CRUD, template instantiation, timeline queries
`scoring_service.py`	PARTIAL	Scoring + mutable global config reading + direct queries
`test_workflow_service.py`	YES	Single responsibility: test state machine
`notification_service.py`	YES	Single responsibility: notification management
`audit_service.py`	YES	Single responsibility: audit logging

Verdict: Well-isolated services comply with SRP. "Fat routers" flagrantly violate it.

4.2. Open/Closed Principle (OCP) — VIOLATION

Scoring weights: Scoring weights are read from settings (mutable global object). The scores.py router allows mutating settings directly at runtime via a PATCH endpoint. This is a global change without persistence that affects all requests.
Heatmap layers: Each heatmap type is a separate endpoint with hardcoded logic. Adding a new layer type requires modifying the router.
Import services: Each data source is a separate service (atomic_import_service, sigma_import_service, etc.) without a common interface. Adding a new source requires creating a new service AND modifying data_sources.py and system.py.
Test states: The state machine is well defined in VALID_TRANSITIONS, but adding a new state requires modifying the dictionary AND potentially all services that read TestState.
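
For the import services, one way to close the router against modification is a shared interface plus a registry. This is a sketch under assumptions: Importer, IMPORTERS, register_importer, run_import and FakeSigmaImporter are hypothetical names, and the real import services may need richer arguments than a bare run().

```python
from __future__ import annotations

from typing import Protocol


class Importer(Protocol):
    def run(self) -> int:
        """Import records and return how many were processed."""
        ...


# Name -> importer; routers look sources up here instead of importing each service.
IMPORTERS: dict[str, Importer] = {}


def register_importer(name: str, importer: Importer) -> None:
    if name in IMPORTERS:
        raise ValueError(f"importer {name!r} is already registered")
    IMPORTERS[name] = importer


def run_import(name: str) -> int:
    try:
        importer = IMPORTERS[name]
    except KeyError:
        raise LookupError(f"unknown data source: {name}") from None
    return importer.run()


class FakeSigmaImporter:
    """Illustrative importer that pretends to load three rules."""

    def run(self) -> int:
        return 3
```

Adding a new source then means writing one class and one register_importer() call; data_sources.py and system.py would only call run_import().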

4.3. Liskov Substitution Principle (LSP) — N/A (Partial)

There is no significant inheritance or polymorphism in the backend: services are functions, not classes, and at the time of the initial analysis there were no interfaces or abstract classes. LSP therefore does not directly apply, but the absence of formal contracts (Protocols/ABCs) is a symptom of a codebase that was not designed for extensibility.

4.4. Interface Segregation Principle (ISP) — VIOLATION

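At the time of the initial analysis, no interfaces (Protocol or ABC) existed anywhere in the project; repository Protocols have since been added under domain/ports/repositories/ (see 4.5).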
Services expose loose functions, not contracts.
Routers depend on complete services when they only use one or two functions.
The Settings object is a monolithic entity with ~15 properties injected as a global.
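
A narrow Protocol is one way to stop consumers from depending on the whole Settings object. The two attribute names below are the ones mutated in scores.py; ScoringConfig, FixedScoringConfig and the weighted-sum formula are assumptions for illustration, not the actual scoring logic.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


class ScoringConfig(Protocol):
    """Narrow view of Settings: only what scoring reads."""

    SCORING_WEIGHT_TESTS: float
    SCORING_WEIGHT_DETECTION_RULES: float


def weighted_score(cfg: ScoringConfig, tests_ratio: float, rules_ratio: float) -> float:
    # Assumed formula, for illustration only: a plain weighted sum.
    return (
        cfg.SCORING_WEIGHT_TESTS * tests_ratio
        + cfg.SCORING_WEIGHT_DETECTION_RULES * rules_ratio
    )


@dataclass
class FixedScoringConfig:
    """Test double; any object with these two attributes satisfies the Protocol."""

    SCORING_WEIGHT_TESTS: float = 0.5
    SCORING_WEIGHT_DETECTION_RULES: float = 0.5
```

The existing Settings object would satisfy ScoringConfig structurally, so adopting the narrower type requires no change to configuration loading.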

4.5. Dependency Inversion Principle (DIP) — ✅ PARTIALLY RESOLVED (was SEVERE VIOLATION)

Update (Feb 18): Protocol interfaces and abstractions now exist:

# domain/ports/repositories/ — Protocol interfaces
class TechniqueRepository(Protocol):
    def find_by_id(self, technique_id: UUID) -> TechniqueEntity | None: ...
    def save(self, technique: TechniqueEntity) -> TechniqueEntity: ...

# dependencies/repositories.py — FastAPI Depends() wiring
def get_technique_repository(db=Depends(get_db)) -> SATechniqueRepository: ...

Domain layer has zero framework imports (no FastAPI, no SQLAlchemy).
Repository ports define contracts; infrastructure implements them.
test_workflow_service.py now uses domain exceptions instead of HTTPException.
UnitOfWork manages transactions.

Remaining: Some services still use direct imports for audit_service, notification_service. Full DIP adoption is incremental.

5. Architectural Risks

5.1. CRITICAL RISK: God Routers

Router	Lines	Complexity
`tests.py`	664	15+ endpoints, CRUD + workflow + template instantiation
`heatmap.py`	528	5 endpoints, color logic, Navigator export
`campaigns.py`	~400+	CRUD + scheduling + threat actor generation
`reports.py`	273	4 endpoints with complex aggregation queries
`compliance.py`	~350+	CRUD + import + gap analysis + CSV export

These routers are Fat Controllers — they contain logic that should be in services, repositories, or domain objects.

5.2. CRITICAL RISK: In-Memory Token Blacklist ✅ RESOLVED

Update (Feb 18): The token blacklist is now Redis-backed via infrastructure/redis_client.py. Tokens are stored with TTL matching expiration. Shared across all workers and survives restarts.
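
The TTL behaviour described above can be sketched as follows. This is not the code in infrastructure/redis_client.py: ExpiringStore, TokenBlacklist, FakeStore and the "blacklist:" key prefix are hypothetical. The two store methods mirror the shape of redis-py's set(name, value, ex=...) and exists(name), so a real Redis client should fit the same port.

```python
from __future__ import annotations

import math
import time
from typing import Protocol


class ExpiringStore(Protocol):
    """Subset of a Redis-like client used by the blacklist."""

    def set(self, name: str, value: str, ex: int) -> object: ...
    def exists(self, name: str) -> int: ...


class TokenBlacklist:
    def __init__(self, store: ExpiringStore, prefix: str = "blacklist:") -> None:
        self._store = store
        self._prefix = prefix

    def revoke(self, jti: str, expires_at: float, now: float | None = None) -> None:
        """Store the token ID until the token itself would have expired."""
        current = time.time() if now is None else now
        ttl = math.ceil(expires_at - current)  # round up so the entry outlives the token
        if ttl <= 0:
            return  # token already expired; nothing to store
        self._store.set(self._prefix + jti, "1", ex=ttl)

    def is_revoked(self, jti: str) -> bool:
        return self._store.exists(self._prefix + jti) > 0


class FakeStore:
    """In-memory stand-in with a manual clock, for tests only."""

    def __init__(self) -> None:
        self.now = 0.0
        self._expiry: dict[str, float] = {}

    def set(self, name: str, value: str, ex: int) -> object:
        self._expiry[name] = self.now + ex
        return True

    def exists(self, name: str) -> int:
        expiry = self._expiry.get(name)
        return int(expiry is not None and expiry > self.now)
```

Because entries expire on their own, the blacklist never grows beyond the set of tokens that are still valid.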

5.3. HIGH RISK: Mutable Settings at Runtime

# scores.py — direct mutation of global settings
settings.SCORING_WEIGHT_TESTS = body.weight_tests
settings.SCORING_WEIGHT_DETECTION_RULES = body.weight_detection_rules

Changes do not persist: a server restart loses any custom scoring configuration.
Not safe under concurrency: with multiple workers, each worker process holds its own copy of settings, so a change reaches only the worker that handled the request.
Violates the configuration immutability principle.
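
An immutable value object is one way to remove the mutation. The sketch below is an assumption, not the ScoringWeights class in domain/value_objects/: the field names, the validation rule (non-negative, summing to 1.0) and update_weights() are hypothetical.

```python
from __future__ import annotations

import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ScoringWeights:
    tests: float
    detection_rules: float

    def __post_init__(self) -> None:
        # Assumed rule: weights are non-negative and sum to 1.0.
        if self.tests < 0 or self.detection_rules < 0:
            raise ValueError("weights must be non-negative")
        if not math.isclose(self.tests + self.detection_rules, 1.0):
            raise ValueError("weights must sum to 1.0")


def update_weights(current: ScoringWeights, **changes: float) -> ScoringWeights:
    """Return a new validated instance; the original is left untouched."""
    return replace(current, **changes)
```

An update endpoint would then build a new instance and persist it (for example in the database), rather than assigning to attributes of the global settings object.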

5.4. HIGH RISK: No Repository Layer ✅ PARTIALLY RESOLVED

Update (Feb 18): Repository ports and implementations now exist:

domain/ports/repositories/ — Protocol interfaces for TechniqueRepository and TestRepository.
infrastructure/persistence/repositories/ — SQLAlchemy implementations (SATechniqueRepository, SATestRepository) with batch query methods.
dependencies/repositories.py — FastAPI Depends() wiring.

Remaining: Old routers still use direct db.query(). Migration is incremental — new endpoints use repositories, old ones coexist.

5.5. HIGH RISK: No CI/CD ✅ RESOLVED

Update (Feb 18): GitHub Actions CI pipeline exists at .github/workflows/ci.yml:

Runs ruff lint + pytest on every push/PR.
Uses PostgreSQL + Redis service containers (production-like environment).
Local validation via scripts/agent_validate_backend.sh.

5.6. MEDIUM RISK: Background Jobs with Own Sessions

# mitre_sync_job.py
db = SessionLocal()
try:
    sync_mitre(db)
finally:
    db.close()

Background jobs create sessions outside the request lifecycle. This is technically correct, but:

No robust error handling (no retry mechanism).
No observability (no structured logging).
No dead letter queue for failed jobs.
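
The first two gaps can be narrowed with a small wrapper around the job callable. This is a sketch: run_with_retry, its parameters and the "aegis.jobs" logger name are hypothetical, and it does not replace a proper dead letter queue.

```python
from __future__ import annotations

import json
import logging
import time
from typing import Callable

logger = logging.getLogger("aegis.jobs")  # hypothetical logger name


def run_with_retry(
    job: Callable[[], None],
    name: str,
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run job up to `attempts` times with exponential backoff.

    Returns True on success, False once every attempt has failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            job()
        except Exception as exc:  # a background job must not crash the scheduler
            logger.warning(json.dumps(
                {"job": name, "attempt": attempt, "status": "failed", "error": repr(exc)}
            ))
            if attempt < attempts:
                sleep(base_delay * 2 ** (attempt - 1))
        else:
            logger.info(json.dumps({"job": name, "attempt": attempt, "status": "ok"}))
            return True
    return False
```

The sleep parameter is injectable so tests can record delays instead of waiting; the job passed in would be the existing open-session, sync, close-session sequence.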

5.7. MEDIUM RISK: Anemic Models ✅ PARTIALLY RESOLVED

Update (Feb 18): Rich domain entities now exist alongside ORM models:

domain/test_entity.py — Full state machine with business logic, domain events, dual validation, timers.
domain/entities/technique.py — Status recalculation, review lifecycle, MITRE ID validation.
domain/value_objects/ — MitreId, ScoringWeights (immutable, validated).
ORM models remain anemic by design (persistence mapping only). Business logic lives in domain entities.

Remaining: Campaign, ComplianceFramework, ThreatActor still lack domain entity counterparts.

5.8. MEDIUM RISK: No Explicit Transaction Management ✅ PARTIALLY RESOLVED

Update (Feb 18): A UnitOfWork context manager exists at domain/unit_of_work.py with explicit commit(), rollback(), and flush(). Used by test_workflow_service.py which explicitly states "The caller (router) is responsible for committing the session via the Unit of Work pattern."

Remaining: Some services like audit_service.py still call db.commit() directly. Needs incremental migration.
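
The pattern itself can be sketched as follows. This is not the implementation in domain/unit_of_work.py; SessionLike and FakeSession are hypothetical, and the rollback-unless-committed rule on exit is an assumption about the desired behaviour. A SQLAlchemy Session provides commit(), rollback() and flush(), so it would satisfy the port.

```python
from __future__ import annotations

from typing import Protocol


class SessionLike(Protocol):
    def commit(self) -> None: ...
    def rollback(self) -> None: ...
    def flush(self) -> None: ...


class UnitOfWork:
    """Context manager that rolls back unless commit() was called explicitly."""

    def __init__(self, session: SessionLike) -> None:
        self._session = session
        self._committed = False

    def __enter__(self) -> "UnitOfWork":
        self._committed = False
        return self

    def commit(self) -> None:
        self._session.commit()
        self._committed = True

    def flush(self) -> None:
        self._session.flush()

    def rollback(self) -> None:
        self._session.rollback()

    def __exit__(self, exc_type, exc, tb) -> bool:
        if exc_type is not None or not self._committed:
            self._session.rollback()
        return False  # never swallow the caller's exception


class FakeSession:
    """Records calls so the transaction boundary can be asserted in tests."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def commit(self) -> None:
        self.calls.append("commit")

    def rollback(self) -> None:
        self.calls.append("rollback")

    def flush(self) -> None:
        self.calls.append("flush")
```

Services that currently call db.commit() directly would instead leave the commit to whoever opened the unit of work.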

5.9. LOW RISK: No Semantic API Versioning

The API is under /api/v1 but there is no mechanism to support v2 without duplicating entire routers.

6. Refactor Proposal Towards Clean Architecture

6.1. Target Structure

backend/
├── app/
│   ├── main.py                          # FastAPI setup (minimal)
│   ├── config.py                        # Settings (immutable)
│   │
│   ├── domain/                          # ★ DOMAIN LAYER (no external dependencies)
│   │   ├── entities/                    # Entities with behavior
│   │   │   ├── test.py                  # Test entity with can_transition(), validate()
│   │   │   ├── technique.py             # Technique with calculate_status()
│   │   │   ├── campaign.py              # Campaign with add_test(), activate()
│   │   │   └── ...
│   │   ├── value_objects/               # Immutable value objects
│   │   │   ├── score.py                 # TechniqueScore, OrganizationScore
│   │   │   ├── test_state.py            # TestState with valid transitions
│   │   │   └── mitre_id.py              # MitreId with validation
│   │   ├── exceptions.py               # Domain exceptions (NOT HTTPException)
│   │   │   # InvalidTransitionError, EntityNotFoundError, etc.
│   │   ├── events.py                   # Domain events
│   │   │   # TestValidated, TestRejected, CampaignCompleted
│   │   └── ports/                       # ★ INTERFACES (ABCs / Protocols)
│   │       ├── repositories/
│   │       │   ├── test_repository.py   # ABC: find_by_id(), save(), list_by_technique()
│   │       │   ├── technique_repository.py
│   │       │   ├── campaign_repository.py
│   │       │   └── ...
│   │       ├── services/
│   │       │   ├── storage_port.py      # ABC: upload_file(), get_presigned_url()
│   │       │   ├── notification_port.py # ABC: send_notification()
│   │       │   └── event_bus_port.py    # ABC: publish(event)
│   │       └── auth/
│   │           └── token_service_port.py
│   │
│   ├── application/                     # ★ APPLICATION LAYER (use cases)
│   │   ├── use_cases/
│   │   │   ├── tests/
│   │   │   │   ├── create_test.py       # CreateTestUseCase
│   │   │   │   ├── start_execution.py   # StartExecutionUseCase
│   │   │   │   ├── submit_red.py
│   │   │   │   ├── validate_test.py
│   │   │   │   └── get_retest_chain.py
│   │   │   ├── scoring/
│   │   │   │   ├── calculate_technique_score.py
│   │   │   │   └── calculate_organization_score.py
│   │   │   ├── campaigns/
│   │   │   │   ├── create_campaign.py
│   │   │   │   └── generate_from_threat_actor.py
│   │   │   ├── heatmap/
│   │   │   │   ├── generate_coverage_layer.py
│   │   │   │   └── export_navigator.py
│   │   │   └── reports/
│   │   │       ├── generate_coverage_report.py
│   │   │       └── export_coverage_csv.py
│   │   ├── dto/                         # Input/Output DTOs for use cases
│   │   │   ├── test_dto.py
│   │   │   └── ...
│   │   └── interfaces/                  # Application-level ports
│   │       └── unit_of_work.py          # ABC: UnitOfWork with commit/rollback
│   │
│   ├── infrastructure/                  # ★ INFRASTRUCTURE LAYER (implementations)
│   │   ├── persistence/
│   │   │   ├── orm/                     # SQLAlchemy models (mapping only)
│   │   │   │   ├── test_model.py
│   │   │   │   ├── technique_model.py
│   │   │   │   └── ...
│   │   │   ├── repositories/            # Concrete implementations
│   │   │   │   ├── sqlalchemy_test_repository.py
│   │   │   │   ├── sqlalchemy_technique_repository.py
│   │   │   │   └── ...
│   │   │   ├── unit_of_work.py          # SQLAlchemy UoW implementation
│   │   │   └── database.py              # Engine, session factory
│   │   ├── storage/
│   │   │   └── minio_storage.py         # Implements StoragePort
│   │   ├── external/                    # Import services
│   │   │   ├── mitre_sync.py
│   │   │   ├── atomic_import.py
│   │   │   ├── sigma_import.py
│   │   │   └── ...
│   │   ├── auth/
│   │   │   ├── jwt_service.py           # Implements TokenServicePort
│   │   │   └── token_blacklist.py       # Redis-backed blacklist
│   │   ├── notifications/
│   │   │   └── db_notification_service.py
│   │   ├── jobs/
│   │   │   └── scheduler.py             # APScheduler setup
│   │   └── cache/
│   │       └── redis_cache.py           # Score caching (Redis)
│   │
│   └── presentation/                    # ★ PRESENTATION LAYER (HTTP)
│       ├── api/
│       │   ├── v1/
│       │   │   ├── tests.py             # Routing + request/response mapping only
│       │   │   ├── techniques.py
│       │   │   ├── heatmap.py
│       │   │   └── ...
│       │   └── dependencies.py          # FastAPI Depends() wiring
│       ├── schemas/                     # Pydantic schemas (request/response)
│       │   ├── test_schema.py
│       │   └── ...
│       ├── middleware/
│       │   ├── error_handler.py         # Domain exceptions → HTTP responses
│       │   └── rate_limiter.py
│       └── mappers/                     # Entity ↔ Schema mappers
│           ├── test_mapper.py
│           └── ...

6.2. Dependency Rules

Presentation → Application → Domain ← Infrastructure
     ↓              ↓           ↑           ↑
  FastAPI       Use Cases    Entities    SQLAlchemy
  Pydantic      DTOs        Ports       MinIO
                                        Redis
                                        APScheduler

The golden rule: Dependencies only point towards the center (Domain). Infrastructure implements the ports defined in Domain.
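
The rule can be checked mechanically, for example in CI. The sketch below scans a module's source for framework imports; forbidden_imports and the FORBIDDEN set are hypothetical, and a real check would iterate over every file under domain/.

```python
from __future__ import annotations

import ast

# Top-level packages the domain layer must not import (assumed list).
FORBIDDEN = frozenset({"fastapi", "sqlalchemy"})


def forbidden_imports(source: str, forbidden: frozenset[str] = FORBIDDEN) -> list[str]:
    """Return the forbidden modules imported by `source`, sorted by name."""
    found: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]  # absolute "from x import y"
        else:
            continue  # relative imports and non-import nodes are ignored
        found.extend(n for n in names if n.split(".")[0] in forbidden)
    return sorted(found)
```

A non-empty result for any file in the domain package would indicate a dependency pointing away from the center.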

6.3. Key Changes by Layer

Domain Layer (New)

# domain/entities/test.py: rich entity (not anemic)
# TestState, VALID_TRANSITIONS, UserEntity, DomainEvent, TestExecutionStarted
# and DomainException are assumed to be defined elsewhere in the domain package.
from typing import Protocol
from uuid import UUID

class TestEntity:
    def __init__(self, id: UUID, state: TestState, technique_id: UUID):
        # Remaining constructor fields omitted for brevity
        self.id = id
        self.technique_id = technique_id
        self._state = state

    def can_transition_to(self, target: TestState) -> bool:
        return target in VALID_TRANSITIONS[self._state]

    def start_execution(self, user: UserEntity) -> list[DomainEvent]:
        if not self.can_transition_to(TestState.red_executing):
            raise InvalidTransitionError(self._state, TestState.red_executing)
        self._state = TestState.red_executing
        return [TestExecutionStarted(test_id=self.id, user_id=user.id)]

# domain/exceptions.py: domain exceptions, NOT HTTPException
class InvalidTransitionError(DomainException):
    def __init__(self, current: TestState, target: TestState):
        # Pass a message to the base class so str(e) is meaningful to callers
        super().__init__(f"Cannot transition from {current} to {target}")
        self.current = current
        self.target = target

# domain/ports/repositories/test_repository.py: abstract interface
class TestRepository(Protocol):
    def find_by_id(self, test_id: UUID) -> TestEntity | None: ...
    def save(self, test: TestEntity) -> None: ...
    def list_by_technique(self, technique_id: UUID) -> list[TestEntity]: ...

Application Layer (Use Cases)

# application/use_cases/tests/start_execution.py
class StartExecutionUseCase:
    # UserRepository is a hypothetical port, needed to load the acting user by ID
    def __init__(self, test_repo: TestRepository, user_repo: UserRepository, uow: UnitOfWork):
        self._test_repo = test_repo
        self._user_repo = user_repo
        self._uow = uow

    def execute(self, test_id: UUID, user_id: UUID) -> TestDTO:
        with self._uow:
            test = self._test_repo.find_by_id(test_id)
            if not test:
                raise EntityNotFoundError("Test", test_id)
            user = self._user_repo.find_by_id(user_id)
            if not user:
                raise EntityNotFoundError("User", user_id)
            events = test.start_execution(user)
            self._test_repo.save(test)
            self._uow.commit()
            # events are published after commit
            return TestDTO.from_entity(test)

Presentation Layer (Slim Routers)

# presentation/api/v1/tests.py — HTTP concerns only
@router.post("/{test_id}/start-execution")
def start_execution(
    test_id: UUID,
    use_case: StartExecutionUseCase = Depends(get_start_execution_use_case),
    current_user: User = Depends(get_current_user),
):
    try:
        result = use_case.execute(test_id, current_user.id)
        return result
    except EntityNotFoundError:
        raise HTTPException(404)
    except InvalidTransitionError as e:
        raise HTTPException(400, detail=str(e))

Infrastructure Layer (Implementations)

# infrastructure/persistence/repositories/sqlalchemy_test_repository.py
class SQLAlchemyTestRepository(TestRepository):
    def __init__(self, session: Session):
        self._session = session

    def find_by_id(self, test_id: UUID) -> TestEntity | None:
        model = self._session.query(TestModel).filter(TestModel.id == test_id).first()
        return TestMapper.to_entity(model) if model else None

    def save(self, test: TestEntity) -> None:
        model = TestMapper.to_model(test)
        self._session.merge(model)

6.4. Incremental Migration Plan (Phases)

The refactor must be incremental — not big bang. Each phase delivers value and the system continues working.

Phase 1: Foundations (1-2 weeks)

Create the directory structure: domain/, application/, infrastructure/, presentation/.
Create domain/exceptions.py with domain exceptions.
Create error_handler.py middleware that maps domain exceptions → HTTP responses.
Create domain/ports/repositories/ with Protocol interfaces for the 3-4 most used entities (Test, Technique, Campaign).
Create SQLAlchemy implementations of these repositories.
Do not move routers yet.

Phase 2: Extract the Test Domain (1-2 weeks)

Create domain/entities/test.py with the state machine (extract from test_workflow_service).
Create use cases for each state transition.
Migrate the tests.py router to use the use cases.
Remove HTTPException from test_workflow_service.
Pure unit tests for the domain entity (no DB).

Phase 3: Extract Fat Services from Routers (2-3 weeks)

Move heatmap.py logic to application/use_cases/heatmap/.
Move reports.py logic to application/use_cases/reports/.
Move metrics.py logic to application services.
Routers become thin controllers (< 20 lines per endpoint).

Phase 4: Complete Repository Pattern (1-2 weeks)

Create repositories for all remaining entities.
Migrate scattered queries from routers to repositories.
Remove db.query(...) from any file outside infrastructure/.

Phase 5: Robust Infrastructure (1-2 weeks)

Move token blacklist to Redis.
Implement the Unit of Work pattern.
Move scoring config to the database (not mutable settings).
Add event bus for domain events (notifications, auditing).
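
The event bus in the last item could start as small as the in-process sketch below. InProcessEventBus is hypothetical; the CampaignCompleted name comes from the target structure in 6.1, but its field is assumed. Handlers run synchronously, in subscription order.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class CampaignCompleted:
    """Event name taken from the target structure; the field is assumed."""

    campaign_id: str


class InProcessEventBus:
    def __init__(self) -> None:
        self._handlers: dict[type, list[Callable[[object], None]]] = defaultdict(list)

    def subscribe(self, event_type: type, handler: Callable[[object], None]) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event: object) -> None:
        # Iterate over a copy so a handler may subscribe others while we loop.
        for handler in list(self._handlers.get(type(event), [])):
            handler(event)
```

Auditing and notifications would each subscribe a handler, so a use case only publishes the event after commit and no longer imports either service.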

Phase 6: CI/CD and Observability

Set up GitHub Actions (lint, type check, tests).
Add structured logging.
Add improved health checks.

7. Executive Summary

Current Strengths

Strength	Detail
Well-modeled domain	The data model covers ATT&CK, D3FEND, compliance, threat actors, and campaigns comprehensively
Solid test workflow	The state machine in `test_workflow_service` is the best designed component
Clean frontend	API/pages/components separation with TanStack Query is correct
Secure auth	HttpOnly cookies + RBAC with 6 well-defined roles
Import services	The 8 import services are well encapsulated
Existing tests	18 test files with fixtures — a foundation to build upon

Critical Weaknesses (Updated Feb 18)

Weakness	Original Severity	Current Status
Fat controllers (routers with business logic)	CRITICAL	Partially resolved (heatmap extracted)
No repository layer	HIGH	✅ Partially resolved (Test, Technique repos exist; older routers still use direct db.query())
Services depend on FastAPI	HIGH	✅ Resolved (domain exceptions + middleware)
Anemic models	MEDIUM	✅ Partially resolved (TestEntity, TechniqueEntity)
In-memory token blacklist	CRITICAL	✅ Resolved (Redis-backed)
Mutable settings at runtime	HIGH	Open
No CI/CD	HIGH	✅ Resolved (GitHub Actions)
No dependency inversion	HIGH	✅ Partially resolved (ports + repos)

Final Classification

┌──────────────────────────────────────────────────────────┐
│  Type:        Clean Modular Monolith (in transition)     │
│  Maturity:    Pre-production → Production-ready          │
│  SOLID:       3.5/5 (SRP partial, DIP started, OCP/ISP  │
│               in progress)                               │
│  Testability: 6/10 (326 tests, domain unit tests, repo  │
│               integration tests)                         │
│  Coupling:    5/10 (domain layer fully decoupled, old    │
│               routers still coupled)                     │
│  Cohesion:    7/10 (domain entities own business rules)  │
│  Estimated remaining tech debt: ~2-3 weeks               │
└──────────────────────────────────────────────────────────┘

Recommendation (Updated Feb 18)

The foundational Clean Architecture layers are now in place. The migration is proceeding incrementally. The top 4 immediate priorities from the original analysis are all resolved:

~~Extract domain exceptions~~ ✅ Done
~~Create repositories for Test and Technique~~ ✅ Done
~~Move token blacklist to Redis~~ ✅ Done
~~Set up basic CI/CD~~ ✅ Done

Next priorities:

Migrate fat routers to use repositories (incremental, per-router)
Persist scoring weights in database
Create domain entities for Campaign and ComplianceFramework
Add structured JSON logging
