Aegis — Technical Debt, Risks & Improvement Plan

Author: Architecture review
Date: February 11, 2026 (updated February 18, 2026)
Scope: Backend, Frontend, Infrastructure, Security, Scalability, Maintainability

Note: Items marked with ✅ have been resolved or partially resolved. See the inline status annotations on each item.

Table of Contents

1. Technical Debt
2. Scalability Risks
3. Security Risks
4. Maintainability Risks
5. Recommended Medium-Term Improvements
6. Priority Matrix

1. Technical Debt

HIGH PRIORITY

TD-001: Fat Controllers (Routers with Embedded Business Logic)

Current state: 11 of 21 routers execute raw SQLAlchemy queries directly. The worst offenders are listed below (line counts predate the heatmap extraction described in SR-003):

Router	Lines	Embedded Logic
`heatmap.py`	528	Query building + color mapping + ATT&CK Navigator JSON serialization + export
`tests.py`	664	CRUD + template instantiation + timeline queries (workflow delegated)
`reports.py`	273	Aggregation queries + CSV generation + JSON formatting
`compliance.py`	~350	CRUD + import + gap analysis + CSV export
`metrics.py`	~316	Complex aggregation queries with in-memory processing

Impact: Cannot unit test business logic without spinning up FastAPI + DB. Logic duplication across routers. Changes to one query pattern must be replicated manually in every router that uses it.

Remediation: Extract query logic to service/repository layer. Each router endpoint should be < 20 lines.

TD-002: No Repository Layer — Scattered Duplicate Queries ✅ PARTIALLY RESOLVED

Current state (updated Feb 18): Repository ports (Protocol interfaces) and SQLAlchemy implementations now exist for Technique and Test:

domain/ports/repositories/technique_repository.py — Protocol with find_by_id(), find_by_mitre_id(), list_all(), count_by_status(), find_all_with_test_counts(), save(), etc.
domain/ports/repositories/test_repository.py — Protocol with find_by_id(), list_by_technique(), get_states_and_results_for_technique(), etc.
infrastructure/persistence/repositories/sa_technique_repository.py — Concrete implementation with batch queries (eliminates N+1 for heatmap/scoring).
infrastructure/persistence/repositories/sa_test_repository.py — Concrete implementation.
dependencies/repositories.py — FastAPI Depends() wiring.

Remaining: Old routers still use direct db.query(). New endpoints should use repositories; existing endpoints will be migrated incrementally.

Impact: New code has centralized query management. Old queries still scattered but coexist safely.
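The port-and-adapter split described above can be illustrated with a minimal sketch. The method names come from the list above, but the signatures, the `Technique` dataclass, and the in-memory adapter are assumptions made for illustration; the real Protocol is not shown in this document.

```python
from dataclasses import dataclass
from typing import Optional, Protocol, runtime_checkable


@dataclass(frozen=True)
class Technique:
    """Hypothetical minimal stand-in for the real technique entity."""
    id: int
    mitre_id: str
    name: str


@runtime_checkable
class TechniqueRepository(Protocol):
    """Port. Method names follow the document; signatures are assumed."""

    def find_by_id(self, technique_id: int) -> Optional[Technique]: ...
    def find_by_mitre_id(self, mitre_id: str) -> Optional[Technique]: ...
    def list_all(self) -> list[Technique]: ...


class InMemoryTechniqueRepository:
    """Test double that satisfies the port without a database."""

    def __init__(self, techniques: list[Technique]) -> None:
        self._by_id = {t.id: t for t in techniques}

    def find_by_id(self, technique_id: int) -> Optional[Technique]:
        return self._by_id.get(technique_id)

    def find_by_mitre_id(self, mitre_id: str) -> Optional[Technique]:
        return next(
            (t for t in self._by_id.values() if t.mitre_id == mitre_id), None
        )

    def list_all(self) -> list[Technique]:
        return list(self._by_id.values())


def technique_name(repo: TechniqueRepository, mitre_id: str) -> str:
    """Example caller that depends only on the port, not on SQLAlchemy."""
    technique = repo.find_by_mitre_id(mitre_id)
    return technique.name if technique else "unknown"
```

Because callers depend only on the Protocol, business logic of this shape can be unit tested against the in-memory double, while production wiring supplies the SQLAlchemy implementation.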

TD-003: Services Depend on FastAPI (HTTPException in Domain Logic) ✅ RESOLVED

Current state (updated Feb 18): Domain exceptions have been implemented and are in active use:

domain/errors.py — Full exception hierarchy: DomainError, EntityNotFoundError, DuplicateEntityError, InvalidStateTransition, BusinessRuleViolation, InvalidOperationError, PermissionViolation.
domain/exceptions.py — Backward-compatible re-exports.
middleware/error_handler.py — Maps domain exceptions to HTTP responses automatically (404, 409, 400, 403).
test_workflow_service.py — Now raises InvalidOperationError and InvalidStateTransition instead of HTTPException.

No further action needed for the core services. Some secondary routers may still raise HTTPException directly (which is acceptable at the presentation layer).
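As an illustration of how such a mapping might work, here is a framework-free sketch. The exception names are taken from the list above; the pairing of each exception with a specific status code and the fallback of 400 are assumptions, since the document lists the codes but not which exception produces which.

```python
class DomainError(Exception):
    """Base class for domain failures (name taken from the document)."""


class EntityNotFoundError(DomainError): ...
class DuplicateEntityError(DomainError): ...
class InvalidStateTransition(DomainError): ...
class PermissionViolation(DomainError): ...


# Assumed pairing of exceptions to status codes; not confirmed by the source.
STATUS_BY_ERROR: dict[type[DomainError], int] = {
    EntityNotFoundError: 404,
    DuplicateEntityError: 409,
    InvalidStateTransition: 400,
    PermissionViolation: 403,
}


def to_http_status(exc: DomainError) -> int:
    """Walk the class hierarchy so subclasses inherit their parent's status."""
    for cls in type(exc).__mro__:
        if cls in STATUS_BY_ERROR:
            return STATUS_BY_ERROR[cls]
    return 400  # assumed default for unmapped domain errors
```

Keeping the table in one place means services raise plain domain exceptions and only the presentation layer knows about HTTP.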

TD-004: Mutable Global Settings at Runtime

Current state: The scores.py router mutates settings directly:

settings.SCORING_WEIGHT_TESTS = body.weight_tests
settings.SCORING_WEIGHT_DETECTION_RULES = body.weight_detection_rules

Impact: Changes lost on restart. Thread-unsafe with multiple workers. No audit trail for config changes.

Remediation: Persist scoring weights in the database. Create a ScoringConfig table. Load weights from DB in scoring_service.
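A rough sketch of the persistence approach follows, using the standard-library `sqlite3` module only so that it runs anywhere; the production database is PostgreSQL. The table name `scoring_config`, its columns, the weight keys, and the default values are all hypothetical.

```python
import sqlite3

# Hypothetical defaults; the real default weights are not given in this document.
DEFAULT_WEIGHTS = {"tests": 60, "detection_rules": 40}


def init_schema(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) scoring_config table if it is missing."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scoring_config ("
        " key TEXT PRIMARY KEY,"
        " weight INTEGER NOT NULL,"
        " updated_by TEXT NOT NULL,"
        " updated_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
    )


def save_weights(
    conn: sqlite3.Connection, weights: dict[str, int], updated_by: str
) -> None:
    """Persist weights atomically, recording who changed them."""
    if sum(weights.values()) != 100:
        raise ValueError("scoring weights must sum to 100")
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO scoring_config (key, weight, updated_by)"
            " VALUES (?, ?, ?)",
            [(key, weight, updated_by) for key, weight in weights.items()],
        )


def load_weights(conn: sqlite3.Connection) -> dict[str, int]:
    """Read weights from the database, falling back to defaults."""
    rows = conn.execute("SELECT key, weight FROM scoring_config").fetchall()
    return dict(rows) if rows else dict(DEFAULT_WEIGHTS)
```

Reading weights from the database on each scoring run (or through a shared cache) keeps all workers consistent and lets the `updated_by` column serve as a minimal audit trail.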

TD-005: Anemic Domain Models ✅ PARTIALLY RESOLVED

Current state (updated Feb 18): Rich domain entities now exist alongside the ORM models:

domain/test_entity.py — TestEntity dataclass with full state machine (can_transition(), transition_to(), start_execution(), submit_red_evidence(), submit_blue_evidence(), validate(), reopen()), dual validation, pause/resume timers, and domain events. Comprehensive unit tests (46 tests).
domain/entities/technique.py — TechniqueEntity with recalculate_status(), mark_reviewed(), flag_for_review(), create(), from_orm()/apply_to(). Comprehensive unit tests (16 tests).
domain/value_objects/mitre_id.py — Immutable value object with ATT&CK ID validation.
domain/value_objects/scoring_weights.py — Immutable weight set enforcing sum-to-100.

ORM models remain anemic (by design — they are persistence mapping only). Business logic lives in domain entities, bridged via from_orm()/apply_to().

Remaining: Campaign, ComplianceFramework, and other entities still lack domain entity counterparts.
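For entities that still lack a domain counterpart, the value-object pattern mentioned above is a cheap starting point. The sketch below shows one possible shape for an immutable ATT&CK ID; the class body is an assumption, not the contents of `mitre_id.py`. The format it accepts (`T` plus four digits, with an optional three-digit sub-technique suffix) is the public ATT&CK technique ID format.

```python
import re
from dataclasses import dataclass

# ATT&CK technique IDs look like T1059 or, for sub-techniques, T1059.001.
_MITRE_ID = re.compile(r"T\d{4}(\.\d{3})?")


@dataclass(frozen=True)
class MitreId:
    """Immutable value object; construction fails on malformed IDs."""

    value: str

    def __post_init__(self) -> None:
        if not _MITRE_ID.fullmatch(self.value):
            raise ValueError(f"invalid ATT&CK technique ID: {self.value!r}")

    @property
    def is_subtechnique(self) -> bool:
        return "." in self.value

    @property
    def parent(self) -> "MitreId":
        """Return the parent technique (itself if not a sub-technique)."""
        return MitreId(self.value.split(".")[0])
```

Validating once at construction means the rest of the domain code can accept a `MitreId` without re-checking the string.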

MEDIUM PRIORITY

TD-006: Inconsistent Error Response Format

Current state: API error responses use three different formats:

Format	Used In
`detail: "string"`	Most routers (`techniques.py`, `users.py`, `evidence.py`)
`detail: {message, code, ...}`	`tests.py`, `test_workflow_service.py`
`detail: "Validation error", code: "VALIDATION_ERROR", errors: [...]`	Global handler in `main.py`

Impact: Frontend must handle multiple error shapes. No reliable error code for programmatic handling.

Remediation: Standardize all errors to {detail: string, code: string, errors?: [...]}.
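One way to converge on the single shape is a small helper that every handler uses, plus a normalizer for the two legacy shapes. This is a sketch: the function names and the default code `"ERROR"` are hypothetical.

```python
from typing import Any, Optional


def error_body(
    detail: str, code: str, errors: Optional[list[dict[str, Any]]] = None
) -> dict[str, Any]:
    """Build the standard envelope: {detail, code, errors?}."""
    body: dict[str, Any] = {"detail": detail, "code": code}
    if errors:
        body["errors"] = errors
    return body


def normalize_legacy_detail(detail: Any, default_code: str = "ERROR") -> dict[str, Any]:
    """Convert a legacy `detail` (plain string or {message, code}) to the envelope."""
    if isinstance(detail, dict):
        return error_body(
            str(detail.get("message", "")),
            str(detail.get("code", default_code)),
        )
    return error_body(str(detail), default_code)
```

With this in place the frontend can always read `code` for programmatic handling and `detail` for display.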

TD-007: Silently Swallowed Exceptions in Workflow Service

Current state: test_workflow_service.py has 4 bare except Exception: pass blocks:

Line	What is swallowed
106	`notify_test_state_change()` failure
286	Notification failure
295	Notification failure
299	Score cache invalidation failure

Impact: Notification failures and cache invalidation errors go completely unnoticed. Users may miss critical workflow notifications with no trace in logs.

Remediation: Replace pass with logger.warning(...) at minimum. Consider async event dispatch so failures don't block the main flow.
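The minimal version of this fix can be captured in a small wrapper. The sketch below is illustrative only: the helper name `run_side_effect` and the logger name are hypothetical, and it does not attempt the async event dispatch mentioned above.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("aegis.workflow")  # hypothetical logger name


def run_side_effect(
    action: Callable[[], Any], description: str, **context: Any
) -> bool:
    """Run a non-critical side effect; log failures instead of swallowing them.

    Returns True on success and False on failure so the caller can continue
    the main workflow either way.
    """
    try:
        action()
        return True
    except Exception:
        # exc_info=True attaches the traceback to the log record.
        logger.warning(
            "%s failed (context=%s)", description, context, exc_info=True
        )
        return False
```

A call site would then read, for example, `run_side_effect(lambda: notify_test_state_change(test), "state-change notification", test_id=test.id)`, where the arguments shown are placeholders. The main flow is still never blocked, but each failure now leaves a traceback in the logs.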

TD-008: Test Suite Gaps

Current state: ~167 test functions across 18 files, but coverage is uneven:

Category	Covered	Not Covered
Routers	auth, techniques, tests, evidence, test_templates, metrics, system	audit, campaigns, compliance, d3fend, detection_rules, heatmap, operational_metrics, scores, snapshots, threat_actors, users
Services	workflow, status, atomic_import, campaign, scoring, notifications	audit, caldera, compliance_import, d3fend, elastic, intel, lolbas, mitre_sync, score_cache, sigma, threat_actor_import

Four integration tests are skipped by default via pytest.skip (the full Sigma, LOLBAS, CALDERA, and Elastic imports).

Some tests use inspect.getsource() to verify code structure rather than actually calling endpoints.

Impact: Regressions in untested routers/services go undetected. No security-focused tests (injection, rate limiting, CSRF).

Remediation: Add integration tests for all routers. Add dedicated security test suite. Run skipped integration tests in CI.

TD-009: No CI/CD Pipeline ✅ RESOLVED

Current state (updated Feb 18): A fully functional CI pipeline exists at .github/workflows/ci.yml:

Runs ruff linting on every push/PR.
Runs pytest against a real PostgreSQL + Redis service container.
Tests run against the same stack as production (not SQLite).

Additionally, scripts/agent_validate_backend.sh provides a local validation script that runs lint + tests inside the Docker container.

No further action needed for basic CI. Potential enhancements: add mypy type checking, Docker build verification.

TD-010: Unstructured Logging

Current state: Logging uses plain format strings with no structured fields:

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s  %(levelname)-8s  %(name)s — %(message)s",
)

Global exception handlers use logging.error(f"...") instead of a logger instance. No request ID, user ID, or correlation ID in log output.

Impact: Cannot query logs for "all actions by user X" or "all errors in request Y". Log analysis in production requires manual grep.

Remediation: Add structured JSON logging (e.g., structlog or python-json-logger). Include request_id middleware.
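If adding a dependency is undesirable, the same idea can be sketched with the standard library alone. The example below is an assumption-laden illustration, not the project's logging setup: the field names, the `request_id_var` context variable, and the convention of passing `user_id` through `extra=` are all hypothetical. A request middleware would be expected to set `request_id_var` at the start of each request.

```python
import contextvars
import json
import logging

# Set by a (hypothetical) request middleware; "-" outside a request.
request_id_var: contextvars.ContextVar[str] = contextvars.ContextVar(
    "request_id", default="-"
)


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "request_id": request_id_var.get(),
        }
        # Fields passed via `extra={"user_id": ...}` become record attributes.
        user_id = getattr(record, "user_id", None)
        if user_id is not None:
            payload["user_id"] = user_id
        if record.exc_info:
            payload["exc"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def configure_logging(level: int = logging.INFO) -> None:
    """Replace root handlers with a single JSON stream handler."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    root = logging.getLogger()
    root.handlers[:] = [handler]
    root.setLevel(level)
```

With output in this shape, questions such as "all errors in request Y" become a filter on `request_id` rather than a manual grep.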

LOW PRIORITY

TD-011: Entrypoint Scripts Have No Retry Logic

Current state: Both entrypoint.sh and entrypoint.prod.sh use set -e. If alembic upgrade head or python -m app.seed fails, Uvicorn never starts. No retry, no clear error message.

Impact: Transient DB connection failures during container startup cause the backend to fail permanently until manually restarted (Docker restart: always will retry, but seed may fail repeatedly).

Remediation: Add retry loop for migration with backoff. Make seed idempotent and non-fatal.

TD-012: No Database Migration Tests

Current state: Alembic migrations (18 versions) are never tested in isolation. The test suite uses in-memory SQLite with tables created from models, bypassing Alembic entirely.

Impact: Migration scripts may fail on real PostgreSQL (different dialect, JSONB handling) despite tests passing on SQLite.

Remediation: Add a CI step that runs alembic upgrade head against a real PostgreSQL container.

2. Scalability Risks

HIGH PRIORITY

SR-001: N+1 Query Explosion in Scoring Engine ✅ PARTIALLY RESOLVED

Current state (updated Feb 18): The worst N+1 patterns have been addressed:

scoring_service.py — bulk_technique_scores() performs 5 aggregated subqueries to fetch all scoring data in bulk, reducing organization-wide scoring from ~3,500 queries to ~5.
SATechniqueRepository.find_all_with_test_counts() — Single query with subqueries for test counts, validated test counts, and detection rule counts.
The heatmap service fetches techniques in batches with pre-aggregated counts (see SR-003).

Remaining: Individual technique scoring (calculate_technique_score()) still performs per-technique queries when called in isolation. create_snapshot() could benefit from using the bulk method.

Impact: Organization score calculation reduced from seconds to sub-second. Individual technique scoring unchanged.

SR-002: In-Memory Cache Does Not Scale

Current state: score_cache.py uses a Python dict with 300-second TTL. Each worker process has its own cache.

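Impact: With N workers, each has a cold cache on startup and after every TTL expiration, so the org score is recomputed up to N times per TTL window. Before the bulk optimization in SR-001, each miss cost 3,500+ queries; it now costs ~5, but the cache is still not shared and cannot be invalidated consistently across workers.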

Remediation: Move cache to Redis. Invalidate granularly when tests or techniques change.
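A minimal sketch of a shared cache is shown below. It is written against the small subset of the redis-py client interface it needs (`get`, `set` with `ex`, and `delete`), so it can be exercised with an in-memory stand-in. The class names, key prefix, and JSON serialization are assumptions; the 300-second TTL matches the current in-process cache.

```python
import json
from typing import Any, Callable, Optional, Protocol


class KeyValueStore(Protocol):
    """Subset of the redis-py client interface used by the cache."""

    def get(self, name: str) -> Optional[Any]: ...
    def set(self, name: str, value: str, ex: Optional[int] = None) -> Any: ...
    def delete(self, *names: str) -> Any: ...


class InMemoryStore:
    """Stand-in for local runs and tests; ignores expiry."""

    def __init__(self) -> None:
        self.data: dict[str, str] = {}

    def get(self, name: str) -> Optional[str]:
        return self.data.get(name)

    def set(self, name: str, value: str, ex: Optional[int] = None) -> None:
        self.data[name] = value

    def delete(self, *names: str) -> None:
        for name in names:
            self.data.pop(name, None)


class ScoreCache:
    def __init__(
        self, store: KeyValueStore, ttl_seconds: int = 300,
        prefix: str = "aegis:score:",  # hypothetical key prefix
    ) -> None:
        self._store = store
        self._ttl = ttl_seconds
        self._prefix = prefix

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        """Return the cached value, computing and storing it on a miss."""
        raw = self._store.get(self._prefix + key)
        if raw is not None:
            return json.loads(raw)
        value = compute()
        self._store.set(self._prefix + key, json.dumps(value), ex=self._ttl)
        return value

    def invalidate(self, *keys: str) -> None:
        """Drop specific entries, e.g. when a test or technique changes."""
        if keys:
            self._store.delete(*(self._prefix + k for k in keys))
```

Keying entries per technique (plus one for the organization score) is what makes granular invalidation possible; a single global key would force a full recompute on every change.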

SR-003: Heatmap Endpoints Load All Techniques Without Pagination ✅ RESOLVED

Current state (updated Feb 18): The heatmap service has been extracted and optimized:

services/heatmap_service.py — Dedicated service with batch-fetching techniques (pre-aggregated test_counts, rule_counts in 2 SQL subqueries instead of N+1).
The SATechniqueRepository.find_all_with_test_counts() method provides a single-query alternative for scoring/heatmap use cases.
Router reduced from ~528 lines to a thin delegation layer.

No further action needed for query performance. The repository method can replace direct usage in remaining endpoints.

MEDIUM PRIORITY

SR-004: Reports Load Full Tables Into Memory

Current state: All 4 report endpoints load unbounded result sets:

Endpoint	Pattern
`coverage-summary`	All techniques + per-technique test count query (N+1)
`coverage-csv`	Same as above + CSV serialization in memory
`test-results`	All tests, aggregated in Python
`remediation-status`	All tests, filtered in Python

Impact: For datasets with thousands of tests, memory usage spikes. No streaming — entire response built in memory before sending.

Remediation: Use SQL aggregations. Stream CSV output. Add date range filters as required parameters.
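The streaming part of this remediation can be sketched with a generator that serializes one row at a time. The function name and column names are hypothetical; `rows` stands in for a database cursor or any other lazy iterable.

```python
import csv
import io
from typing import Iterable, Iterator, Sequence


def stream_csv(
    header: Sequence[str], rows: Iterable[Sequence[object]]
) -> Iterator[str]:
    """Yield CSV text one line at a time instead of building it in memory."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)

    writer.writerow(header)
    yield buffer.getvalue()

    for row in rows:
        # Reuse the same buffer: reset it before serializing the next row.
        buffer.seek(0)
        buffer.truncate(0)
        writer.writerow(row)
        yield buffer.getvalue()
```

An iterator like this can be passed to FastAPI's `StreamingResponse` with `media_type="text/csv"`, so memory use stays flat regardless of how many rows the report contains, provided the rows themselves are fetched lazily.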

SR-005: Operational Metrics N+1 on Audit Logs

Current state: MTTD and MTTR calculations in operational_metrics_service.py load all validated tests, then query AuditLog twice per test to find state transition timestamps.

Impact: For 500 validated tests: 1,000 audit log queries. Grows linearly with test count.

Remediation: Denormalize key timestamps onto the Test model (e.g., red_started_at, blue_started_at, remediation_completed_at) or use a single batch audit log query with window functions.
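The batch-query option can be sketched as follows. Note that this example uses conditional aggregation (`MIN(CASE ...)` with `GROUP BY`) rather than window functions, and runs on the standard-library `sqlite3` module purely so it is self-contained; production is PostgreSQL. The column names follow the `audit_logs` indexes listed in SR-006, while the action values and the function name are hypothetical.

```python
import sqlite3
from typing import Optional

# One query for all tests, instead of two audit-log queries per test.
BATCH_SQL = """
SELECT entity_id,
       MIN(CASE WHEN action = :started  THEN timestamp END) AS started_at,
       MIN(CASE WHEN action = :detected THEN timestamp END) AS detected_at
FROM audit_logs
WHERE entity_type = 'test'
  AND action IN (:started, :detected)
GROUP BY entity_id
"""


def first_transition_times(
    conn: sqlite3.Connection, started_action: str, detected_action: str
) -> dict[str, tuple[Optional[str], Optional[str]]]:
    """Map each test ID to its earliest (started_at, detected_at) timestamps."""
    rows = conn.execute(
        BATCH_SQL, {"started": started_action, "detected": detected_action}
    )
    return {entity_id: (started, detected) for entity_id, started, detected in rows}
```

For 500 validated tests this replaces roughly 1,000 round trips with one query, and the existing `(entity_type, entity_id, action)` index should support the filter.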

SR-006: Missing Database Indexes ✅ RESOLVED

Current state (updated Feb 18): All critical indexes are now in place:

Table	Index	Status
`tests`	`(technique_id, state)`	✅ Exists (model `__table_args__`)
`tests`	`(created_at)`, `(state, created_at)`	✅ Added in migration `b024`
`techniques`	`(tactic)`	✅ Added in migration `b026`
`techniques`	`(status_global)`	✅ Added in migration `b026`
`audit_logs`	`(entity_type, entity_id)`, `(timestamp)`, `(entity_type, entity_id, action)`	✅ Exists (model `__table_args__`)
`detection_rules`	`(mitre_technique_id)`, `(source)`, `(severity)`	✅ Exists (model `__table_args__`)

No further action needed.

LOW PRIORITY

SR-007: Single-Instance Scheduler Constraint

Current state: APScheduler runs in-process. If multiple backend instances exist, each runs its own scheduler — causing duplicate MITRE syncs, duplicate snapshots, duplicate campaign spawns.

Impact: No impact today (single instance), but blocks horizontal scaling.

Remediation: Use APScheduler PostgreSQL JobStore for distributed locking. Or migrate to Celery Beat.

SR-008: Evidence Presigned URLs Point to Internal Hostname

Current state: MinIO presigned URLs contain minio:9000 (Docker internal hostname), which is not resolvable from the user's browser.

Impact: Evidence download links fail in production unless Nginx proxies MinIO or MinIO has a public endpoint.

Remediation: Configure MINIO_EXTERNAL_ENDPOINT env var. Use it when generating presigned URLs.

3. Security Risks

HIGH PRIORITY

SEC-001: In-Memory Token Blacklist ✅ RESOLVED

Current state (updated Feb 18): The token blacklist is now Redis-backed:

infrastructure/redis_client.py — Singleton Redis connection.
auth.py — blacklist_token() and is_token_blacklisted() use Redis with TTL matching token expiration.
Shared across all workers. Survives server restarts.

No further action needed.

SEC-002: Default Credentials in Configuration ✅ RESOLVED

Current state (updated Feb 18): Production startup validation now rejects default credentials:

config.py — SECRET_KEY validation already existed; now also checks MINIO_ACCESS_KEY and MINIO_SECRET_KEY against their defaults (minioadmin). Fails fast with RuntimeError in production mode.
The install.sh script generates random passwords for production.

No further action needed.

SEC-003: Rate Limiting Only on Login

Current state: SlowAPI rate limiting is applied only to POST /auth/login (5/minute). All other endpoints have no rate limits:

Unprotected Endpoint	Risk
`POST /users`	Bulk user creation
`POST /tests`	Resource exhaustion
`POST /system/sync-mitre`	Repeated expensive syncs
`POST /system/import-atomic-tests`	Repeated 40MB ZIP downloads
`POST /tests/{id}/evidence`	Large file upload flooding
`GET /reports/*`	Expensive report generation DoS

Impact: An authenticated attacker (or compromised account) can DoS the system by triggering expensive operations repeatedly.

Remediation: Add tiered rate limits: strict on auth, moderate on write endpoints, relaxed on read endpoints. Add specific limits on sync/import endpoints (1/hour).
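The tiering itself can be expressed as a small lookup that returns a limit string in the `count/period` notation SlowAPI already uses for the login endpoint. This is a sketch of the policy only, not of the SlowAPI wiring. The login and sync/import limits come from this document; the write, report, and read values are placeholder assumptions to be tuned, as is applying the strict tier to every path under `/auth/`.

```python
import re

AUTH_LIMIT = "5/minute"      # current login limit, from this document
SYNC_LIMIT = "1/hour"        # proposed above for sync/import endpoints
REPORT_LIMIT = "10/minute"   # hypothetical: reports are expensive reads
WRITE_LIMIT = "30/minute"    # hypothetical moderate tier
READ_LIMIT = "120/minute"    # hypothetical relaxed tier

_SYNC_PATHS = re.compile(r"^/system/(sync-mitre|import-atomic-tests)$")
_WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


def limit_for(method: str, path: str) -> str:
    """Pick the rate-limit tier for a request, most specific rule first."""
    method = method.upper()
    if path.startswith("/auth/"):
        return AUTH_LIMIT
    if _SYNC_PATHS.match(path):
        return SYNC_LIMIT
    if path.startswith("/reports/"):
        return REPORT_LIMIT
    if method in _WRITE_METHODS:
        return WRITE_LIMIT
    return READ_LIMIT
```

Keeping the policy in one function makes the tiers reviewable in a single place, however the limits are ultimately attached to routes.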

MEDIUM PRIORITY

SEC-004: No Input Validation on Username ✅ RESOLVED

Current state (updated Feb 18): Username validation is now enforced:

schemas/user.py — _validate_username() function checks: 3-50 characters, only letters/digits/underscores/hyphens, rejects reserved names (admin, root, system, api, null, undefined, administrator, superuser, aegis).
Applied via @field_validator("username") on UserCreate.
9 unit tests covering valid, invalid, and reserved usernames.

No further action needed.

SEC-005: Timing-Based User Enumeration on Login ✅ RESOLVED

Current state (updated Feb 18): Login now takes a consistent amount of time whether or not the username exists:

routers/auth.py — Always runs verify_password() against a dummy bcrypt hash when user is not found, ensuring consistent response time regardless of whether the username exists.

No further action needed.
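For readers unfamiliar with the technique, the idea can be sketched with the standard library. This example substitutes PBKDF2 (`hashlib.pbkdf2_hmac`) for bcrypt so that it has no third-party dependency; the function names, the in-memory user table, and the iteration count are assumptions and do not reflect the code in routers/auth.py.

```python
import hashlib
import hmac
import os

_ITERATIONS = 100_000  # illustrative work factor


def hash_password(password: str, salt: bytes) -> bytes:
    """Derive a password hash (PBKDF2 here; the real code uses bcrypt)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, _ITERATIONS)


# Computed once at import time; used when the username does not exist.
_DUMMY_SALT = os.urandom(16)
_DUMMY_HASH = hash_password("dummy-password", _DUMMY_SALT)


def authenticate(
    users: dict[str, tuple[bytes, bytes]], username: str, password: str
) -> bool:
    """Verify credentials, doing the same hashing work for unknown users.

    `users` maps username -> (salt, password_hash).
    """
    record = users.get(username)
    salt, expected = record if record else (_DUMMY_SALT, _DUMMY_HASH)
    candidate = hash_password(password, salt)
    matches = hmac.compare_digest(candidate, expected)
    # An unknown user never authenticates, even if the dummy hash matched.
    return matches and record is not None
```

The expensive hash runs on every path, so response time no longer reveals whether the account exists.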

SEC-006: Pydantic Validation Errors Leak Schema Details

Current state: The global validation error handler returns full Pydantic error details:

content={
    "detail": "Validation error",
    "code": "VALIDATION_ERROR",
    "errors": exc.errors(),  # Full field paths, types, constraints
}

Impact: Attackers can probe endpoints to discover internal field names, types, and validation rules.

Remediation: Sanitize error output in production. Return field names and human-readable messages only, strip internal type information.
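A possible sanitizer is sketched below. It assumes the error dictionaries have the `loc` and `msg` keys that Pydantic's `errors()` produces and that, as in FastAPI request validation errors, the first `loc` element names the request part (`body`, `query`, and so on). The function name and output keys are hypothetical.

```python
from typing import Any

# First element of `loc` in FastAPI request validation errors.
_REQUEST_PARTS = {"body", "query", "path", "header", "cookie"}


def sanitize_validation_errors(errors: list[dict[str, Any]]) -> list[dict[str, str]]:
    """Keep only the field path and the human-readable message.

    Drops internal details such as `type`, `ctx`, `input`, and `url`.
    """
    sanitized = []
    for err in errors:
        loc = list(err.get("loc", ()))
        if loc and loc[0] in _REQUEST_PARTS:
            loc = loc[1:]
        sanitized.append(
            {
                "field": ".".join(str(part) for part in loc),
                "message": str(err.get("msg", "Invalid value")),
            }
        )
    return sanitized
```

The handler could call this only when running in production and return the full `exc.errors()` output in development, where the extra detail helps debugging.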

SEC-007: No Password Complexity Requirements ✅ RESOLVED

Current state (updated Feb 18): Password complexity is enforced:

schemas/user.py — _validate_password_strength() requires: minimum 12 characters, at least one uppercase, one lowercase, one digit, one special character.
Applied on UserCreate, UserUpdate, and PasswordChange schemas.
6 unit tests covering all complexity rules.

No further action needed.

LOW PRIORITY

SEC-008: CORS Origins Not Validated in Production

Current state: CORS_ORIGINS is a comma-separated string from environment. If set to * or overly broad patterns, credentials (HttpOnly cookies) are sent to unintended origins.

Impact: Low (requires misconfiguration), but could enable cross-origin attacks.

Remediation: Validate CORS_ORIGINS at startup — reject * when AEGIS_ENV=production.
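A startup check along these lines might look like the sketch below. The function name is hypothetical, and the additional rule that each production origin must be a full `http(s)://host` URL is an assumption beyond the wildcard rejection proposed above.

```python
from urllib.parse import urlsplit


def parse_cors_origins(raw: str, env: str) -> list[str]:
    """Split the comma-separated CORS_ORIGINS value and validate it.

    In production, wildcards and malformed origins fail fast at startup.
    """
    origins = [origin.strip() for origin in raw.split(",") if origin.strip()]
    if env == "production":
        for origin in origins:
            if "*" in origin:
                raise RuntimeError(
                    f"wildcard CORS origin not allowed in production: {origin!r}"
                )
            parts = urlsplit(origin)
            if parts.scheme not in ("http", "https") or not parts.netloc:
                raise RuntimeError(f"invalid CORS origin: {origin!r}")
    return origins
```

Called with the value of `CORS_ORIGINS` and `AEGIS_ENV` during configuration loading, this mirrors the fail-fast behavior already used for default credentials in SEC-002.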

SEC-009: No Audit Log for Failed Login Attempts

Current state: Login attempts are not audited, whether they succeed or fail. Only post-login actions are recorded.

Impact: Cannot detect brute force attacks or compromised account usage patterns.

Remediation: Log all login attempts (success/failure) to audit_logs with IP address and timestamp.

4. Maintainability Risks

HIGH PRIORITY

MR-001: No Dependency Inversion — Everything Points to Concrete Implementations ✅ PARTIALLY RESOLVED

Current state (updated Feb 18): Protocol interfaces and dependency injection now exist for core entities:

domain/ports/repositories/technique_repository.py — @runtime_checkable Protocol.
domain/ports/repositories/test_repository.py — @runtime_checkable Protocol.
dependencies/repositories.py — FastAPI Depends() wiring for SATechniqueRepository and SATestRepository.
domain/unit_of_work.py — UnitOfWork context manager for transaction control.

Remaining: Services like notification_service, audit_service, scoring_service still use direct imports. Additional ports needed for storage, notifications, and event bus.

Impact: New code follows DIP. Old code will be migrated incrementally.

MR-002: Two Coexisting Architectural Patterns

Current state: Some routers delegate to services, others do everything inline. A developer cannot predict where to find or place logic.

Pattern	Routers
Delegates to services	tests, scores, notifications, campaigns, snapshots
Direct DB queries	techniques, evidence, users, audit, reports, heatmap, metrics, detection_rules, threat_actors, data_sources, compliance

Impact: Inconsistent codebase. New developers learn one pattern and find the other. Code reviews cannot enforce a single standard.

Remediation: Establish a single pattern (all through services/use cases) and migrate incrementally.

MEDIUM PRIORITY

MR-003: No Type Checking Enforcement

Current state: tsconfig.json has strict: true for the frontend, but the backend has no mypy configuration. Python type hints exist but are never verified.

Impact: Type errors in Python code go undetected until runtime. Particularly risky for Optional fields and JSONB data.

Remediation: Add mypy to requirements. Create mypy.ini with strict settings. Add to CI pipeline.

MR-004: Test Infrastructure Uses SQLite Instead of PostgreSQL

Current state: conftest.py creates an in-memory SQLite database and patches PostgreSQL-specific types (UUID → String, JSONB → JSON):

from sqlalchemy.dialects.postgresql import UUID as PG_UUID, JSONB
# Patch to use SQLite-compatible types
sqlalchemy.dialects.postgresql.UUID = _patched_uuid
sqlalchemy.dialects.postgresql.JSONB = _patched_jsonb

Impact: Tests pass on SQLite but may fail on real PostgreSQL. JSONB-specific queries (containment @>, GIN indexes) are untestable. UUID behavior differs between dialects.

Remediation: Use testcontainers-python to spin up a real PostgreSQL container for tests, or use PostgreSQL in CI.

MR-005: Frontend Types Not Generated from Backend Schemas

Current state: types/models.ts is manually maintained and must stay in sync with schemas/*.py. There is no code generation or validation step.

Impact: Type drift between frontend and backend. A backend schema change that isn't reflected in types/models.ts causes runtime errors in the frontend.

Remediation: Generate TypeScript types from OpenAPI spec (openapi-typescript or similar). Run as a pre-build step.

LOW PRIORITY

MR-006: Documentation Scattered Across Multiple Formats

Current state: Documentation exists in README.md, docs/API.md, docs/ARCHITECTURE.md, docs/DATA_SOURCES.md, docs/SCORING.md, plus the new analysis documents. No central index or documentation site.

Impact: New developers must discover docs by browsing. No searchable documentation.

Remediation: Create a docs/INDEX.md linking all documents. Consider MkDocs or similar for a browsable doc site.

MR-007: No Conventional Commit or Changelog

Current state: No commit message convention enforced. No CHANGELOG file.

Impact: Difficult to understand what changed between releases. No automated release notes.

Remediation: Adopt Conventional Commits. Add commitlint as a pre-commit hook. Generate CHANGELOG automatically.

5. Recommended Medium-Term Improvements

Architecture

ID	Improvement	Effort	Impact	Status
IMP-001	Extract domain exceptions + error handler middleware	2-3 days	Removes FastAPI dependency from services	✅ Done
IMP-002	Create repository layer for Test, Technique, Campaign	1 week	Centralizes queries, enables caching and mocking	✅ Done (Test, Technique)
IMP-003	Extract heatmap/reports/metrics logic to application services	1-2 weeks	Thin controllers, testable business logic	✅ Heatmap done
IMP-004	Persist scoring weights in database	2-3 days	Eliminates mutable global state	Pending
IMP-005	Add domain entities with behavior (rich models)	2-3 weeks	Consolidates scattered business rules	✅ Done (Test, Technique)

Scalability

ID	Improvement	Effort	Impact	Status
IMP-006	Batch scoring queries (single SQL per metric)	1 week	Reduces org score from 3,500 queries to ~10	✅ Done
IMP-007	Add missing composite indexes	1 day	Immediate query performance improvement	✅ Done
IMP-008	Move score cache to Redis	2-3 days	Shared cache across workers	Pending
IMP-009	Batch heatmap metadata queries	2-3 days	Reduces heatmap from 1,400 to 3 queries	✅ Done
IMP-010	Denormalize MTTD/MTTR timestamps onto Test model	3-5 days	Eliminates operational metrics N+1	Pending

Security

ID	Improvement	Effort	Impact	Status
IMP-011	Move token blacklist to Redis	1-2 days	Fixes multi-instance logout	✅ Done
IMP-012	Reject default credentials in production	0.5 days	Prevents insecure deployments	✅ Done
IMP-013	Add rate limiting to write/sync endpoints	1 day	Prevents DoS from authenticated users	Pending
IMP-014	Add password complexity validation	0.5 days	Prevents weak passwords	✅ Done
IMP-015	Add login attempt auditing	1 day	Enables brute force detection	Pending

DevOps

ID	Improvement	Effort	Impact	Status
IMP-016	Create GitHub Actions CI pipeline	1-2 days	Automated lint + type check + test	✅ Done
IMP-017	Add mypy strict type checking	1-2 days	Catches type errors before runtime	Pending
IMP-018	Replace SQLite test DB with PostgreSQL (testcontainers)	1 day	Tests match production behavior	✅ CI uses PG
IMP-019	Generate frontend types from OpenAPI	0.5 days	Eliminates frontend/backend type drift	Pending
IMP-020	Add structured JSON logging	1-2 days	Production-ready observability	Pending

6. Priority Matrix

Immediate (Sprint 1 — Week 1-2)

ID	Item	Category	Effort
SEC-001	Move token blacklist to Redis	Security	1-2 days
SEC-002	Reject default credentials in production	Security	0.5 days
SR-006	Add missing database indexes	Scalability	1 day
TD-007	Replace `except: pass` with logging in workflow service	Tech Debt	0.5 days
SEC-007	Add password complexity requirements	Security	0.5 days
IMP-016	Create basic CI pipeline	DevOps	1-2 days

Total estimated effort: ~5-7 days

Short-Term (Sprint 2-3 — Week 3-6)

ID	Item	Category	Effort
TD-003	Extract domain exceptions, remove HTTPException from services	Tech Debt	2-3 days
SR-001	Batch scoring queries to eliminate N+1	Scalability	1 week
SR-003	Batch heatmap metadata queries	Scalability	2-3 days
TD-002	Create repository layer for core entities	Tech Debt	1 week
SEC-003	Add rate limiting to write/sync endpoints	Security	1 day
TD-004	Persist scoring weights in database	Tech Debt	2-3 days
SEC-009	Add login attempt auditing	Security	1 day
IMP-008	Move score cache to Redis	Scalability	2-3 days

Total estimated effort: ~3-4 weeks

Medium-Term (Month 2-3)

ID	Item	Category	Effort
TD-001	Extract heatmap/reports/metrics to application services	Tech Debt	2-3 weeks
TD-008	Expand test coverage to all routers and services	Maintainability	2-3 weeks
TD-005	Create rich domain entities (Clean Architecture Phase 2)	Tech Debt	2-3 weeks
SR-004	Optimize report endpoints (SQL aggregations, streaming)	Scalability	1 week
SR-005	Denormalize MTTD/MTTR timestamps	Scalability	3-5 days
MR-003	Add mypy type checking	Maintainability	1-2 days
MR-004	Replace SQLite tests with PostgreSQL	Maintainability	1 day
MR-005	Generate frontend types from OpenAPI	Maintainability	0.5 days
IMP-020	Structured JSON logging	DevOps	1-2 days

Total estimated effort: ~6-8 weeks

Low Priority (Backlog)

ID	Item	Category	Effort
TD-011	Add retry logic to entrypoint scripts	Tech Debt	0.5 days
TD-012	Add migration tests against real PostgreSQL	Maintainability	1 day
SR-007	APScheduler PostgreSQL JobStore for horizontal scaling	Scalability	2-3 days
SR-008	Fix MinIO presigned URL hostname for production	Scalability	1 day
SEC-008	Validate CORS origins in production	Security	0.5 days
SEC-005	Constant-time login to prevent user enumeration	Security	0.5 days
SEC-006	Sanitize Pydantic validation errors in production	Security	1 day
MR-006	Create documentation index / MkDocs site	Maintainability	1-2 days
MR-007	Adopt Conventional Commits + CHANGELOG	Maintainability	1 day
SEC-004	Username input validation	Security	0.5 days

Summary Scorecard

┌────────────────────────────────┬──────┬─────┬─────┬───────┬──────────┬──────┐
│  Category                      │ High │ Med │ Low │ Total │ Resolved │ Open │
├────────────────────────────────┼──────┼─────┼─────┼───────┼──────────┼──────┤
│  Technical Debt                │   5  │  5  │  2  │  12   │    4     │   8  │
│  Scalability Risks             │   3  │  3  │  2  │   8   │    3     │   5  │
│  Security Risks                │   3  │  4  │  2  │   9   │    5     │   4  │
│  Maintainability Risks         │   2  │  3  │  2  │   7   │    1     │   6  │
├────────────────────────────────┼──────┼─────┼─────┼───────┼──────────┼──────┤
│  TOTAL                         │  13  │ 15  │  8  │  36   │   13     │  23  │
└────────────────────────────────┴──────┴─────┴─────┴───────┴──────────┴──────┘

Resolved: 13 of 36 items (36%), counting partially resolved items as resolved
Remaining estimated effort: ~7-9 weeks (down from ~12-15 weeks)
