# Aegis: Technical Debt, Risks & Improvement Plan

> **Author:** Architecture review
> **Date:** February 11, 2026 (updated February 18, 2026)
> **Scope:** Backend, Frontend, Infrastructure, Security, Scalability, Maintainability
>
> **Note:** Items marked with ✅ have been resolved. See inline status annotations.

---

## Table of Contents

1. [Technical Debt](#1-technical-debt)
2. [Scalability Risks](#2-scalability-risks)
3. [Security Risks](#3-security-risks)
4. [Maintainability Risks](#4-maintainability-risks)
5. [Recommended Medium-Term Improvements](#5-recommended-medium-term-improvements)
6. [Priority Matrix](#6-priority-matrix)

---

## 1. Technical Debt

### HIGH PRIORITY

#### TD-001: Fat Controllers (Routers with Embedded Business Logic)

**Current state:** 11 of 21 routers execute raw SQLAlchemy queries directly. The worst offenders:

| Router | Lines | Embedded Logic |
|--------|-------|----------------|
| `heatmap.py` | 528 | Query building + color mapping + ATT&CK Navigator JSON serialization + export |
| `tests.py` | 664 | CRUD + template instantiation + timeline queries (workflow delegated) |
| `reports.py` | 273 | Aggregation queries + CSV generation + JSON formatting |
| `compliance.py` | ~350 | CRUD + import + gap analysis + CSV export |
| `metrics.py` | ~316 | Complex aggregation queries with in-memory processing |

**Impact:** Cannot unit test business logic without spinning up FastAPI + DB. Logic duplication across routers. Changes to one query pattern must be replicated manually in every router that uses it.

**Remediation:** Extract query logic to a service/repository layer. Each router endpoint should be under 20 lines.

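
The extraction described in TD-001 can be sketched as follows. This is a minimal illustration, not the project's actual code: `CoverageReader`, `CoverageReportService`, `InMemoryReader`, and `coverage_summary_endpoint` are hypothetical names, and the plain function stands in for a FastAPI route so the sketch runs without a web framework or database.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class TechniqueCoverage:
    """Hypothetical read model: one row per technique."""
    mitre_id: str
    tests_total: int
    tests_validated: int


class CoverageReader(Protocol):
    """Hypothetical port: anything that can return per-technique coverage rows."""

    def list_coverage(self) -> list[TechniqueCoverage]: ...


class CoverageReportService:
    """Business logic lives here, with no FastAPI or SQLAlchemy imports."""

    def __init__(self, reader: CoverageReader) -> None:
        self._reader = reader

    def summary(self) -> dict[str, float]:
        rows = self._reader.list_coverage()
        total = len(rows)
        covered = sum(1 for row in rows if row.tests_validated > 0)
        pct = round(100.0 * covered / total, 1) if total else 0.0
        return {"techniques": total, "covered": covered, "coverage_pct": pct}


def coverage_summary_endpoint(service: CoverageReportService) -> dict[str, float]:
    """Stand-in for the router body: delegate to the service and return."""
    return service.summary()


class InMemoryReader:
    """Test double that satisfies CoverageReader without a database."""

    def __init__(self, rows: list[TechniqueCoverage]) -> None:
        self._rows = rows

    def list_coverage(self) -> list[TechniqueCoverage]:
        return list(self._rows)
```

With this split, the aggregation rule can be unit tested by passing an `InMemoryReader` to the service, while the real router would only wire a database-backed reader through `Depends()`.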

---

#### TD-002: No Repository Layer (Scattered Duplicate Queries) ✅ PARTIALLY RESOLVED

**Current state (updated Feb 18):** Repository ports (Protocol interfaces) and SQLAlchemy implementations now exist for `Technique` and `Test`:

- `domain/ports/repositories/technique_repository.py`: Protocol with `find_by_id()`, `find_by_mitre_id()`, `list_all()`, `count_by_status()`, `find_all_with_test_counts()`, `save()`, etc.
- `domain/ports/repositories/test_repository.py`: Protocol with `find_by_id()`, `list_by_technique()`, `get_states_and_results_for_technique()`, etc.
- `infrastructure/persistence/repositories/sa_technique_repository.py`: Concrete implementation with batch queries (eliminates N+1 for heatmap/scoring).
- `infrastructure/persistence/repositories/sa_test_repository.py`: Concrete implementation.
- `dependencies/repositories.py`: FastAPI `Depends()` wiring.

**Remaining:** Old routers still use direct `db.query()`. New endpoints should use repositories; existing endpoints will be migrated incrementally.

**Impact:** New code has centralized query management. Old queries are still scattered but coexist safely.

---

#### TD-003: Services Depend on FastAPI (HTTPException in Domain Logic) ✅ RESOLVED

**Current state (updated Feb 18):** Domain exceptions have been implemented and are in active use:

- `domain/errors.py`: Full exception hierarchy: `DomainError`, `EntityNotFoundError`, `DuplicateEntityError`, `InvalidStateTransition`, `BusinessRuleViolation`, `InvalidOperationError`, `PermissionViolation`.
- `domain/exceptions.py`: Backward-compatible re-exports.
- `middleware/error_handler.py`: Maps domain exceptions to HTTP responses automatically (404, 409, 400, 403).
- `test_workflow_service.py`: Now raises `InvalidOperationError` and `InvalidStateTransition` instead of `HTTPException`.

**No further action needed** for the core services. Some secondary routers may still raise `HTTPException` directly (which is acceptable at the presentation layer).

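
The exception-to-HTTP mapping described in TD-003 might look roughly like the sketch below. The class names come from `domain/errors.py` as listed above, but which exception maps to which status code, the 400 fallback, and the `to_http_response()` helper are assumptions made for illustration; the real `middleware/error_handler.py` may differ.

```python
class DomainError(Exception):
    """Base class for errors raised by domain logic."""


class EntityNotFoundError(DomainError):
    pass


class DuplicateEntityError(DomainError):
    pass


class InvalidStateTransition(DomainError):
    pass


class PermissionViolation(DomainError):
    pass


# Assumed mapping; the document only states that 404, 409, 400 and 403 are used.
STATUS_BY_ERROR = {
    EntityNotFoundError: 404,
    DuplicateEntityError: 409,
    InvalidStateTransition: 400,
    PermissionViolation: 403,
}


def to_http_response(exc: DomainError) -> tuple[int, dict]:
    """Translate a domain error into (status_code, body) at the API boundary."""
    # Walk the MRO so subclasses inherit their parent's status code.
    for cls in type(exc).__mro__:
        if cls in STATUS_BY_ERROR:
            status = STATUS_BY_ERROR[cls]
            break
    else:
        status = 400  # assumed fallback for unmapped domain errors
    return status, {"detail": str(exc), "code": type(exc).__name__}
```

Because services raise only domain errors, they stay importable and testable without FastAPI; the translation happens once, at the presentation layer.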

---

#### TD-004: Mutable Global Settings at Runtime

**Current state:** The `scores.py` router mutates `settings` directly:

```python
settings.SCORING_WEIGHT_TESTS = body.weight_tests
settings.SCORING_WEIGHT_DETECTION_RULES = body.weight_detection_rules
```

**Impact:** Changes are lost on restart. Thread-unsafe with multiple workers. No audit trail for config changes.

**Remediation:** Persist scoring weights in the database. Create a `ScoringConfig` table. Load weights from the DB in `scoring_service`.

---

#### TD-005: Anemic Domain Models ✅ PARTIALLY RESOLVED

**Current state (updated Feb 18):** Rich domain entities now exist alongside the ORM models:

- `domain/test_entity.py`: `TestEntity` dataclass with full state machine (`can_transition()`, `transition_to()`, `start_execution()`, `submit_red_evidence()`, `submit_blue_evidence()`, `validate()`, `reopen()`), dual validation, pause/resume timers, and domain events. Comprehensive unit tests (46 tests).
- `domain/entities/technique.py`: `TechniqueEntity` with `recalculate_status()`, `mark_reviewed()`, `flag_for_review()`, `create()`, `from_orm()`/`apply_to()`. Comprehensive unit tests (16 tests).
- `domain/value_objects/mitre_id.py`: Immutable value object with ATT&CK ID validation.
- `domain/value_objects/scoring_weights.py`: Immutable weight set enforcing sum-to-100.

**ORM models remain anemic** (by design: they are persistence mapping only). Business logic lives in domain entities, bridged via `from_orm()`/`apply_to()`.

**Remaining:** Campaign, ComplianceFramework, and other entities still lack domain entity counterparts.

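
To illustrate the immutable, sum-to-100 value object mentioned for `scoring_weights.py`, here is a simplified sketch. It assumes only the two weights named in TD-004 (tests and detection rules); the real value object likely carries more fields, and the `with_tests()` helper is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoringWeights:
    """Simplified two-field weight set; frozen so it cannot be mutated in place."""

    tests: int
    detection_rules: int

    def __post_init__(self) -> None:
        # Validate on construction so an invalid weight set can never exist.
        for name, value in (("tests", self.tests),
                            ("detection_rules", self.detection_rules)):
            if value < 0:
                raise ValueError(f"{name} weight must be non-negative, got {value}")
        total = self.tests + self.detection_rules
        if total != 100:
            raise ValueError(f"weights must sum to 100, got {total}")

    def with_tests(self, tests: int) -> "ScoringWeights":
        """Return a new instance instead of mutating (hypothetical helper)."""
        return ScoringWeights(tests=tests, detection_rules=100 - tests)
```

A value object like this is also a natural fit for TD-004: the router would build a new `ScoringWeights` from the request body and persist it, rather than assigning to global `settings`.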

---

### MEDIUM PRIORITY

#### TD-006: Inconsistent Error Response Format

**Current state:** API error responses use three different formats:

| Format | Used In |
|--------|---------|
| `detail: "string"` | Most routers (`techniques.py`, `users.py`, `evidence.py`) |
| `detail: {message, code, ...}` | `tests.py`, `test_workflow_service.py` |
| `detail: "Validation error", code: "VALIDATION_ERROR", errors: [...]` | Global handler in `main.py` |

**Impact:** Frontend must handle multiple error shapes. No reliable error code for programmatic handling.

**Remediation:** Standardize all errors to `{detail: string, code: string, errors?: [...]}`.

---

#### TD-007: Silently Swallowed Exceptions in Workflow Service

**Current state:** `test_workflow_service.py` has 4 `except Exception: pass` blocks:

| Line | What is swallowed |
|------|-------------------|
| 106 | `notify_test_state_change()` failure |
| 286 | Notification failure |
| 295 | Notification failure |
| 299 | Score cache invalidation failure |

**Impact:** Notification failures and cache invalidation errors go completely unnoticed. Users may miss critical workflow notifications with no trace in logs.

**Remediation:** Replace `pass` with `logger.warning(...)` at minimum. Consider async event dispatch so failures don't block the main flow.

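
The minimum remediation for TD-007 could be sketched like this. The helper name `run_side_effect()` and the logger name `aegis.workflow` are assumptions for illustration; the point is only that the failure is logged with its traceback instead of being discarded, while the main workflow still continues.

```python
import logging

logger = logging.getLogger("aegis.workflow")  # assumed logger name


def run_side_effect(action, description: str, **context) -> bool:
    """Run a non-critical side effect; log failures instead of swallowing them.

    Returns True on success and False on failure, so the caller can keep
    going without the exception propagating into the main workflow.
    """
    try:
        action()
        return True
    except Exception:
        # exc_info=True attaches the traceback to the log record.
        logger.warning("%s failed (context=%s)", description, context, exc_info=True)
        return False
```

A call site would then read, for example, `run_side_effect(lambda: notify_test_state_change(test), "notify_test_state_change", test_id=test.id)`, where the arguments passed to `notify_test_state_change()` are assumed here.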

---

#### TD-008: Test Suite Gaps

**Current state:** ~167 test functions across 18 files, but coverage is uneven:

| Category | Covered | Not Covered |
|----------|---------|-------------|
| **Routers** | auth, techniques, tests, evidence, test_templates, metrics, system | audit, campaigns, compliance, d3fend, detection_rules, heatmap, operational_metrics, scores, snapshots, threat_actors, users |
| **Services** | workflow, status, atomic_import, campaign, scoring, notifications | audit, caldera, compliance_import, d3fend, elastic, intel, lolbas, mitre_sync, score_cache, sigma, threat_actor_import |

Four integration tests are skipped by default via `pytest.skip` (Sigma, LOLBAS, CALDERA, Elastic full imports). Some tests use `inspect.getsource()` to verify code structure rather than actually calling endpoints.

**Impact:** Regressions in untested routers/services go undetected. No security-focused tests (injection, rate limiting, CSRF).

**Remediation:** Add integration tests for all routers. Add a dedicated security test suite. Run skipped integration tests in CI.

---

#### TD-009: No CI/CD Pipeline ✅ RESOLVED

**Current state (updated Feb 18):** A fully functional CI pipeline exists at `.github/workflows/ci.yml`:

- Runs `ruff` linting on every push/PR.
- Runs `pytest` against real PostgreSQL + Redis service containers.
- Tests run against the same stack as production (not SQLite).

Additionally, `scripts/agent_validate_backend.sh` provides a local validation script that runs lint + tests inside the Docker container.

**No further action needed** for basic CI. Potential enhancements: add `mypy` type checking, Docker build verification.

---

#### TD-010: Unstructured Logging

**Current state:** Logging uses plain format strings with no structured fields:

```python
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)-8s %(name)s — %(message)s",
)
```

Global exception handlers use `logging.error(f"...")` instead of a logger instance.
No request ID, user ID, or correlation ID in log output.

**Impact:** Cannot query logs for "all actions by user X" or "all errors in request Y". Log analysis in production requires manual grep.

**Remediation:** Add structured JSON logging (e.g., `structlog` or `python-json-logger`). Include request_id middleware.

---

### LOW PRIORITY

#### TD-011: Entrypoint Scripts Have No Retry Logic

**Current state:** Both `entrypoint.sh` and `entrypoint.prod.sh` use `set -e`. If `alembic upgrade head` or `python -m app.seed` fails, Uvicorn never starts. No retry, no clear error message.

**Impact:** Transient DB connection failures during container startup cause the backend to fail permanently until manually restarted (Docker `restart: always` will retry, but seed may fail repeatedly).

**Remediation:** Add a retry loop with backoff for the migration. Make seed idempotent and non-fatal.

---

#### TD-012: No Database Migration Tests

**Current state:** Alembic migrations (18 versions) are never tested in isolation. The test suite uses in-memory SQLite with tables created from models, bypassing Alembic entirely.

**Impact:** Migration scripts may fail on real PostgreSQL (different dialect, JSONB handling) despite tests passing on SQLite.

**Remediation:** Add a CI step that runs `alembic upgrade head` against a real PostgreSQL container.

---

## 2. Scalability Risks

### HIGH PRIORITY

#### SR-001: N+1 Query Explosion in Scoring Engine ✅ PARTIALLY RESOLVED

**Current state (updated Feb 18):** The worst N+1 patterns have been addressed:

- `scoring_service.py`: `bulk_technique_scores()` performs 5 aggregated subqueries to fetch all scoring data in bulk, reducing organization-wide scoring from ~3,500 queries to ~5.
- `SATechniqueRepository.find_all_with_test_counts()`: Single query with subqueries for test counts, validated test counts, and detection rule counts.
- Heatmap service uses batch-fetching techniques.

**Remaining:** Individual technique scoring (`calculate_technique_score()`) still performs per-technique queries when called in isolation. `create_snapshot()` could benefit from using the bulk method.

**Impact:** Organization score calculation reduced from seconds to sub-second. Individual technique scoring unchanged.

---

#### SR-002: In-Memory Cache Does Not Scale

**Current state:** `score_cache.py` uses a Python dict with a 300-second TTL. Each worker process has its own cache.

**Impact:** With N workers, each has a cold cache on startup and after every TTL expiration. Cache miss triggers the full org score calculation (3,500+ queries). Effectively no caching under multiple workers.

**Remediation:** Move cache to Redis. Invalidate granularly when tests or techniques change.

---

#### SR-003: Heatmap Endpoints Load All Techniques Without Pagination ✅ RESOLVED

**Current state (updated Feb 18):** The heatmap service has been extracted and optimized:

- `services/heatmap_service.py`: Dedicated service with batch-fetching techniques (pre-aggregated `test_counts`, `rule_counts` in 2 SQL subqueries instead of N+1).
- The `SATechniqueRepository.find_all_with_test_counts()` method provides a single-query alternative for scoring/heatmap use cases.
- Router reduced from ~528 lines to a thin delegation layer.

**No further action needed** for query performance. The repository method can replace direct usage in remaining endpoints.

---

### MEDIUM PRIORITY

#### SR-004: Reports Load Full Tables Into Memory

**Current state:** All 4 report endpoints load unbounded result sets:

| Endpoint | Pattern |
|----------|---------|
| `coverage-summary` | All techniques + per-technique test count query (N+1) |
| `coverage-csv` | Same as above + CSV serialization in memory |
| `test-results` | All tests, aggregated in Python |
| `remediation-status` | All tests, filtered in Python |

**Impact:** For datasets with thousands of tests, memory usage spikes.
No streaming: the entire response is built in memory before sending.

**Remediation:** Use SQL aggregations. Stream CSV output. Add date range filters as required parameters.

---

#### SR-005: Operational Metrics N+1 on Audit Logs

**Current state:** MTTD and MTTR calculations in `operational_metrics_service.py` load all validated tests, then query `AuditLog` twice per test to find state transition timestamps.

**Impact:** For 500 validated tests: 1,000 audit log queries. Grows linearly with test count.

**Remediation:** Denormalize key timestamps onto the Test model (e.g., `red_started_at`, `blue_started_at`, `remediation_completed_at`) or use a single batch audit log query with window functions.

---

#### SR-006: Missing Database Indexes ✅ RESOLVED

**Current state (updated Feb 18):** All critical indexes are now in place:

| Table | Index | Status |
|-------|-------|--------|
| `tests` | `(technique_id, state)` | ✅ Exists (model `__table_args__`) |
| `tests` | `(created_at)`, `(state, created_at)` | ✅ Added in migration `b024` |
| `techniques` | `(tactic)` | ✅ Added in migration `b026` |
| `techniques` | `(status_global)` | ✅ Added in migration `b026` |
| `audit_logs` | `(entity_type, entity_id)`, `(timestamp)`, `(entity_type, entity_id, action)` | ✅ Exists (model `__table_args__`) |
| `detection_rules` | `(mitre_technique_id)`, `(source)`, `(severity)` | ✅ Exists (model `__table_args__`) |

**No further action needed.**

---

### LOW PRIORITY

#### SR-007: Single-Instance Scheduler Constraint

**Current state:** APScheduler runs in-process. If multiple backend instances exist, each runs its own scheduler, causing duplicate MITRE syncs, duplicate snapshots, and duplicate campaign spawns.

**Impact:** No impact today (single instance), but blocks horizontal scaling.

**Remediation:** Use the APScheduler PostgreSQL JobStore together with a distributed lock (for example, a PostgreSQL advisory lock) so that only one instance fires each job. A shared job store alone may not prevent duplicate runs, because APScheduler 3.x does not coordinate multiple scheduler processes on one store. Alternatively, run the scheduler in a single dedicated process or migrate to Celery Beat.


---

#### SR-008: Evidence Presigned URLs Point to Internal Hostname

**Current state:** MinIO presigned URLs contain `minio:9000` (Docker internal hostname), which is not resolvable from the user's browser.

**Impact:** Evidence download links fail in production unless Nginx proxies MinIO or MinIO has a public endpoint.

**Remediation:** Configure a `MINIO_EXTERNAL_ENDPOINT` env var. Use it when generating presigned URLs.

---

## 3. Security Risks

### HIGH PRIORITY

#### SEC-001: In-Memory Token Blacklist ✅ RESOLVED

**Current state (updated Feb 18):** The token blacklist is now Redis-backed:

- `infrastructure/redis_client.py`: Singleton Redis connection.
- `auth.py`: `blacklist_token()` and `is_token_blacklisted()` use Redis with TTL matching token expiration.
- Shared across all workers. Survives server restarts.

**No further action needed.**

---

#### SEC-002: Default Credentials in Configuration ✅ RESOLVED

**Current state (updated Feb 18):** Production startup validation now rejects default credentials:

- `config.py`: `SECRET_KEY` validation already existed; now also checks `MINIO_ACCESS_KEY` and `MINIO_SECRET_KEY` against their defaults (`minioadmin`). Fails fast with `RuntimeError` in production mode.
- The `install.sh` script generates random passwords for production.

**No further action needed.**

---

#### SEC-003: Rate Limiting Only on Login

**Current state:** SlowAPI rate limiting is applied only to `POST /auth/login` (5/minute).
All other endpoints have no rate limits:

| Unprotected Endpoint | Risk |
|---------------------|------|
| `POST /users` | Bulk user creation |
| `POST /tests` | Resource exhaustion |
| `POST /system/sync-mitre` | Repeated expensive syncs |
| `POST /system/import-atomic-tests` | Repeated 40MB ZIP downloads |
| `POST /tests/{id}/evidence` | Large file upload flooding |
| `GET /reports/*` | Expensive report generation DoS |

**Impact:** An authenticated attacker (or compromised account) can DoS the system by triggering expensive operations repeatedly.

**Remediation:** Add tiered rate limits: strict on auth, moderate on write endpoints, relaxed on read endpoints. Add specific limits on sync/import endpoints (1/hour).

---

### MEDIUM PRIORITY

#### SEC-004: No Input Validation on Username ✅ RESOLVED

**Current state (updated Feb 18):** Username validation is now enforced:

- `schemas/user.py`: `_validate_username()` function checks: 3-50 characters, only letters/digits/underscores/hyphens, rejects reserved names (`admin`, `root`, `system`, `api`, `null`, `undefined`, `administrator`, `superuser`, `aegis`).
- Applied via `@field_validator("username")` on `UserCreate`.
- 9 unit tests covering valid, invalid, and reserved usernames.

**No further action needed.**

---

#### SEC-005: Timing-Based User Enumeration on Login ✅ RESOLVED

**Current state (updated Feb 18):** Login now uses constant-time comparison:

- `routers/auth.py`: Always runs `verify_password()` against a dummy bcrypt hash when the user is not found, ensuring consistent response time regardless of whether the username exists.

**No further action needed.**

---

#### SEC-006: Pydantic Validation Errors Leak Schema Details

**Current state:** The global validation error handler returns full Pydantic error details:

```python
content={
    "detail": "Validation error",
    "code": "VALIDATION_ERROR",
    "errors": exc.errors(),  # Full field paths, types, constraints
}
```

**Impact:** Attackers can probe endpoints to discover internal field names, types, and validation rules.

**Remediation:** Sanitize error output in production. Return field names and human-readable messages only; strip internal type information.

---

#### SEC-007: No Password Complexity Requirements ✅ RESOLVED

**Current state (updated Feb 18):** Password complexity is enforced:

- `schemas/user.py`: `_validate_password_strength()` requires: minimum 12 characters, at least one uppercase, one lowercase, one digit, one special character.
- Applied on `UserCreate`, `UserUpdate`, and `PasswordChange` schemas.
- 6 unit tests covering all complexity rules.

**No further action needed.**

---

### LOW PRIORITY

#### SEC-008: CORS Origins Not Validated in Production

**Current state:** `CORS_ORIGINS` is a comma-separated string from the environment. If set to `*` or overly broad patterns, credentials (HttpOnly cookies) are sent to unintended origins.

**Impact:** Low (requires misconfiguration), but could enable cross-origin attacks.

**Remediation:** Validate `CORS_ORIGINS` at startup: reject `*` when `AEGIS_ENV=production`.

---

#### SEC-009: No Audit Log for Failed Login Attempts

**Current state:** Successful logins are not audited. Failed logins are not audited. Only post-login actions are recorded.

**Impact:** Cannot detect brute force attacks or compromised account usage patterns.

**Remediation:** Log all login attempts (success/failure) to `audit_logs` with IP address and timestamp.

---

## 4. Maintainability Risks

### HIGH PRIORITY

#### MR-001: No Dependency Inversion (Everything Points to Concrete Implementations) ✅ PARTIALLY RESOLVED

**Current state (updated Feb 18):** Protocol interfaces and dependency injection now exist for core entities:

- `domain/ports/repositories/technique_repository.py`: `@runtime_checkable` Protocol.
- `domain/ports/repositories/test_repository.py`: `@runtime_checkable` Protocol.
- `dependencies/repositories.py`: FastAPI `Depends()` wiring for `SATechniqueRepository` and `SATestRepository`.
- `domain/unit_of_work.py`: `UnitOfWork` context manager for transaction control.

**Remaining:** Services like `notification_service`, `audit_service`, `scoring_service` still use direct imports. Additional ports are needed for storage, notifications, and event bus.

**Impact:** New code follows DIP. Old code will be migrated incrementally.

---

#### MR-002: Two Coexisting Architectural Patterns

**Current state:** Some routers delegate to services, others do everything inline. A developer cannot predict where to find or place logic.

| Pattern | Routers |
|---------|---------|
| Delegates to services | tests, scores, notifications, campaigns, snapshots |
| Direct DB queries | techniques, evidence, users, audit, reports, heatmap, metrics, detection_rules, threat_actors, data_sources, compliance |

**Impact:** Inconsistent codebase. New developers learn one pattern and find the other. Code reviews cannot enforce a single standard.

**Remediation:** Establish a single pattern (all through services/use cases) and migrate incrementally.

---

### MEDIUM PRIORITY

#### MR-003: No Type Checking Enforcement

**Current state:** `tsconfig.json` has `strict: true` for the frontend, but the backend has no `mypy` configuration. Python type hints exist but are never verified.

**Impact:** Type errors in Python code go undetected until runtime. Particularly risky for Optional fields and JSONB data.

**Remediation:** Add `mypy` to requirements.
Create `mypy.ini` with strict settings. Add to CI pipeline.

---

#### MR-004: Test Infrastructure Uses SQLite Instead of PostgreSQL

**Current state:** `conftest.py` creates an in-memory SQLite database and patches PostgreSQL-specific types (UUID → String, JSONB → JSON):

```python
from sqlalchemy.dialects.postgresql import UUID as PG_UUID, JSONB

# Patch to use SQLite-compatible types
sqlalchemy.dialects.postgresql.UUID = _patched_uuid
sqlalchemy.dialects.postgresql.JSONB = _patched_jsonb
```

**Impact:** Tests pass on SQLite but may fail on real PostgreSQL. JSONB-specific queries (containment `@>`, GIN indexes) are untestable. UUID behavior differs between dialects.

**Remediation:** Use `testcontainers-python` to spin up a real PostgreSQL container for tests, or use PostgreSQL in CI.

---

#### MR-005: Frontend Types Not Generated from Backend Schemas

**Current state:** `types/models.ts` is manually maintained and must stay in sync with `schemas/*.py`. There is no code generation or validation step.

**Impact:** Type drift between frontend and backend. A backend schema change that isn't reflected in `types/models.ts` causes runtime errors in the frontend.

**Remediation:** Generate TypeScript types from the OpenAPI spec (`openapi-typescript` or similar). Run as a pre-build step.

---

### LOW PRIORITY

#### MR-006: Documentation Scattered Across Multiple Formats

**Current state:** Documentation exists in `README.md`, `docs/API.md`, `docs/ARCHITECTURE.md`, `docs/DATA_SOURCES.md`, `docs/SCORING.md`, plus the new analysis documents. No central index or documentation site.

**Impact:** New developers must discover docs by browsing. No searchable documentation.

**Remediation:** Create a `docs/INDEX.md` linking all documents. Consider MkDocs or similar for a browsable doc site.

---

#### MR-007: No Conventional Commit or Changelog

**Current state:** No commit message convention enforced. No CHANGELOG file.

**Impact:** Difficult to understand what changed between releases.
No automated release notes.

**Remediation:** Adopt Conventional Commits. Add commitlint as a pre-commit hook. Generate CHANGELOG automatically.

---

## 5. Recommended Medium-Term Improvements

### Architecture

| ID | Improvement | Effort | Impact | Status |
|----|-------------|--------|--------|--------|
| IMP-001 | Extract domain exceptions + error handler middleware | 2-3 days | Removes FastAPI dependency from services | ✅ Done |
| IMP-002 | Create repository layer for Test, Technique, Campaign | 1 week | Centralizes queries, enables caching and mocking | ✅ Done (Test, Technique) |
| IMP-003 | Extract heatmap/reports/metrics logic to application services | 1-2 weeks | Thin controllers, testable business logic | ✅ Heatmap done |
| IMP-004 | Persist scoring weights in database | 2-3 days | Eliminates mutable global state | Pending |
| IMP-005 | Add domain entities with behavior (rich models) | 2-3 weeks | Consolidates scattered business rules | ✅ Done (Test, Technique) |

### Scalability

| ID | Improvement | Effort | Impact | Status |
|----|-------------|--------|--------|--------|
| IMP-006 | Batch scoring queries (single SQL per metric) | 1 week | Reduces org score from 3,500 queries to ~10 | ✅ Done |
| IMP-007 | Add missing composite indexes | 1 day | Immediate query performance improvement | ✅ Done |
| IMP-008 | Move score cache to Redis | 2-3 days | Shared cache across workers | Pending |
| IMP-009 | Batch heatmap metadata queries | 2-3 days | Reduces heatmap from 1,400 to 3 queries | ✅ Done |
| IMP-010 | Denormalize MTTD/MTTR timestamps onto Test model | 3-5 days | Eliminates operational metrics N+1 | Pending |

### Security

| ID | Improvement | Effort | Impact | Status |
|----|-------------|--------|--------|--------|
| IMP-011 | Move token blacklist to Redis | 1-2 days | Fixes multi-instance logout | ✅ Done |
| IMP-012 | Reject default credentials in production | 0.5 days | Prevents insecure deployments | ✅ Done |
| IMP-013 | Add rate limiting to write/sync endpoints | 1 day | Prevents DoS from authenticated users | Pending |
| IMP-014 | Add password complexity validation | 0.5 days | Prevents weak passwords | ✅ Done |
| IMP-015 | Add login attempt auditing | 1 day | Enables brute force detection | Pending |

### DevOps

| ID | Improvement | Effort | Impact | Status |
|----|-------------|--------|--------|--------|
| IMP-016 | Create GitHub Actions CI pipeline | 1-2 days | Automated lint + type check + test | ✅ Done |
| IMP-017 | Add mypy strict type checking | 1-2 days | Catches type errors before runtime | Pending |
| IMP-018 | Replace SQLite test DB with PostgreSQL (testcontainers) | 1 day | Tests match production behavior | ✅ CI uses PG |
| IMP-019 | Generate frontend types from OpenAPI | 0.5 days | Eliminates frontend/backend type drift | Pending |
| IMP-020 | Add structured JSON logging | 1-2 days | Production-ready observability | Pending |

---

## 6. Priority Matrix

### Immediate (Sprint 1, Weeks 1-2)

| ID | Item | Category | Effort |
|----|------|----------|--------|
| SEC-001 | Move token blacklist to Redis | Security | 1-2 days |
| SEC-002 | Reject default credentials in production | Security | 0.5 days |
| SR-006 | Add missing database indexes | Scalability | 1 day |
| TD-007 | Replace `except: pass` with logging in workflow service | Tech Debt | 0.5 days |
| SEC-007 | Add password complexity requirements | Security | 0.5 days |
| IMP-016 | Create basic CI pipeline | DevOps | 1-2 days |

**Total estimated effort: ~5-7 days**

---

### Short-Term (Sprints 2-3, Weeks 3-6)

| ID | Item | Category | Effort |
|----|------|----------|--------|
| TD-003 | Extract domain exceptions, remove HTTPException from services | Tech Debt | 2-3 days |
| SR-001 | Batch scoring queries to eliminate N+1 | Scalability | 1 week |
| SR-003 | Batch heatmap metadata queries | Scalability | 2-3 days |
| TD-002 | Create repository layer for core entities | Tech Debt | 1 week |
| SEC-003 | Add rate limiting to write/sync endpoints | Security | 1 day |
| TD-004 | Persist scoring weights in database | Tech Debt | 2-3 days |
| SEC-009 | Add login attempt auditing | Security | 1 day |
| IMP-008 | Move score cache to Redis | Scalability | 2-3 days |

**Total estimated effort: ~3-4 weeks**

---

### Medium-Term (Months 2-3)

| ID | Item | Category | Effort |
|----|------|----------|--------|
| TD-001 | Extract heatmap/reports/metrics to application services | Tech Debt | 2-3 weeks |
| TD-008 | Expand test coverage to all routers and services | Maintainability | 2-3 weeks |
| TD-005 | Create rich domain entities (Clean Architecture Phase 2) | Tech Debt | 2-3 weeks |
| SR-004 | Optimize report endpoints (SQL aggregations, streaming) | Scalability | 1 week |
| SR-005 | Denormalize MTTD/MTTR timestamps | Scalability | 3-5 days |
| MR-003 | Add mypy type checking | Maintainability | 1-2 days |
| MR-004 | Replace SQLite tests with PostgreSQL | Maintainability | 1 day |
| MR-005 | Generate frontend types from OpenAPI | Maintainability | 0.5 days |
| IMP-020 | Structured JSON logging | DevOps | 1-2 days |

**Total estimated effort: ~6-8 weeks**

---

### Low Priority (Backlog)

| ID | Item | Category | Effort |
|----|------|----------|--------|
| TD-011 | Add retry logic to entrypoint scripts | Tech Debt | 0.5 days |
| TD-012 | Add migration tests against real PostgreSQL | Maintainability | 1 day |
| SR-007 | APScheduler PostgreSQL JobStore for horizontal scaling | Scalability | 2-3 days |
| SR-008 | Fix MinIO presigned URL hostname for production | Scalability | 1 day |
| SEC-008 | Validate CORS origins in production | Security | 0.5 days |
| SEC-005 | Constant-time login to prevent user enumeration | Security | 0.5 days |
| SEC-006 | Sanitize Pydantic validation errors in production | Security | 1 day |
| MR-006 | Create documentation index / MkDocs site | Maintainability | 1-2 days |
| MR-007 | Adopt Conventional Commits + CHANGELOG | Maintainability | 1 day |
| SEC-004 | Username input validation | Security | 0.5 days |


---

## Summary Scorecard

```
┌────────────────────────────────┬──────┬─────┬─────┬──────┬──────────┬──────┐
│ Category                       │ High │ Med │ Low │ Total│ Resolved │ Open │
├────────────────────────────────┼──────┼─────┼─────┼──────┼──────────┼──────┤
│ Technical Debt                 │  5   │  5  │  2  │  12  │    4     │  8   │
│ Scalability Risks              │  3   │  3  │  2  │  8   │    3     │  5   │
│ Security Risks                 │  3   │  4  │  2  │  9   │    5     │  4   │
│ Maintainability Risks          │  2   │  3  │  2  │  7   │    1     │  6   │
├────────────────────────────────┼──────┼─────┼─────┼──────┼──────────┼──────┤
│ TOTAL                          │  13  │ 15  │  8  │  36  │    13    │  23  │
└────────────────────────────────┴──────┴─────┴─────┴──────┴──────────┴──────┘

Resolved: 13 of 36 items (36%), counting partially resolved items as resolved
Remaining estimated effort: ~7-9 weeks (down from ~12-15)
```