# Aegis — Technical Debt, Risks & Improvement Plan
> **Author:** Architecture review
> **Date:** February 11, 2026 (updated February 18, 2026)
> **Scope:** Backend, Frontend, Infrastructure, Security, Scalability, Maintainability
>
> **Note:** Items marked with ✅ have been resolved. See inline status annotations.
---
## Table of Contents
1. [Technical Debt](#1-technical-debt)
2. [Scalability Risks](#2-scalability-risks)
3. [Security Risks](#3-security-risks)
4. [Maintainability Risks](#4-maintainability-risks)
5. [Recommended Medium-Term Improvements](#5-recommended-medium-term-improvements)
6. [Priority Matrix](#6-priority-matrix)
---
## 1. Technical Debt
### HIGH PRIORITY
#### TD-001: Fat Controllers (Routers with Embedded Business Logic)
**Current state:** 11 of 21 routers execute raw SQLAlchemy queries directly. The worst offenders:
| Router | Lines | Embedded Logic |
|--------|-------|----------------|
| `heatmap.py` | 528 | Query building + color mapping + ATT&CK Navigator JSON serialization + export |
| `tests.py` | 664 | CRUD + template instantiation + timeline queries (workflow delegated) |
| `reports.py` | 273 | Aggregation queries + CSV generation + JSON formatting |
| `compliance.py` | ~350 | CRUD + import + gap analysis + CSV export |
| `metrics.py` | ~316 | Complex aggregation queries with in-memory processing |
**Impact:** Cannot unit test business logic without spinning up FastAPI + DB. Logic duplication across routers. Changes to one query pattern must be replicated manually in every router that uses it.
**Remediation:** Extract query logic to service/repository layer. Each router endpoint should be < 20 lines.
---
#### TD-002: No Repository Layer — Scattered Duplicate Queries ✅ PARTIALLY RESOLVED
**Current state (updated Feb 18):** Repository ports (Protocol interfaces) and SQLAlchemy implementations now exist for `Technique` and `Test`:
- `domain/ports/repositories/technique_repository.py` — Protocol with `find_by_id()`, `find_by_mitre_id()`, `list_all()`, `count_by_status()`, `find_all_with_test_counts()`, `save()`, etc.
- `domain/ports/repositories/test_repository.py` — Protocol with `find_by_id()`, `list_by_technique()`, `get_states_and_results_for_technique()`, etc.
- `infrastructure/persistence/repositories/sa_technique_repository.py` — Concrete implementation with batch queries (eliminates N+1 for heatmap/scoring).
- `infrastructure/persistence/repositories/sa_test_repository.py` — Concrete implementation.
- `dependencies/repositories.py` — FastAPI `Depends()` wiring.
**Remaining:** Old routers still use direct `db.query()`. New endpoints should use repositories; existing endpoints will be migrated incrementally.
**Impact:** New code has centralized query management. Old queries still scattered but coexist safely.
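A minimal sketch of how a port and a test double might fit together, assuming a simplified `Technique` shape. The `InMemoryTechniqueRepository` class and the `technique_name_or_default()` function are hypothetical and only illustrate the pattern; the real Protocol exposes more methods than the two shown.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class Technique:
    # Simplified stand-in for the real entity; the fields are assumptions.
    id: int
    mitre_id: str
    name: str


class TechniqueRepository(Protocol):
    # Only two of the methods named above are sketched here.
    def find_by_mitre_id(self, mitre_id: str) -> Optional[Technique]: ...
    def list_all(self) -> list[Technique]: ...


class InMemoryTechniqueRepository:
    """Hypothetical test double; it satisfies the Protocol structurally."""

    def __init__(self, techniques: list[Technique]) -> None:
        self._by_mitre_id = {t.mitre_id: t for t in techniques}

    def find_by_mitre_id(self, mitre_id: str) -> Optional[Technique]:
        return self._by_mitre_id.get(mitre_id)

    def list_all(self) -> list[Technique]:
        return list(self._by_mitre_id.values())


def technique_name_or_default(repo: TechniqueRepository, mitre_id: str) -> str:
    # Business logic depends on the port, not on a SQLAlchemy session.
    technique = repo.find_by_mitre_id(mitre_id)
    return technique.name if technique else "unknown technique"
```

Because the function only sees the Protocol, a unit test can pass the in-memory double and never touch FastAPI or a database.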
---
#### TD-003: Services Depend on FastAPI (HTTPException in Domain Logic) ✅ RESOLVED
**Current state (updated Feb 18):** Domain exceptions have been implemented and are in active use:
- `domain/errors.py` — Full exception hierarchy: `DomainError`, `EntityNotFoundError`, `DuplicateEntityError`, `InvalidStateTransition`, `BusinessRuleViolation`, `InvalidOperationError`, `PermissionViolation`.
- `domain/exceptions.py` — Backward-compatible re-exports.
- `middleware/error_handler.py` — Maps domain exceptions to HTTP responses automatically (404, 409, 400, 403).
- `test_workflow_service.py` — Now raises `InvalidOperationError` and `InvalidStateTransition` instead of `HTTPException`.
**No further action needed** for the core services. Some secondary routers may still raise `HTTPException` directly (which is acceptable at the presentation layer).
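The mapping idea can be sketched without FastAPI. The class names below mirror part of the hierarchy listed above, but the pairing of each exception with a status code and the `to_http_response()` helper are assumptions for illustration; the actual middleware may differ.

```python
class DomainError(Exception):
    """Base class for domain failures."""


class EntityNotFoundError(DomainError): ...
class DuplicateEntityError(DomainError): ...
class InvalidStateTransition(DomainError): ...
class PermissionViolation(DomainError): ...


# Assumed pairing of exception type to HTTP status code.
STATUS_BY_ERROR: dict[type, int] = {
    EntityNotFoundError: 404,
    DuplicateEntityError: 409,
    InvalidStateTransition: 400,
    PermissionViolation: 403,
}


def to_http_response(exc: DomainError) -> tuple[int, dict[str, str]]:
    # Walk the MRO so subclasses inherit their parent's status code.
    for cls in type(exc).__mro__:
        if cls in STATUS_BY_ERROR:
            return STATUS_BY_ERROR[cls], {"detail": str(exc), "code": cls.__name__}
    # Unmapped domain errors fall back to a generic client error.
    return 400, {"detail": str(exc), "code": "DomainError"}
```

Services raise plain Python exceptions and stay importable without the web framework; only the presentation layer knows about status codes.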
---
#### TD-004: Mutable Global Settings at Runtime
**Current state:** The `scores.py` router mutates `settings` directly:
```python
settings.SCORING_WEIGHT_TESTS = body.weight_tests
settings.SCORING_WEIGHT_DETECTION_RULES = body.weight_detection_rules
```
**Impact:** Changes lost on restart. Thread-unsafe with multiple workers. No audit trail for config changes.
**Remediation:** Persist scoring weights in the database. Create a `ScoringConfig` table. Load weights from DB in scoring_service.
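One way the persistence could look, sketched with the standard-library `sqlite3` module so it runs anywhere. The `scoring_config` table layout, the key names, and the default values are assumptions; the production version would presumably be a SQLAlchemy model plus an Alembic migration.

```python
import sqlite3

# Illustrative defaults only; the real weights are not stated in this document.
DEFAULT_WEIGHTS = {"tests": 60, "detection_rules": 40}


def init_schema(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scoring_config ("
        " key TEXT PRIMARY KEY, weight INTEGER NOT NULL)"
    )


def save_weights(conn: sqlite3.Connection, weights: dict[str, int]) -> None:
    # INSERT OR REPLACE overwrites an existing key instead of duplicating it.
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO scoring_config (key, weight) VALUES (?, ?)",
            list(weights.items()),
        )


def load_weights(conn: sqlite3.Connection) -> dict[str, int]:
    # Stored values override defaults, so a fresh database still works.
    rows = conn.execute("SELECT key, weight FROM scoring_config").fetchall()
    return {**DEFAULT_WEIGHTS, **dict(rows)}
```

Reading weights from the database on each scoring run (or via a shared cache) means every worker sees the same values and changes survive a restart.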
---
#### TD-005: Anemic Domain Models ✅ PARTIALLY RESOLVED
**Current state (updated Feb 18):** Rich domain entities now exist alongside the ORM models:
- `domain/test_entity.py`: `TestEntity` dataclass with full state machine (`can_transition()`, `transition_to()`, `start_execution()`, `submit_red_evidence()`, `submit_blue_evidence()`, `validate()`, `reopen()`), dual validation, pause/resume timers, and domain events. Comprehensive unit tests (46 tests).
- `domain/entities/technique.py`: `TechniqueEntity` with `recalculate_status()`, `mark_reviewed()`, `flag_for_review()`, `create()`, `from_orm()`/`apply_to()`. Comprehensive unit tests (16 tests).
- `domain/value_objects/mitre_id.py` — Immutable value object with ATT&CK ID validation.
- `domain/value_objects/scoring_weights.py` — Immutable weight set enforcing sum-to-100.
**ORM models remain anemic** (by design — they are persistence mapping only). Business logic lives in domain entities, bridged via `from_orm()`/`apply_to()`.
**Remaining:** Campaign, ComplianceFramework, and other entities still lack domain entity counterparts.
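For entities that still lack a domain counterpart, the existing value objects are a reasonable template. The sketch below shows what an immutable ATT&CK ID value object could look like; it is not the code in `mitre_id.py`, and the `is_subtechnique` property is a hypothetical addition.

```python
import re
from dataclasses import dataclass

# ATT&CK technique IDs look like T1059 or, for sub-techniques, T1059.001.
_MITRE_ID_RE = re.compile(r"T\d{4}(\.\d{3})?")


@dataclass(frozen=True)
class MitreId:
    value: str

    def __post_init__(self) -> None:
        # Validation runs at construction, so an invalid ID can never exist.
        if not _MITRE_ID_RE.fullmatch(self.value):
            raise ValueError(f"invalid ATT&CK technique ID: {self.value!r}")

    @property
    def is_subtechnique(self) -> bool:
        return "." in self.value
```

`frozen=True` makes instances hashable and rejects attribute assignment, which is what keeps the value object immutable.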
---
### MEDIUM PRIORITY
#### TD-006: Inconsistent Error Response Format
**Current state:** API error responses use three different formats:
| Format | Used In |
|--------|---------|
| `detail: "string"` | Most routers (`techniques.py`, `users.py`, `evidence.py`) |
| `detail: {message, code, ...}` | `tests.py`, `test_workflow_service.py` |
| `detail: "Validation error", code: "VALIDATION_ERROR", errors: [...]` | Global handler in `main.py` |
**Impact:** Frontend must handle multiple error shapes. No reliable error code for programmatic handling.
**Remediation:** Standardize all errors to `{detail: string, code: string, errors?: [...]}`.
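A small sketch of the target envelope and a converter for the two legacy `detail` shapes in the table above. Both function names and the `"ERROR"` fallback code are assumptions made for illustration.

```python
from typing import Any, Optional


def error_body(detail: str, code: str, errors: Optional[list] = None) -> dict[str, Any]:
    """Build the single target shape: {detail, code, errors?}."""
    body: dict[str, Any] = {"detail": detail, "code": code}
    if errors:
        body["errors"] = errors
    return body


def normalize_detail(detail: Any, default_code: str = "ERROR") -> dict[str, Any]:
    """Convert a legacy `detail` value into the target shape."""
    if isinstance(detail, str):  # legacy shape: detail: "string"
        return error_body(detail, default_code)
    if isinstance(detail, dict):  # legacy shape: detail: {message, code, ...}
        return error_body(
            str(detail.get("message", "")),
            str(detail.get("code", default_code)),
        )
    raise TypeError(f"unsupported detail type: {type(detail).__name__}")
```

Routing every error through one builder gives the frontend a `code` field it can always rely on.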
---
#### TD-007: Silently Swallowed Exceptions in Workflow Service
**Current state:** `test_workflow_service.py` has 4 bare `except Exception: pass` blocks:
| Line | What is swallowed |
|------|-------------------|
| 106 | `notify_test_state_change()` failure |
| 286 | Notification failure |
| 295 | Notification failure |
| 299 | Score cache invalidation failure |
**Impact:** Notification failures and cache invalidation errors go completely unnoticed. Users may miss critical workflow notifications with no trace in logs.
**Remediation:** Replace `pass` with `logger.warning(...)` at minimum. Consider async event dispatch so failures don't block the main flow.
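The minimal fix could be as small as the helper below, which keeps the main flow alive but records the failure. The helper name and the logger name are hypothetical.

```python
import logging
from typing import Callable

logger = logging.getLogger("aegis.workflow")  # logger name is an assumption


def run_side_effect(description: str, action: Callable[[], None]) -> bool:
    """Run a non-critical side effect and log failures instead of hiding them."""
    try:
        action()
        return True
    except Exception:
        # exc_info=True attaches the traceback to the log record.
        logger.warning("Side effect failed: %s", description, exc_info=True)
        return False
```

Each of the four `except Exception: pass` sites could then call `run_side_effect(...)` with a short description of what was attempted.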
---
#### TD-008: Test Suite Gaps
**Current state:** ~167 test functions across 18 files, but coverage is uneven:
| Category | Covered | Not Covered |
|----------|---------|-------------|
| **Routers** | auth, techniques, tests, evidence, test_templates, metrics, system | audit, campaigns, compliance, d3fend, detection_rules, heatmap, operational_metrics, scores, snapshots, threat_actors, users |
| **Services** | workflow, status, atomic_import, campaign, scoring, notifications | audit, caldera, compliance_import, d3fend, elastic, intel, lolbas, mitre_sync, score_cache, sigma, threat_actor_import |
Four integration tests (the Sigma, LOLBAS, CALDERA, and Elastic full imports) are skipped by default via `pytest.skip`.
Some tests use `inspect.getsource()` to verify code structure rather than calling the endpoints.
**Impact:** Regressions in untested routers/services go undetected. No security-focused tests (injection, rate limiting, CSRF).
**Remediation:** Add integration tests for all routers. Add dedicated security test suite. Run skipped integration tests in CI.
---
#### TD-009: No CI/CD Pipeline ✅ RESOLVED
**Current state (updated Feb 18):** A fully functional CI pipeline exists at `.github/workflows/ci.yml`:
- Runs `ruff` linting on every push/PR.
- Runs `pytest` against a real PostgreSQL + Redis service container.
- Tests run against the same stack as production (not SQLite).
Additionally, `scripts/agent_validate_backend.sh` provides a local validation script that runs lint + tests inside the Docker container.
**No further action needed** for basic CI. Potential enhancements: add `mypy` type checking, Docker build verification.
---
#### TD-010: Unstructured Logging
**Current state:** Logging uses plain format strings with no structured fields:
```python
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)-8s %(name)s%(message)s",
)
```
Global exception handlers use `logging.error(f"...")` instead of a logger instance. No request ID, user ID, or correlation ID in log output.
**Impact:** Cannot query logs for "all actions by user X" or "all errors in request Y". Log analysis in production requires manual grep.
**Remediation:** Add structured JSON logging (e.g., `structlog` or `python-json-logger`). Include request_id middleware.
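Before adopting a library, the idea can be prototyped with the standard library alone. This sketch assumes a middleware (not shown) that sets `request_id_var` per request; the formatter class and the field names are illustrative, not an existing Aegis module.

```python
import contextvars
import json
import logging

# Set per request by a hypothetical middleware; "-" means no request context.
request_id_var = contextvars.ContextVar("request_id", default="-")


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "request_id": request_id_var.get(),
        }
        if record.exc_info:
            payload["exc"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def configure_logging() -> None:
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    # force=True replaces handlers installed by an earlier basicConfig call.
    logging.basicConfig(level=logging.INFO, handlers=[handler], force=True)
```

With one JSON object per line, "all errors in request Y" becomes a field filter in the log backend rather than a grep.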
---
### LOW PRIORITY
#### TD-011: Entrypoint Scripts Have No Retry Logic
**Current state:** Both `entrypoint.sh` and `entrypoint.prod.sh` use `set -e`. If `alembic upgrade head` or `python -m app.seed` fails, Uvicorn never starts. No retry, no clear error message.
**Impact:** Transient DB connection failures during container startup cause the backend to fail permanently until manually restarted (Docker `restart: always` will retry, but seed may fail repeatedly).
**Remediation:** Add retry loop for migration with backoff. Make seed idempotent and non-fatal.
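The retry logic is sketched here in Python to match the rest of the codebase; the entrypoints themselves are shell scripts, so the same loop would need translating or the scripts would call a Python helper. `retry_with_backoff()` and `migrate()` are hypothetical names.

```python
import subprocess
import time
from typing import Callable


def retry_with_backoff(
    action: Callable[[], bool],
    attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `action` until it returns True, doubling the delay after each failure."""
    for attempt in range(1, attempts + 1):
        if action():
            return True
        if attempt < attempts:
            delay = base_delay * 2 ** (attempt - 1)
            print(f"attempt {attempt}/{attempts} failed; retrying in {delay:.0f}s")
            sleep(delay)
    return False


def migrate() -> bool:
    # Same command the entrypoint runs today.
    return subprocess.run(["alembic", "upgrade", "head"]).returncode == 0
```

An entrypoint could then exit non-zero with a clear message when `retry_with_backoff(migrate)` returns `False`, and treat a seed failure as a warning rather than a fatal error.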
---
#### TD-012: No Database Migration Tests
**Current state:** Alembic migrations (18 versions) are never tested in isolation. The test suite uses in-memory SQLite with tables created from models, bypassing Alembic entirely.
**Impact:** Migration scripts may fail on real PostgreSQL (different dialect, JSONB handling) despite tests passing on SQLite.
**Remediation:** Add a CI step that runs `alembic upgrade head` against a real PostgreSQL container.
---
## 2. Scalability Risks
### HIGH PRIORITY
#### SR-001: N+1 Query Explosion in Scoring Engine ✅ PARTIALLY RESOLVED
**Current state (updated Feb 18):** The worst N+1 patterns have been addressed:
- `scoring_service.py`: `bulk_technique_scores()` performs 5 aggregated subqueries to fetch all scoring data in bulk, reducing organization-wide scoring from ~3,500 queries to ~5.
- `SATechniqueRepository.find_all_with_test_counts()` — Single query with subqueries for test counts, validated test counts, and detection rule counts.
- Heatmap service uses batch-fetching techniques.
**Remaining:** Individual technique scoring (`calculate_technique_score()`) still performs per-technique queries when called in isolation. `create_snapshot()` could benefit from using the bulk method.
**Impact:** Organization score calculation reduced from seconds to sub-second. Individual technique scoring unchanged.
---
#### SR-002: In-Memory Cache Does Not Scale
**Current state:** `score_cache.py` uses a Python dict with 300-second TTL. Each worker process has its own cache.
**Impact:** With N workers, each has a cold cache on startup and after every TTL expiration. Cache miss triggers the full org score calculation (3,500+ queries). Effectively no caching under multiple workers.
**Remediation:** Move cache to Redis. Invalidate granularly when tests or techniques change.
---
#### SR-003: Heatmap Endpoints Load All Techniques Without Pagination ✅ RESOLVED
**Current state (updated Feb 18):** The heatmap service has been extracted and optimized:
- `services/heatmap_service.py` — Dedicated service with batch-fetching techniques (pre-aggregated `test_counts`, `rule_counts` in 2 SQL subqueries instead of N+1).
- The `SATechniqueRepository.find_all_with_test_counts()` method provides a single-query alternative for scoring/heatmap use cases.
- Router reduced from ~528 lines to a thin delegation layer.
**No further action needed** for query performance. The repository method can replace direct usage in remaining endpoints.
---
### MEDIUM PRIORITY
#### SR-004: Reports Load Full Tables Into Memory
**Current state:** All 4 report endpoints load unbounded result sets:
| Endpoint | Pattern |
|----------|---------|
| `coverage-summary` | All techniques + per-technique test count query (N+1) |
| `coverage-csv` | Same as above + CSV serialization in memory |
| `test-results` | All tests, aggregated in Python |
| `remediation-status` | All tests, filtered in Python |
**Impact:** For datasets with thousands of tests, memory usage spikes. No streaming — entire response built in memory before sending.
**Remediation:** Use SQL aggregations. Stream CSV output. Add date range filters as required parameters.
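A sketch of both ideas together: one grouped query instead of a per-technique count, and a generator that emits CSV one row at a time. It uses `sqlite3` so it is runnable as-is; the table and column names (`techniques.mitre_id`, `tests.technique_id`) are assumptions about the schema.

```python
import csv
import io
import itertools
import sqlite3
from typing import Iterable, Iterator


def coverage_rows(conn: sqlite3.Connection) -> sqlite3.Cursor:
    # One grouped query replaces the per-technique count (N+1) pattern.
    return conn.execute(
        "SELECT t.mitre_id, COUNT(x.id) AS test_count "
        "FROM techniques t LEFT JOIN tests x ON x.technique_id = t.id "
        "GROUP BY t.id, t.mitre_id ORDER BY t.mitre_id"
    )


def stream_csv(rows: Iterable, header: list[str]) -> Iterator[str]:
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    for row in itertools.chain([header], rows):
        writer.writerow(row)
        yield buffer.getvalue()
        # Reset the buffer so memory use stays constant per row.
        buffer.seek(0)
        buffer.truncate(0)
```

The generator could be handed to FastAPI's `StreamingResponse` so the response starts before the full result set has been read.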
---
#### SR-005: Operational Metrics N+1 on Audit Logs
**Current state:** MTTD and MTTR calculations in `operational_metrics_service.py` load all validated tests, then query `AuditLog` twice per test to find state transition timestamps.
**Impact:** For 500 validated tests: 1,000 audit log queries. Grows linearly with test count.
**Remediation:** Denormalize key timestamps onto the Test model (e.g., `red_started_at`, `blue_started_at`, `remediation_completed_at`) or use a single batch audit log query with window functions.
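The batch alternative can be sketched with a grouped aggregate, which is a simpler stand-in for the window-function variant mentioned above. Column names follow the `audit_logs` indexes listed under SR-006; the `'test'` entity type and the action names used in the usage example are assumptions.

```python
import sqlite3


def first_transition_times(
    conn: sqlite3.Connection, test_ids: list[int]
) -> dict[int, dict[str, str]]:
    """Return {test_id: {action: earliest_timestamp}} using a single query."""
    if not test_ids:
        return {}
    placeholders = ",".join("?" * len(test_ids))
    rows = conn.execute(
        "SELECT entity_id, action, MIN(timestamp) FROM audit_logs "
        f"WHERE entity_type = 'test' AND entity_id IN ({placeholders}) "
        "GROUP BY entity_id, action",
        test_ids,
    )
    result: dict[int, dict[str, str]] = {}
    for entity_id, action, earliest in rows:
        result.setdefault(entity_id, {})[action] = earliest
    return result
```

For 500 validated tests this issues one query instead of 1,000; MTTD and MTTR can then be computed in Python from the returned timestamps.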
---
#### SR-006: Missing Database Indexes ✅ RESOLVED
**Current state (updated Feb 18):** All critical indexes are now in place:
| Table | Index | Status |
|-------|-------|--------|
| `tests` | `(technique_id, state)` | ✅ Exists (model `__table_args__`) |
| `tests` | `(created_at)`, `(state, created_at)` | ✅ Added in migration `b024` |
| `techniques` | `(tactic)` | ✅ Added in migration `b026` |
| `techniques` | `(status_global)` | ✅ Added in migration `b026` |
| `audit_logs` | `(entity_type, entity_id)`, `(timestamp)`, `(entity_type, entity_id, action)` | ✅ Exists (model `__table_args__`) |
| `detection_rules` | `(mitre_technique_id)`, `(source)`, `(severity)` | ✅ Exists (model `__table_args__`) |
**No further action needed.**
---
### LOW PRIORITY
#### SR-007: Single-Instance Scheduler Constraint
**Current state:** APScheduler runs in-process. If multiple backend instances exist, each runs its own scheduler — causing duplicate MITRE syncs, duplicate snapshots, duplicate campaign spawns.
**Impact:** No impact today (single instance), but blocks horizontal scaling.
**Remediation:** Use APScheduler PostgreSQL JobStore for distributed locking. Or migrate to Celery Beat.
---
#### SR-008: Evidence Presigned URLs Point to Internal Hostname
**Current state:** MinIO presigned URLs contain `minio:9000` (Docker internal hostname), which is not resolvable from the user's browser.
**Impact:** Evidence download links fail in production unless Nginx proxies MinIO or MinIO has a public endpoint.
**Remediation:** Configure `MINIO_EXTERNAL_ENDPOINT` env var. Use it when generating presigned URLs.
---
## 3. Security Risks
### HIGH PRIORITY
#### SEC-001: In-Memory Token Blacklist ✅ RESOLVED
**Current state (updated Feb 18):** The token blacklist is now Redis-backed:
- `infrastructure/redis_client.py` — Singleton Redis connection.
- `auth.py`: `blacklist_token()` and `is_token_blacklisted()` use Redis with TTL matching token expiration.
- Shared across all workers. Survives server restarts.
**No further action needed.**
---
#### SEC-002: Default Credentials in Configuration ✅ RESOLVED
**Current state (updated Feb 18):** Production startup validation now rejects default credentials:
- `config.py`: `SECRET_KEY` validation already existed; now also checks `MINIO_ACCESS_KEY` and `MINIO_SECRET_KEY` against their defaults (`minioadmin`). Fails fast with `RuntimeError` in production mode.
- The `install.sh` script generates random passwords for production.
**No further action needed.**
---
#### SEC-003: Rate Limiting Only on Login
**Current state:** SlowAPI rate limiting is applied only to `POST /auth/login` (5/minute). All other endpoints have no rate limits:
| Unprotected Endpoint | Risk |
|---------------------|------|
| `POST /users` | Bulk user creation |
| `POST /tests` | Resource exhaustion |
| `POST /system/sync-mitre` | Repeated expensive syncs |
| `POST /system/import-atomic-tests` | Repeated 40MB ZIP downloads |
| `POST /tests/{id}/evidence` | Large file upload flooding |
| `GET /reports/*` | Expensive report generation DoS |
**Impact:** An authenticated attacker (or compromised account) can DoS the system by triggering expensive operations repeatedly.
**Remediation:** Add tiered rate limits: strict on auth, moderate on write endpoints, relaxed on read endpoints. Add specific limits on sync/import endpoints (1/hour).
---
### MEDIUM PRIORITY
#### SEC-004: No Input Validation on Username ✅ RESOLVED
**Current state (updated Feb 18):** Username validation is now enforced:
- `schemas/user.py`: `_validate_username()` function checks: 3-50 characters, only letters/digits/underscores/hyphens, rejects reserved names (`admin`, `root`, `system`, `api`, `null`, `undefined`, `administrator`, `superuser`, `aegis`).
- Applied via `@field_validator("username")` on `UserCreate`.
- 9 unit tests covering valid, invalid, and reserved usernames.
**No further action needed.**
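For reference, the rules listed above could be expressed roughly as follows. This is a standalone sketch, not the code in `schemas/user.py`; restricting "letters" to ASCII and comparing reserved names case-insensitively are both assumptions.

```python
import re

RESERVED_USERNAMES = frozenset({
    "admin", "root", "system", "api", "null",
    "undefined", "administrator", "superuser", "aegis",
})

# 3-50 characters drawn from ASCII letters, digits, underscores, and hyphens.
_USERNAME_RE = re.compile(r"[A-Za-z0-9_-]{3,50}")


def validate_username(username: str) -> str:
    if not _USERNAME_RE.fullmatch(username):
        raise ValueError(
            "username must be 3-50 characters: letters, digits, underscores, hyphens"
        )
    # Case-insensitive reserved-name check (assumed behaviour).
    if username.lower() in RESERVED_USERNAMES:
        raise ValueError("username is reserved")
    return username
```

A function of this shape can be called from a Pydantic `@field_validator("username")`, since it returns the value on success and raises `ValueError` on failure.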
---
#### SEC-005: Timing-Based User Enumeration on Login ✅ RESOLVED
**Current state (updated Feb 18):** Login now uses constant-time comparison:
- `routers/auth.py` — Always runs `verify_password()` against a dummy bcrypt hash when user is not found, ensuring consistent response time regardless of whether the username exists.
**No further action needed.**
---
#### SEC-006: Pydantic Validation Errors Leak Schema Details
**Current state:** The global validation error handler returns full Pydantic error details:
```python
content={
    "detail": "Validation error",
    "code": "VALIDATION_ERROR",
    "errors": exc.errors(),  # Full field paths, types, constraints
}
```
**Impact:** Attackers can probe endpoints to discover internal field names, types, and validation rules.
**Remediation:** Sanitize error output in production. Return field names and human-readable messages only, strip internal type information.
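A sketch of the sanitizing step, operating on the list of dicts that `exc.errors()` returns (each with keys such as `loc`, `msg`, and `type`). The function name and the output field names are assumptions.

```python
from typing import Any


def sanitize_validation_errors(errors: list[dict[str, Any]]) -> list[dict[str, str]]:
    """Keep only the field path and message; drop type, input, and constraint data."""
    sanitized = []
    for err in errors:
        loc = [str(part) for part in err.get("loc", ())]
        # The first segment is the request part ("body", "query", ...); skip it.
        field = ".".join(loc[1:]) if len(loc) > 1 else ".".join(loc)
        sanitized.append(
            {"field": field, "message": str(err.get("msg", "Invalid value"))}
        )
    return sanitized
```

The handler could apply this only in production and keep the full `exc.errors()` output in development, where the extra detail helps debugging.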
---
#### SEC-007: No Password Complexity Requirements ✅ RESOLVED
**Current state (updated Feb 18):** Password complexity is enforced:
- `schemas/user.py`: `_validate_password_strength()` requires: minimum 12 characters, at least one uppercase, one lowercase, one digit, one special character.
- Applied on `UserCreate`, `UserUpdate`, and `PasswordChange` schemas.
- 6 unit tests covering all complexity rules.
**No further action needed.**
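The complexity rules above could be checked along these lines. This is an illustrative sketch rather than the implementation in `schemas/user.py`; treating "special character" as ASCII punctuation is an assumption.

```python
import string


def validate_password_strength(password: str) -> str:
    checks = {
        "at least 12 characters": len(password) >= 12,
        "an uppercase letter": any(c.isupper() for c in password),
        "a lowercase letter": any(c.islower() for c in password),
        "a digit": any(c.isdigit() for c in password),
        # "Special" is taken to mean ASCII punctuation here (an assumption).
        "a special character": any(c in string.punctuation for c in password),
    }
    missing = [rule for rule, passed in checks.items() if not passed]
    if missing:
        raise ValueError("password must contain " + ", ".join(missing))
    return password
```

Collecting every failed rule into one message lets the user fix the password in a single attempt.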
---
### LOW PRIORITY
#### SEC-008: CORS Origins Not Validated in Production
**Current state:** `CORS_ORIGINS` is a comma-separated string from environment. If set to `*` or overly broad patterns, credentials (HttpOnly cookies) are sent to unintended origins.
**Impact:** Low (requires misconfiguration), but could enable cross-origin attacks.
**Remediation:** Validate `CORS_ORIGINS` at startup — reject `*` when `AEGIS_ENV=production`.
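A startup check along these lines would cover the wildcard case. The function name is hypothetical, and rejecting any origin that contains `*` (not only a bare `*`) is an assumption about how strict the check should be.

```python
def parse_cors_origins(raw: str, env: str) -> list[str]:
    """Split the comma-separated CORS_ORIGINS value; reject wildcards in production."""
    origins = [origin.strip() for origin in raw.split(",") if origin.strip()]
    wildcards = [origin for origin in origins if "*" in origin]
    if env == "production" and wildcards:
        raise RuntimeError(
            f"wildcard CORS origins are not allowed in production: {wildcards}"
        )
    return origins
```

Called once at startup with the values of `CORS_ORIGINS` and `AEGIS_ENV`, this fails fast in the same way as the default-credential check in SEC-002.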
---
#### SEC-009: No Audit Log for Failed Login Attempts
**Current state:** Neither successful nor failed logins are audited. Only post-login actions are recorded.
**Impact:** Cannot detect brute force attacks or compromised account usage patterns.
**Remediation:** Log all login attempts (success/failure) to audit_logs with IP address and timestamp.
---
## 4. Maintainability Risks
### HIGH PRIORITY
#### MR-001: No Dependency Inversion — Everything Points to Concrete Implementations ✅ PARTIALLY RESOLVED
**Current state (updated Feb 18):** Protocol interfaces and dependency injection now exist for core entities:
- `domain/ports/repositories/technique_repository.py`: `@runtime_checkable` Protocol.
- `domain/ports/repositories/test_repository.py`: `@runtime_checkable` Protocol.
- `dependencies/repositories.py`: FastAPI `Depends()` wiring for `SATechniqueRepository` and `SATestRepository`.
- `domain/unit_of_work.py`: `UnitOfWork` context manager for transaction control.
**Remaining:** Services like `notification_service`, `audit_service`, `scoring_service` still use direct imports. Additional ports needed for storage, notifications, and event bus.
**Impact:** New code follows DIP. Old code will be migrated incrementally.
---
#### MR-002: Two Coexisting Architectural Patterns
**Current state:** Some routers delegate to services, others do everything inline. A developer cannot predict where to find or place logic.
| Pattern | Routers |
|---------|---------|
| Delegates to services | tests, scores, notifications, campaigns, snapshots |
| Direct DB queries | techniques, evidence, users, audit, reports, heatmap, metrics, detection_rules, threat_actors, data_sources, compliance |
**Impact:** Inconsistent codebase. New developers learn one pattern and find the other. Code reviews cannot enforce a single standard.
**Remediation:** Establish a single pattern (all through services/use cases) and migrate incrementally.
---
### MEDIUM PRIORITY
#### MR-003: No Type Checking Enforcement
**Current state:** `tsconfig.json` has `strict: true` for the frontend, but the backend has no `mypy` configuration. Python type hints exist but are never verified.
**Impact:** Type errors in Python code go undetected until runtime. Particularly risky for Optional fields and JSONB data.
**Remediation:** Add `mypy` to requirements. Create `mypy.ini` with strict settings. Add to CI pipeline.
---
#### MR-004: Test Infrastructure Uses SQLite Instead of PostgreSQL
**Current state:** `conftest.py` creates an in-memory SQLite database and patches PostgreSQL-specific types (UUID → String, JSONB → JSON):
```python
from sqlalchemy.dialects.postgresql import UUID as PG_UUID, JSONB
# Patch to use SQLite-compatible types
sqlalchemy.dialects.postgresql.UUID = _patched_uuid
sqlalchemy.dialects.postgresql.JSONB = _patched_jsonb
```
**Impact:** Tests pass on SQLite but may fail on real PostgreSQL. JSONB-specific queries (containment `@>`, GIN indexes) are untestable. UUID behavior differs between dialects.
**Remediation:** Use `testcontainers-python` to spin up a real PostgreSQL container for tests, or use PostgreSQL in CI.
---
#### MR-005: Frontend Types Not Generated from Backend Schemas
**Current state:** `types/models.ts` is manually maintained and must stay in sync with `schemas/*.py`. There is no code generation or validation step.
**Impact:** Type drift between frontend and backend. A backend schema change that isn't reflected in `types/models.ts` causes runtime errors in the frontend.
**Remediation:** Generate TypeScript types from OpenAPI spec (`openapi-typescript` or similar). Run as a pre-build step.
---
### LOW PRIORITY
#### MR-006: Documentation Scattered Across Multiple Formats
**Current state:** Documentation exists in `README.md`, `docs/API.md`, `docs/ARCHITECTURE.md`, `docs/DATA_SOURCES.md`, `docs/SCORING.md`, plus the new analysis documents. No central index or documentation site.
**Impact:** New developers must discover docs by browsing. No searchable documentation.
**Remediation:** Create a `docs/INDEX.md` linking all documents. Consider MkDocs or similar for a browsable doc site.
---
#### MR-007: No Conventional Commit or Changelog
**Current state:** No commit message convention enforced. No CHANGELOG file.
**Impact:** Difficult to understand what changed between releases. No automated release notes.
**Remediation:** Adopt Conventional Commits. Add commitlint as a pre-commit hook. Generate CHANGELOG automatically.
---
## 5. Recommended Medium-Term Improvements
### Architecture
| ID | Improvement | Effort | Impact | Status |
|----|-------------|--------|--------|--------|
| IMP-001 | Extract domain exceptions + error handler middleware | 2-3 days | Removes FastAPI dependency from services | ✅ Done |
| IMP-002 | Create repository layer for Test, Technique, Campaign | 1 week | Centralizes queries, enables caching and mocking | ✅ Done (Test, Technique) |
| IMP-003 | Extract heatmap/reports/metrics logic to application services | 1-2 weeks | Thin controllers, testable business logic | ✅ Heatmap done |
| IMP-004 | Persist scoring weights in database | 2-3 days | Eliminates mutable global state | Pending |
| IMP-005 | Add domain entities with behavior (rich models) | 2-3 weeks | Consolidates scattered business rules | ✅ Done (Test, Technique) |
### Scalability
| ID | Improvement | Effort | Impact | Status |
|----|-------------|--------|--------|--------|
| IMP-006 | Batch scoring queries (single SQL per metric) | 1 week | Reduces org score from 3,500 queries to ~10 | ✅ Done |
| IMP-007 | Add missing composite indexes | 1 day | Immediate query performance improvement | ✅ Done |
| IMP-008 | Move score cache to Redis | 2-3 days | Shared cache across workers | Pending |
| IMP-009 | Batch heatmap metadata queries | 2-3 days | Reduces heatmap from 1,400 to 3 queries | ✅ Done |
| IMP-010 | Denormalize MTTD/MTTR timestamps onto Test model | 3-5 days | Eliminates operational metrics N+1 | Pending |
### Security
| ID | Improvement | Effort | Impact | Status |
|----|-------------|--------|--------|--------|
| IMP-011 | Move token blacklist to Redis | 1-2 days | Fixes multi-instance logout | ✅ Done |
| IMP-012 | Reject default credentials in production | 0.5 days | Prevents insecure deployments | ✅ Done |
| IMP-013 | Add rate limiting to write/sync endpoints | 1 day | Prevents DoS from authenticated users | Pending |
| IMP-014 | Add password complexity validation | 0.5 days | Prevents weak passwords | ✅ Done |
| IMP-015 | Add login attempt auditing | 1 day | Enables brute force detection | Pending |
### DevOps
| ID | Improvement | Effort | Impact | Status |
|----|-------------|--------|--------|--------|
| IMP-016 | Create GitHub Actions CI pipeline | 1-2 days | Automated lint + type check + test | ✅ Done |
| IMP-017 | Add mypy strict type checking | 1-2 days | Catches type errors before runtime | Pending |
| IMP-018 | Replace SQLite test DB with PostgreSQL (testcontainers) | 1 day | Tests match production behavior | ✅ CI uses PG |
| IMP-019 | Generate frontend types from OpenAPI | 0.5 days | Eliminates frontend/backend type drift | Pending |
| IMP-020 | Add structured JSON logging | 1-2 days | Production-ready observability | Pending |
---
## 6. Priority Matrix
### Immediate (Sprint 1 — Week 1-2)
| ID | Item | Category | Effort |
|----|------|----------|--------|
| SEC-001 | Move token blacklist to Redis | Security | 1-2 days |
| SEC-002 | Reject default credentials in production | Security | 0.5 days |
| SR-006 | Add missing database indexes | Scalability | 1 day |
| TD-007 | Replace `except: pass` with logging in workflow service | Tech Debt | 0.5 days |
| SEC-007 | Add password complexity requirements | Security | 0.5 days |
| IMP-016 | Create basic CI pipeline | DevOps | 1-2 days |
**Total estimated effort: ~5-7 days**
---
### Short-Term (Sprint 2-3 — Week 3-6)
| ID | Item | Category | Effort |
|----|------|----------|--------|
| TD-003 | Extract domain exceptions, remove HTTPException from services | Tech Debt | 2-3 days |
| SR-001 | Batch scoring queries to eliminate N+1 | Scalability | 1 week |
| SR-003 | Batch heatmap metadata queries | Scalability | 2-3 days |
| TD-002 | Create repository layer for core entities | Tech Debt | 1 week |
| SEC-003 | Add rate limiting to write/sync endpoints | Security | 1 day |
| TD-004 | Persist scoring weights in database | Tech Debt | 2-3 days |
| SEC-009 | Add login attempt auditing | Security | 1 day |
| IMP-008 | Move score cache to Redis | Scalability | 2-3 days |
**Total estimated effort: ~3-4 weeks**
---
### Medium-Term (Month 2-3)
| ID | Item | Category | Effort |
|----|------|----------|--------|
| TD-001 | Extract heatmap/reports/metrics to application services | Tech Debt | 2-3 weeks |
| TD-008 | Expand test coverage to all routers and services | Maintainability | 2-3 weeks |
| TD-005 | Create rich domain entities (Clean Architecture Phase 2) | Tech Debt | 2-3 weeks |
| SR-004 | Optimize report endpoints (SQL aggregations, streaming) | Scalability | 1 week |
| SR-005 | Denormalize MTTD/MTTR timestamps | Scalability | 3-5 days |
| MR-003 | Add mypy type checking | Maintainability | 1-2 days |
| MR-004 | Replace SQLite tests with PostgreSQL | Maintainability | 1 day |
| MR-005 | Generate frontend types from OpenAPI | Maintainability | 0.5 days |
| IMP-020 | Structured JSON logging | DevOps | 1-2 days |
**Total estimated effort: ~6-8 weeks**
---
### Low Priority (Backlog)
| ID | Item | Category | Effort |
|----|------|----------|--------|
| TD-011 | Add retry logic to entrypoint scripts | Tech Debt | 0.5 days |
| TD-012 | Add migration tests against real PostgreSQL | Maintainability | 1 day |
| SR-007 | APScheduler PostgreSQL JobStore for horizontal scaling | Scalability | 2-3 days |
| SR-008 | Fix MinIO presigned URL hostname for production | Scalability | 1 day |
| SEC-008 | Validate CORS origins in production | Security | 0.5 days |
| SEC-005 | Constant-time login to prevent user enumeration | Security | 0.5 days |
| SEC-006 | Sanitize Pydantic validation errors in production | Security | 1 day |
| MR-006 | Create documentation index / MkDocs site | Maintainability | 1-2 days |
| MR-007 | Adopt Conventional Commits + CHANGELOG | Maintainability | 1 day |
| SEC-004 | Username input validation | Security | 0.5 days |
---
## Summary Scorecard
```
┌────────────────────────┬──────┬─────┬─────┬───────┬──────────┬──────┐
│ Category               │ High │ Med │ Low │ Total │ Resolved │ Open │
├────────────────────────┼──────┼─────┼─────┼───────┼──────────┼──────┤
│ Technical Debt         │    5 │   5 │   2 │    12 │        4 │    8 │
│ Scalability Risks      │    3 │   3 │   2 │     8 │        3 │    5 │
│ Security Risks         │    3 │   4 │   2 │     9 │        5 │    4 │
│ Maintainability Risks  │    2 │   3 │   2 │     7 │        1 │    6 │
├────────────────────────┼──────┼─────┼─────┼───────┼──────────┼──────┤
│ TOTAL                  │   13 │  15 │   8 │    36 │       13 │   23 │
└────────────────────────┴──────┴─────┴─────┴───────┴──────────┴──────┘
Resolved: 13 of 36 items (36%), counting partially resolved items
Remaining estimated effort: ~7-9 weeks (down from ~12-15)
```