Aegis — Technical Debt, Risks & Improvement Plan
Author: Architecture review
Date: February 11, 2026 (updated February 18, 2026)
Scope: Backend, Frontend, Infrastructure, Security, Scalability, Maintainability
Note: Items marked with ✅ have been resolved. See inline status annotations.
Table of Contents
- Technical Debt
- Scalability Risks
- Security Risks
- Maintainability Risks
- Recommended Medium-Term Improvements
- Priority Matrix
1. Technical Debt
HIGH PRIORITY
TD-001: Fat Controllers (Routers with Embedded Business Logic)
Current state: 11 of 21 routers execute raw SQLAlchemy queries directly. The worst offenders:
| Router | Lines | Embedded Logic |
|---|---|---|
| `heatmap.py` | 528 | Query building + color mapping + ATT&CK Navigator JSON serialization + export |
| `tests.py` | 664 | CRUD + template instantiation + timeline queries (workflow delegated) |
| `reports.py` | 273 | Aggregation queries + CSV generation + JSON formatting |
| `compliance.py` | ~350 | CRUD + import + gap analysis + CSV export |
| `metrics.py` | ~316 | Complex aggregation queries with in-memory processing |
Impact: Cannot unit test business logic without spinning up FastAPI + DB. Logic duplication across routers. Changes to one query pattern must be replicated manually in every router that uses it.
Remediation: Extract query logic to service/repository layer. Each router endpoint should be < 20 lines.
TD-002: No Repository Layer — Scattered Duplicate Queries ✅ PARTIALLY RESOLVED
Current state (updated Feb 18): Repository ports (Protocol interfaces) and SQLAlchemy implementations now exist for Technique and Test:
- `domain/ports/repositories/technique_repository.py`: Protocol with `find_by_id()`, `find_by_mitre_id()`, `list_all()`, `count_by_status()`, `find_all_with_test_counts()`, `save()`, etc.
- `domain/ports/repositories/test_repository.py`: Protocol with `find_by_id()`, `list_by_technique()`, `get_states_and_results_for_technique()`, etc.
- `infrastructure/persistence/repositories/sa_technique_repository.py`: Concrete implementation with batch queries (eliminates N+1 for heatmap/scoring).
- `infrastructure/persistence/repositories/sa_test_repository.py`: Concrete implementation.
- `dependencies/repositories.py`: FastAPI `Depends()` wiring.
Remaining: Old routers still use direct db.query(). New endpoints should use repositories; existing endpoints will be migrated incrementally.
Impact: New code has centralized query management. Old queries still scattered but coexist safely.
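A minimal sketch of how a port plus an in-memory fake could let new code be unit tested without a database. The method names `find_by_id()` and `find_by_mitre_id()` come from the list above; the `Technique` fields, the exact signatures, `InMemoryTechniqueRepository`, and `technique_name()` are assumptions made for illustration and are not taken from the Aegis codebase.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional, Protocol, runtime_checkable


@dataclass(frozen=True)
class Technique:
    # Hypothetical minimal shape; the real model has more fields.
    id: int
    mitre_id: str
    name: str


@runtime_checkable
class TechniqueRepository(Protocol):
    # Port: callers depend on this interface, not on SQLAlchemy.
    def find_by_id(self, technique_id: int) -> Optional[Technique]: ...

    def find_by_mitre_id(self, mitre_id: str) -> Optional[Technique]: ...


class InMemoryTechniqueRepository:
    """Test double that satisfies the port without a database."""

    def __init__(self, techniques: list[Technique]) -> None:
        self._by_id = {t.id: t for t in techniques}

    def find_by_id(self, technique_id: int) -> Optional[Technique]:
        return self._by_id.get(technique_id)

    def find_by_mitre_id(self, mitre_id: str) -> Optional[Technique]:
        return next(
            (t for t in self._by_id.values() if t.mitre_id == mitre_id), None
        )


def technique_name(repo: TechniqueRepository, mitre_id: str) -> str:
    # Business logic receives the port as a parameter, so a test can pass
    # the in-memory fake and production can pass the SQLAlchemy version.
    technique = repo.find_by_mitre_id(mitre_id)
    return technique.name if technique else "unknown"
```

Because the fake and the SQLAlchemy implementation share one interface, migrating an old router would mainly mean swapping its `db.query()` calls for calls on the injected repository.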
TD-003: Services Depend on FastAPI (HTTPException in Domain Logic) ✅ RESOLVED
Current state (updated Feb 18): Domain exceptions have been implemented and are in active use:
- `domain/errors.py`: Full exception hierarchy (`DomainError`, `EntityNotFoundError`, `DuplicateEntityError`, `InvalidStateTransition`, `BusinessRuleViolation`, `InvalidOperationError`, `PermissionViolation`).
- `domain/exceptions.py`: Backward-compatible re-exports.
- `middleware/error_handler.py`: Maps domain exceptions to HTTP responses automatically (404, 409, 400, 403).
- `test_workflow_service.py`: Now raises `InvalidOperationError` and `InvalidStateTransition` instead of `HTTPException`.
No further action needed for the core services. Some secondary routers may still raise HTTPException directly (which is acceptable at the presentation layer).
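The mapping idea can be sketched without FastAPI. The exception names below are from `domain/errors.py` as listed above, but which exception maps to which status code is an assumption (the source gives only the set 404, 409, 400, 403), and `to_http_error()` and the `code` value are hypothetical.

```python
from __future__ import annotations


class DomainError(Exception):
    """Base class for errors raised by domain logic."""


class EntityNotFoundError(DomainError): ...


class DuplicateEntityError(DomainError): ...


class InvalidStateTransition(DomainError): ...


class PermissionViolation(DomainError): ...


# Assumed pairing; anything not listed falls back to 400.
STATUS_BY_ERROR: dict[type[DomainError], int] = {
    EntityNotFoundError: 404,
    DuplicateEntityError: 409,
    PermissionViolation: 403,
}


def to_http_error(exc: DomainError) -> tuple[int, dict[str, str]]:
    """Translate a domain error into (status_code, JSON body)."""
    # Walk the MRO so subclasses inherit their parent's status code.
    for error_type in type(exc).__mro__:
        if error_type in STATUS_BY_ERROR:
            status = STATUS_BY_ERROR[error_type]
            break
    else:
        status = 400
    return status, {"detail": str(exc), "code": type(exc).__name__}
```

Keeping the table in the presentation layer means services only ever raise domain errors and never import an HTTP framework.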
TD-004: Mutable Global Settings at Runtime
Current state: The scores.py router mutates settings directly:
settings.SCORING_WEIGHT_TESTS = body.weight_tests
settings.SCORING_WEIGHT_DETECTION_RULES = body.weight_detection_rules
Impact: Changes lost on restart. Thread-unsafe with multiple workers. No audit trail for config changes.
Remediation: Persist scoring weights in the database. Create a ScoringConfig table. Load weights from DB in scoring_service.
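A minimal sketch of the remediation, using the standard-library `sqlite3` module in place of the project's PostgreSQL/SQLAlchemy stack. The table layout, the `updated_by` column, and the function names are assumptions; only the two weight names appear in the router code above, and the real weight set may contain more fields.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoringWeights:
    weight_tests: int
    weight_detection_rules: int

    def __post_init__(self) -> None:
        # Mirrors the sum-to-100 rule; simplified here to two weights.
        if self.weight_tests + self.weight_detection_rules != 100:
            raise ValueError("scoring weights must sum to 100")


def init_schema(conn: sqlite3.Connection) -> None:
    # CHECK (id = 1) restricts the table to a single configuration row.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scoring_config ("
        " id INTEGER PRIMARY KEY CHECK (id = 1),"
        " weight_tests INTEGER NOT NULL,"
        " weight_detection_rules INTEGER NOT NULL,"
        " updated_by TEXT NOT NULL,"
        " updated_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
    )


def save_weights(
    conn: sqlite3.Connection, weights: ScoringWeights, updated_by: str
) -> None:
    # "with conn" commits on success and rolls back on error.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO scoring_config"
            " (id, weight_tests, weight_detection_rules, updated_by)"
            " VALUES (1, ?, ?, ?)",
            (weights.weight_tests, weights.weight_detection_rules, updated_by),
        )


def load_weights(
    conn: sqlite3.Connection, default: ScoringWeights
) -> ScoringWeights:
    row = conn.execute(
        "SELECT weight_tests, weight_detection_rules"
        " FROM scoring_config WHERE id = 1"
    ).fetchone()
    return ScoringWeights(*row) if row else default
```

Every worker reads the same row, so the values survive restarts and stay consistent across processes. A full audit trail would still need each change recorded separately, for example in `audit_logs`.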
TD-005: Anemic Domain Models ✅ PARTIALLY RESOLVED
Current state (updated Feb 18): Rich domain entities now exist alongside the ORM models:
- `domain/test_entity.py`: `TestEntity` dataclass with full state machine (`can_transition()`, `transition_to()`, `start_execution()`, `submit_red_evidence()`, `submit_blue_evidence()`, `validate()`, `reopen()`), dual validation, pause/resume timers, and domain events. Comprehensive unit tests (46 tests).
- `domain/entities/technique.py`: `TechniqueEntity` with `recalculate_status()`, `mark_reviewed()`, `flag_for_review()`, `create()`, `from_orm()`/`apply_to()`. Comprehensive unit tests (16 tests).
- `domain/value_objects/mitre_id.py`: Immutable value object with ATT&CK ID validation.
- `domain/value_objects/scoring_weights.py`: Immutable weight set enforcing sum-to-100.
ORM models remain anemic (by design — they are persistence mapping only). Business logic lives in domain entities, bridged via from_orm()/apply_to().
Remaining: Campaign, ComplianceFramework, and other entities still lack domain entity counterparts.
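For entities that still lack a domain counterpart, the value-object style can serve as a template. The following is a sketch of what an immutable ATT&CK ID value object might look like; it is not the contents of `mitre_id.py`, and the `is_subtechnique` and `parent` helpers are hypothetical. It accepts technique IDs only (`T1059`, `T1059.001`).

```python
import re
from dataclasses import dataclass

# Technique IDs are "T" plus four digits; sub-techniques append ".NNN".
_TECHNIQUE_ID = re.compile(r"T\d{4}(\.\d{3})?")


@dataclass(frozen=True)
class MitreId:
    value: str

    def __post_init__(self) -> None:
        # Validation runs at construction, so an invalid ID cannot exist.
        if not _TECHNIQUE_ID.fullmatch(self.value):
            raise ValueError(f"invalid ATT&CK technique ID: {self.value!r}")

    @property
    def is_subtechnique(self) -> bool:
        return "." in self.value

    @property
    def parent(self) -> "MitreId":
        # T1059.001 -> T1059; a top-level technique is its own parent.
        return MitreId(self.value.split(".")[0])
```

Because the dataclass is frozen, instances are hashable and compare by value, which makes them safe as dictionary keys in batch lookups.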
MEDIUM PRIORITY
TD-006: Inconsistent Error Response Format
Current state: API error responses use three different formats:
| Format | Used In |
|---|---|
| `detail: "string"` | Most routers (`techniques.py`, `users.py`, `evidence.py`) |
| `detail: {message, code, ...}` | `tests.py`, `test_workflow_service.py` |
| `detail: "Validation error", code: "VALIDATION_ERROR", errors: [...]` | Global handler in `main.py` |
Impact: Frontend must handle multiple error shapes. No reliable error code for programmatic handling.
Remediation: Standardize all errors to {detail: string, code: string, errors?: [...]}.
TD-007: Silently Swallowed Exceptions in Workflow Service
Current state: test_workflow_service.py has 4 bare except Exception: pass blocks:
| Line | What is swallowed |
|---|---|
| 106 | notify_test_state_change() failure |
| 286 | Notification failure |
| 295 | Notification failure |
| 299 | Score cache invalidation failure |
Impact: Notification failures and cache invalidation errors go completely unnoticed. Users may miss critical workflow notifications with no trace in logs.
Remediation: Replace pass with logger.warning(...) at minimum. Consider async event dispatch so failures don't block the main flow.
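A minimal sketch of the first step, assuming a hypothetical `notify_safely()` wrapper and logger name; the real call sites in `test_workflow_service.py` may be shaped differently.

```python
import logging

logger = logging.getLogger("test_workflow_service")


def notify_safely(notify, test_id: int, new_state: str) -> bool:
    """Run a best-effort side effect; log failures instead of hiding them."""
    try:
        notify(test_id, new_state)
        return True
    except Exception:
        # WARNING plus exc_info keeps the traceback in the logs while
        # signalling that the main workflow continued.
        logger.warning(
            "notification failed for test %s (state=%s)",
            test_id,
            new_state,
            exc_info=True,
        )
        return False
```

The workflow still completes when the side effect fails, but the failure now leaves a traceable log entry.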
TD-008: Test Suite Gaps
Current state: ~167 test functions across 18 files, but coverage is uneven:
| Category | Covered | Not Covered |
|---|---|---|
| Routers | auth, techniques, tests, evidence, test_templates, metrics, system | audit, campaigns, compliance, d3fend, detection_rules, heatmap, operational_metrics, scores, snapshots, threat_actors, users |
| Services | workflow, status, atomic_import, campaign, scoring, notifications | audit, caldera, compliance_import, d3fend, elastic, intel, lolbas, mitre_sync, score_cache, sigma, threat_actor_import |
4 integration tests are pytest.skipped by default (Sigma, LOLBAS, CALDERA, Elastic full imports).
Some tests use inspect.getsource() to verify code structure rather than actually calling endpoints.
Impact: Regressions in untested routers/services go undetected. No security-focused tests (injection, rate limiting, CSRF).
Remediation: Add integration tests for all routers. Add dedicated security test suite. Run skipped integration tests in CI.
TD-009: No CI/CD Pipeline ✅ RESOLVED
Current state (updated Feb 18): A fully functional CI pipeline exists at .github/workflows/ci.yml:
- Runs `ruff` linting on every push/PR.
- Runs `pytest` against a real PostgreSQL + Redis service container.
- Tests run against the same stack as production (not SQLite).
Additionally, scripts/agent_validate_backend.sh provides a local validation script that runs lint + tests inside the Docker container.
No further action needed for basic CI. Potential enhancements: add mypy type checking, Docker build verification.
TD-010: Unstructured Logging
Current state: Logging uses plain format strings with no structured fields:
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s %(levelname)-8s %(name)s — %(message)s",
)
Global exception handlers use logging.error(f"...") instead of a logger instance. No request ID, user ID, or correlation ID in log output.
Impact: Cannot query logs for "all actions by user X" or "all errors in request Y". Log analysis in production requires manual grep.
Remediation: Add structured JSON logging (e.g., structlog or python-json-logger). Include request_id middleware.
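As a dependency-free illustration of the idea, the sketch below uses only the standard library rather than `structlog` or `python-json-logger`. The `request_id_var` context variable, the `user_id` extra field, and the output key names are assumptions; a request-ID middleware would be responsible for setting the variable on each request.

```python
import contextvars
import json
import logging

# Set once per request by a (hypothetical) request-ID middleware.
request_id_var = contextvars.ContextVar("request_id", default="-")


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "request_id": request_id_var.get(),
        }
        # Populated by logger.info(..., extra={"user_id": ...}).
        if hasattr(record, "user_id"):
            payload["user_id"] = record.user_id
        if record.exc_info:
            payload["exc"] = self.formatException(record.exc_info)
        return json.dumps(payload, default=str)


def configure_logging(stream=None) -> None:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    # force=True replaces handlers installed by an earlier basicConfig call.
    logging.basicConfig(level=logging.INFO, handlers=[handler], force=True)
```

With one JSON object per line, "all errors in request Y" becomes a filter on `request_id` instead of a manual grep.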
LOW PRIORITY
TD-011: Entrypoint Scripts Have No Retry Logic
Current state: Both entrypoint.sh and entrypoint.prod.sh use set -e. If alembic upgrade head or python -m app.seed fails, Uvicorn never starts. No retry, no clear error message.
Impact: Transient DB connection failures during container startup cause the backend to fail permanently until manually restarted (Docker restart: always will retry, but seed may fail repeatedly).
Remediation: Add retry loop for migration with backoff. Make seed idempotent and non-fatal.
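The entrypoints themselves are shell scripts; the retry logic is sketched here in Python for consistency with the rest of the backend, as a helper the entrypoint could call. `run_with_retry()` and its parameters are hypothetical, and the injectable `run` and `sleep` arguments exist only to make the helper testable.

```python
import subprocess
import time
from typing import Callable, Sequence


def run_with_retry(
    command: Sequence[str],
    attempts: int = 5,
    base_delay: float = 1.0,
    run: Callable[[Sequence[str]], int] = lambda cmd: subprocess.call(list(cmd)),
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run command until it exits 0, doubling the delay after each failure."""
    for attempt in range(1, attempts + 1):
        if run(command) == 0:
            return True
        if attempt < attempts:
            delay = base_delay * 2 ** (attempt - 1)
            print(
                f"{command[0]} failed (attempt {attempt}/{attempts}); "
                f"retrying in {delay:.0f}s"
            )
            sleep(delay)
    return False
```

An entrypoint could treat a `False` result from `run_with_retry(["alembic", "upgrade", "head"])` as fatal with a clear message, while logging a failed seed step and continuing to start Uvicorn.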
TD-012: No Database Migration Tests
Current state: Alembic migrations (18 versions) are never tested in isolation. The test suite uses in-memory SQLite with tables created from models, bypassing Alembic entirely.
Impact: Migration scripts may fail on real PostgreSQL (different dialect, JSONB handling) despite tests passing on SQLite.
Remediation: Add a CI step that runs alembic upgrade head against a real PostgreSQL container.
2. Scalability Risks
HIGH PRIORITY
SR-001: N+1 Query Explosion in Scoring Engine ✅ PARTIALLY RESOLVED
Current state (updated Feb 18): The worst N+1 patterns have been addressed:
- `scoring_service.py`: `bulk_technique_scores()` performs 5 aggregated subqueries to fetch all scoring data in bulk, reducing organization-wide scoring from ~3,500 queries to ~5.
- `SATechniqueRepository.find_all_with_test_counts()`: Single query with subqueries for test counts, validated test counts, and detection rule counts.
- Heatmap service uses batch-fetching techniques.
Remaining: Individual technique scoring (calculate_technique_score()) still performs per-technique queries when called in isolation. create_snapshot() could benefit from using the bulk method.
Impact: Organization score calculation reduced from seconds to sub-second. Individual technique scoring unchanged.
SR-002: In-Memory Cache Does Not Scale
Current state: score_cache.py uses a Python dict with 300-second TTL. Each worker process has its own cache.
Impact: With N workers, each has a cold cache on startup and after every TTL expiration. Cache miss triggers the full org score calculation (3,500+ queries). Effectively no caching under multiple workers.
Remediation: Move cache to Redis. Invalidate granularly when tests or techniques change.
SR-003: Heatmap Endpoints Load All Techniques Without Pagination ✅ RESOLVED
Current state (updated Feb 18): The heatmap service has been extracted and optimized:
- `services/heatmap_service.py`: Dedicated service with batch-fetching techniques (pre-aggregated `test_counts`, `rule_counts` in 2 SQL subqueries instead of N+1).
- The `SATechniqueRepository.find_all_with_test_counts()` method provides a single-query alternative for scoring/heatmap use cases.
- Router reduced from ~528 lines to a thin delegation layer.
No further action needed for query performance. The repository method can replace direct usage in remaining endpoints.
MEDIUM PRIORITY
SR-004: Reports Load Full Tables Into Memory
Current state: All 4 report endpoints load unbounded result sets:
| Endpoint | Pattern |
|---|---|
| `coverage-summary` | All techniques + per-technique test count query (N+1) |
| `coverage-csv` | Same as above + CSV serialization in memory |
| `test-results` | All tests, aggregated in Python |
| `remediation-status` | All tests, filtered in Python |
Impact: For datasets with thousands of tests, memory usage spikes. No streaming — entire response built in memory before sending.
Remediation: Use SQL aggregations. Stream CSV output. Add date range filters as required parameters.
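The streaming part of the remediation can be sketched as a generator that emits one CSV line at a time. `stream_csv()` is a hypothetical helper; the column names in the usage are illustrative only.

```python
import csv
import io
from typing import Iterable, Iterator, Sequence


def stream_csv(
    header: Sequence[str], rows: Iterable[Sequence[object]]
) -> Iterator[str]:
    """Yield CSV text one line at a time so the full file never sits in memory."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)

    def encode(row: Sequence[object]) -> str:
        # Reuse one small buffer: write a row, read it back, then reset.
        writer.writerow(row)
        line = buffer.getvalue()
        buffer.seek(0)
        buffer.truncate(0)
        return line

    yield encode(header)
    for row in rows:
        yield encode(row)
```

In FastAPI, a generator like this could be passed to `StreamingResponse(..., media_type="text/csv")`, with `rows` fed from a database cursor that is itself iterated lazily.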
SR-005: Operational Metrics N+1 on Audit Logs
Current state: MTTD and MTTR calculations in operational_metrics_service.py load all validated tests, then query AuditLog twice per test to find state transition timestamps.
Impact: For 500 validated tests: 1,000 audit log queries. Grows linearly with test count.
Remediation: Denormalize key timestamps onto the Test model (e.g., red_started_at, blue_started_at, remediation_completed_at) or use a single batch audit log query with window functions.
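The batch-query alternative can be sketched with `sqlite3` standing in for PostgreSQL and a plain `GROUP BY` standing in for window functions. The column names follow the `audit_logs` indexes listed under SR-006; the `'test'` entity type, the action names used in the test data, and `first_transition_times()` are assumptions.

```python
import sqlite3
from collections import defaultdict


def first_transition_times(
    conn: sqlite3.Connection, test_ids: list, actions: list
) -> dict:
    """Fetch the earliest timestamp per (test, action) in one query."""
    if not test_ids or not actions:
        return {}
    # Only "?" placeholders are interpolated; values are bound as parameters.
    id_marks = ",".join("?" * len(test_ids))
    action_marks = ",".join("?" * len(actions))
    rows = conn.execute(
        "SELECT entity_id, action, MIN(timestamp) FROM audit_logs "
        f"WHERE entity_type = 'test' AND entity_id IN ({id_marks}) "
        f"AND action IN ({action_marks}) "
        "GROUP BY entity_id, action",
        [*test_ids, *actions],
    ).fetchall()
    result = defaultdict(dict)
    for entity_id, action, timestamp in rows:
        result[entity_id][action] = timestamp
    return dict(result)
```

For 500 validated tests this issues one query instead of 1,000. Very large ID lists would need to be chunked to stay under the database's bound-parameter limit.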
SR-006: Missing Database Indexes ✅ RESOLVED
Current state (updated Feb 18): All critical indexes are now in place:
| Table | Index | Status |
|---|---|---|
| `tests` | `(technique_id, state)` | ✅ Exists (model `__table_args__`) |
| `tests` | `(created_at)`, `(state, created_at)` | ✅ Added in migration `b024` |
| `techniques` | `(tactic)` | ✅ Added in migration `b026` |
| `techniques` | `(status_global)` | ✅ Added in migration `b026` |
| `audit_logs` | `(entity_type, entity_id)`, `(timestamp)`, `(entity_type, entity_id, action)` | ✅ Exists (model `__table_args__`) |
| `detection_rules` | `(mitre_technique_id)`, `(source)`, `(severity)` | ✅ Exists (model `__table_args__`) |
No further action needed.
LOW PRIORITY
SR-007: Single-Instance Scheduler Constraint
Current state: APScheduler runs in-process. If multiple backend instances exist, each runs its own scheduler — causing duplicate MITRE syncs, duplicate snapshots, duplicate campaign spawns.
Impact: No impact today (single instance), but blocks horizontal scaling.
Remediation: Use APScheduler PostgreSQL JobStore for distributed locking. Or migrate to Celery Beat.
SR-008: Evidence Presigned URLs Point to Internal Hostname
Current state: MinIO presigned URLs contain minio:9000 (Docker internal hostname), which is not resolvable from the user's browser.
Impact: Evidence download links fail in production unless Nginx proxies MinIO or MinIO has a public endpoint.
Remediation: Configure MINIO_EXTERNAL_ENDPOINT env var. Use it when generating presigned URLs.
3. Security Risks
HIGH PRIORITY
SEC-001: In-Memory Token Blacklist ✅ RESOLVED
Current state (updated Feb 18): The token blacklist is now Redis-backed:
- `infrastructure/redis_client.py`: Singleton Redis connection.
- `auth.py`: `blacklist_token()` and `is_token_blacklisted()` use Redis with TTL matching token expiration.
- Shared across all workers. Survives server restarts.
No further action needed.
SEC-002: Default Credentials in Configuration ✅ RESOLVED
Current state (updated Feb 18): Production startup validation now rejects default credentials:
- `config.py`: `SECRET_KEY` validation already existed; now also checks `MINIO_ACCESS_KEY` and `MINIO_SECRET_KEY` against their defaults (`minioadmin`). Fails fast with `RuntimeError` in production mode.
- The `install.sh` script generates random passwords for production.
No further action needed.
SEC-003: Rate Limiting Only on Login
Current state: SlowAPI rate limiting is applied only to POST /auth/login (5/minute). All other endpoints have no rate limits:
| Unprotected Endpoint | Risk |
|---|---|
| `POST /users` | Bulk user creation |
| `POST /tests` | Resource exhaustion |
| `POST /system/sync-mitre` | Repeated expensive syncs |
| `POST /system/import-atomic-tests` | Repeated 40MB ZIP downloads |
| `POST /tests/{id}/evidence` | Large file upload flooding |
| `GET /reports/*` | Expensive report generation DoS |
Impact: An authenticated attacker (or compromised account) can DoS the system by triggering expensive operations repeatedly.
Remediation: Add tiered rate limits: strict on auth, moderate on write endpoints, relaxed on read endpoints. Add specific limits on sync/import endpoints (1/hour).
MEDIUM PRIORITY
SEC-004: No Input Validation on Username ✅ RESOLVED
Current state (updated Feb 18): Username validation is now enforced:
- `schemas/user.py`: `_validate_username()` enforces 3-50 characters, allows only letters/digits/underscores/hyphens, and rejects reserved names (`admin`, `root`, `system`, `api`, `null`, `undefined`, `administrator`, `superuser`, `aegis`).
- Applied via `@field_validator("username")` on `UserCreate`.
- 9 unit tests covering valid, invalid, and reserved usernames.
No further action needed.
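The rules listed above could be expressed roughly as follows. This is a sketch and not the contents of `schemas/user.py`: restricting "letters" to ASCII and matching reserved names case-insensitively are both assumptions, and the function is written as a plain callable rather than a Pydantic validator.

```python
import re

RESERVED_USERNAMES = frozenset(
    {
        "admin", "root", "system", "api", "null",
        "undefined", "administrator", "superuser", "aegis",
    }
)

# 3-50 characters: ASCII letters, digits, underscores, hyphens.
_USERNAME = re.compile(r"[A-Za-z0-9_-]{3,50}")


def validate_username(value: str) -> str:
    if not _USERNAME.fullmatch(value):
        raise ValueError(
            "username must be 3-50 characters: "
            "letters, digits, underscores, hyphens"
        )
    # Assumed: "Admin" is as reserved as "admin".
    if value.lower() in RESERVED_USERNAMES:
        raise ValueError("username is reserved")
    return value
```

Returning the value unchanged keeps the function usable as the body of a field validator.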
SEC-005: Timing-Based User Enumeration on Login ✅ RESOLVED
Current state (updated Feb 18): Login now uses constant-time comparison:
- `routers/auth.py`: Always runs `verify_password()` against a dummy bcrypt hash when the user is not found, ensuring consistent response time regardless of whether the username exists.
No further action needed.
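The dummy-hash technique can be illustrated with the standard library. This sketch swaps bcrypt for PBKDF2 (`hashlib.pbkdf2_hmac`) so that it runs without third-party packages; the `users` dictionary, the iteration count, and the function names are assumptions and not the code in `routers/auth.py`.

```python
from __future__ import annotations

import hashlib
import hmac
import os

_ITERATIONS = 100_000  # illustrative work factor


def hash_password(password: str, salt: bytes) -> bytes:
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, _ITERATIONS)


# Computed once at import; used whenever the username does not exist.
_DUMMY_SALT = os.urandom(16)
_DUMMY_HASH = hash_password("not-a-real-password", _DUMMY_SALT)


def authenticate(
    users: dict[str, tuple[bytes, bytes]], username: str, password: str
) -> bool:
    """users maps username -> (salt, hash); both paths do the same hashing work."""
    record = users.get(username)
    salt, expected = record if record else (_DUMMY_SALT, _DUMMY_HASH)
    candidate = hash_password(password, salt)
    # compare_digest avoids leaking how many leading bytes matched.
    matches = hmac.compare_digest(candidate, expected)
    # A match against the dummy hash must never authenticate anyone.
    return matches and record is not None
```

The expensive hash runs on every attempt, so an unknown username costs about as much time as a wrong password.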
SEC-006: Pydantic Validation Errors Leak Schema Details
Current state: The global validation error handler returns full Pydantic error details:
content={
"detail": "Validation error",
"code": "VALIDATION_ERROR",
"errors": exc.errors(), # Full field paths, types, constraints
}
Impact: Attackers can probe endpoints to discover internal field names, types, and validation rules.
Remediation: Sanitize error output in production. Return field names and human-readable messages only, strip internal type information.
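One possible shape for the sanitizer, assuming the input is the list of dictionaries returned by `exc.errors()` (each with `loc`, `msg`, and `type` keys). The function name and the output keys `field` and `message` are hypothetical.

```python
from __future__ import annotations

from typing import Any

# Leading "loc" element that FastAPI uses to name the request part.
_REQUEST_PARTS = {"body", "query", "path", "header", "cookie"}


def sanitize_validation_errors(
    errors: list[dict[str, Any]]
) -> list[dict[str, str]]:
    """Keep only the field path and message from Pydantic-style error dicts."""
    sanitized = []
    for error in errors:
        # "loc" is a tuple such as ("body", "username"); drop the request
        # part and join the remainder into a dotted field path.
        loc = [str(part) for part in error.get("loc", ())]
        if loc and loc[0] in _REQUEST_PARTS:
            loc = loc[1:]
        sanitized.append(
            {
                "field": ".".join(loc),
                "message": str(error.get("msg", "Invalid value")),
            }
        )
    return sanitized
```

This drops `type`, `input`, and `ctx`. The `msg` text can still mention constraints, so a stricter production mode might replace it with a fixed message per field.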
SEC-007: No Password Complexity Requirements ✅ RESOLVED
Current state (updated Feb 18): Password complexity is enforced:
- `schemas/user.py`: `_validate_password_strength()` requires a minimum of 12 characters and at least one uppercase letter, one lowercase letter, one digit, and one special character.
- Applied on `UserCreate`, `UserUpdate`, and `PasswordChange` schemas.
- 6 unit tests covering all complexity rules.
No further action needed.
LOW PRIORITY
SEC-008: CORS Origins Not Validated in Production
Current state: CORS_ORIGINS is a comma-separated string from environment. If set to * or overly broad patterns, credentials (HttpOnly cookies) are sent to unintended origins.
Impact: Low (requires misconfiguration), but could enable cross-origin attacks.
Remediation: Validate CORS_ORIGINS at startup — reject * when AEGIS_ENV=production.
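A minimal sketch of such a startup check. `parse_cors_origins()` is a hypothetical helper; the `"production"` value is what `AEGIS_ENV` is compared against above, and rejecting any origin that contains `*` (not only a bare `*`) is an assumption about what counts as "overly broad".

```python
from __future__ import annotations


def parse_cors_origins(raw: str, env: str) -> list[str]:
    """Split a comma-separated CORS_ORIGINS value and validate it."""
    origins = [origin.strip() for origin in raw.split(",") if origin.strip()]
    if env == "production":
        # Wildcards combined with credentialed requests are the risk here.
        wildcards = [origin for origin in origins if "*" in origin]
        if wildcards:
            raise RuntimeError(
                f"wildcard CORS origins are not allowed in production: {wildcards}"
            )
    return origins
```

Raising at startup matches the fail-fast behaviour already used for default credentials in `config.py`.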
SEC-009: No Audit Log for Failed Login Attempts
Current state: Neither successful nor failed logins are audited. Only post-login actions are recorded.
Impact: Cannot detect brute force attacks or compromised account usage patterns.
Remediation: Log all login attempts (success/failure) to audit_logs with IP address and timestamp.
4. Maintainability Risks
HIGH PRIORITY
MR-001: No Dependency Inversion — Everything Points to Concrete Implementations ✅ PARTIALLY RESOLVED
Current state (updated Feb 18): Protocol interfaces and dependency injection now exist for core entities:
- `domain/ports/repositories/technique_repository.py`: `@runtime_checkable` Protocol.
- `domain/ports/repositories/test_repository.py`: `@runtime_checkable` Protocol.
- `dependencies/repositories.py`: FastAPI `Depends()` wiring for `SATechniqueRepository` and `SATestRepository`.
- `domain/unit_of_work.py`: `UnitOfWork` context manager for transaction control.
Remaining: Services like notification_service, audit_service, scoring_service still use direct imports. Additional ports needed for storage, notifications, and event bus.
Impact: New code follows DIP. Old code will be migrated incrementally.
MR-002: Two Coexisting Architectural Patterns
Current state: Some routers delegate to services, others do everything inline. A developer cannot predict where to find or place logic.
| Pattern | Routers |
|---|---|
| Delegates to services | tests, scores, notifications, campaigns, snapshots |
| Direct DB queries | techniques, evidence, users, audit, reports, heatmap, metrics, detection_rules, threat_actors, data_sources, compliance |
Impact: Inconsistent codebase. New developers learn one pattern and find the other. Code reviews cannot enforce a single standard.
Remediation: Establish a single pattern (all through services/use cases) and migrate incrementally.
MEDIUM PRIORITY
MR-003: No Type Checking Enforcement
Current state: tsconfig.json has strict: true for the frontend, but the backend has no mypy configuration. Python type hints exist but are never verified.
Impact: Type errors in Python code go undetected until runtime. Particularly risky for Optional fields and JSONB data.
Remediation: Add mypy to requirements. Create mypy.ini with strict settings. Add to CI pipeline.
MR-004: Test Infrastructure Uses SQLite Instead of PostgreSQL
Current state: conftest.py creates an in-memory SQLite database and patches PostgreSQL-specific types (UUID → String, JSONB → JSON):
from sqlalchemy.dialects.postgresql import UUID as PG_UUID, JSONB
# Patch to use SQLite-compatible types
sqlalchemy.dialects.postgresql.UUID = _patched_uuid
sqlalchemy.dialects.postgresql.JSONB = _patched_jsonb
Impact: Tests pass on SQLite but may fail on real PostgreSQL. JSONB-specific queries (containment @>, GIN indexes) are untestable. UUID behavior differs between dialects.
Remediation: Use testcontainers-python to spin up a real PostgreSQL container for tests, or use PostgreSQL in CI.
MR-005: Frontend Types Not Generated from Backend Schemas
Current state: types/models.ts is manually maintained and must stay in sync with schemas/*.py. There is no code generation or validation step.
Impact: Type drift between frontend and backend. A backend schema change that isn't reflected in types/models.ts causes runtime errors in the frontend.
Remediation: Generate TypeScript types from OpenAPI spec (openapi-typescript or similar). Run as a pre-build step.
LOW PRIORITY
MR-006: Documentation Scattered Across Multiple Formats
Current state: Documentation exists in README.md, docs/API.md, docs/ARCHITECTURE.md, docs/DATA_SOURCES.md, docs/SCORING.md, plus the new analysis documents. No central index or documentation site.
Impact: New developers must discover docs by browsing. No searchable documentation.
Remediation: Create a docs/INDEX.md linking all documents. Consider MkDocs or similar for a browsable doc site.
MR-007: No Conventional Commit or Changelog
Current state: No commit message convention enforced. No CHANGELOG file.
Impact: Difficult to understand what changed between releases. No automated release notes.
Remediation: Adopt Conventional Commits. Add commitlint as a pre-commit hook. Generate CHANGELOG automatically.
5. Recommended Medium-Term Improvements
Architecture
| ID | Improvement | Effort | Impact | Status |
|---|---|---|---|---|
| IMP-001 | Extract domain exceptions + error handler middleware | 2-3 days | Removes FastAPI dependency from services | ✅ Done |
| IMP-002 | Create repository layer for Test, Technique, Campaign | 1 week | Centralizes queries, enables caching and mocking | ✅ Done (Test, Technique) |
| IMP-003 | Extract heatmap/reports/metrics logic to application services | 1-2 weeks | Thin controllers, testable business logic | ✅ Heatmap done |
| IMP-004 | Persist scoring weights in database | 2-3 days | Eliminates mutable global state | Pending |
| IMP-005 | Add domain entities with behavior (rich models) | 2-3 weeks | Consolidates scattered business rules | ✅ Done (Test, Technique) |
Scalability
| ID | Improvement | Effort | Impact | Status |
|---|---|---|---|---|
| IMP-006 | Batch scoring queries (single SQL per metric) | 1 week | Reduces org score from 3,500 queries to ~10 | ✅ Done |
| IMP-007 | Add missing composite indexes | 1 day | Immediate query performance improvement | ✅ Done |
| IMP-008 | Move score cache to Redis | 2-3 days | Shared cache across workers | Pending |
| IMP-009 | Batch heatmap metadata queries | 2-3 days | Reduces heatmap from 1,400 to 3 queries | ✅ Done |
| IMP-010 | Denormalize MTTD/MTTR timestamps onto Test model | 3-5 days | Eliminates operational metrics N+1 | Pending |
Security
| ID | Improvement | Effort | Impact | Status |
|---|---|---|---|---|
| IMP-011 | Move token blacklist to Redis | 1-2 days | Fixes multi-instance logout | ✅ Done |
| IMP-012 | Reject default credentials in production | 0.5 days | Prevents insecure deployments | ✅ Done |
| IMP-013 | Add rate limiting to write/sync endpoints | 1 day | Prevents DoS from authenticated users | Pending |
| IMP-014 | Add password complexity validation | 0.5 days | Prevents weak passwords | ✅ Done |
| IMP-015 | Add login attempt auditing | 1 day | Enables brute force detection | Pending |
DevOps
| ID | Improvement | Effort | Impact | Status |
|---|---|---|---|---|
| IMP-016 | Create GitHub Actions CI pipeline | 1-2 days | Automated lint + type check + test | ✅ Done |
| IMP-017 | Add mypy strict type checking | 1-2 days | Catches type errors before runtime | Pending |
| IMP-018 | Replace SQLite test DB with PostgreSQL (testcontainers) | 1 day | Tests match production behavior | ✅ CI uses PG |
| IMP-019 | Generate frontend types from OpenAPI | 0.5 days | Eliminates frontend/backend type drift | Pending |
| IMP-020 | Add structured JSON logging | 1-2 days | Production-ready observability | Pending |
6. Priority Matrix
Immediate (Sprint 1 — Week 1-2)
| ID | Item | Category | Effort |
|---|---|---|---|
| SEC-001 | Move token blacklist to Redis | Security | 1-2 days |
| SEC-002 | Reject default credentials in production | Security | 0.5 days |
| SR-006 | Add missing database indexes | Scalability | 1 day |
| TD-007 | Replace `except: pass` with logging in workflow service | Tech Debt | 0.5 days |
| SEC-007 | Add password complexity requirements | Security | 0.5 days |
| IMP-016 | Create basic CI pipeline | DevOps | 1-2 days |
Total estimated effort: ~5-7 days
Short-Term (Sprint 2-3 — Week 3-6)
| ID | Item | Category | Effort |
|---|---|---|---|
| TD-003 | Extract domain exceptions, remove HTTPException from services | Tech Debt | 2-3 days |
| SR-001 | Batch scoring queries to eliminate N+1 | Scalability | 1 week |
| SR-003 | Batch heatmap metadata queries | Scalability | 2-3 days |
| TD-002 | Create repository layer for core entities | Tech Debt | 1 week |
| SEC-003 | Add rate limiting to write/sync endpoints | Security | 1 day |
| TD-004 | Persist scoring weights in database | Tech Debt | 2-3 days |
| SEC-009 | Add login attempt auditing | Security | 1 day |
| IMP-008 | Move score cache to Redis | Scalability | 2-3 days |
Total estimated effort: ~3-4 weeks
Medium-Term (Month 2-3)
| ID | Item | Category | Effort |
|---|---|---|---|
| TD-001 | Extract heatmap/reports/metrics to application services | Tech Debt | 2-3 weeks |
| TD-008 | Expand test coverage to all routers and services | Maintainability | 2-3 weeks |
| TD-005 | Create rich domain entities (Clean Architecture Phase 2) | Tech Debt | 2-3 weeks |
| SR-004 | Optimize report endpoints (SQL aggregations, streaming) | Scalability | 1 week |
| SR-005 | Denormalize MTTD/MTTR timestamps | Scalability | 3-5 days |
| MR-003 | Add mypy type checking | Maintainability | 1-2 days |
| MR-004 | Replace SQLite tests with PostgreSQL | Maintainability | 1 day |
| MR-005 | Generate frontend types from OpenAPI | Maintainability | 0.5 days |
| IMP-020 | Structured JSON logging | DevOps | 1-2 days |
Total estimated effort: ~6-8 weeks
Low Priority (Backlog)
| ID | Item | Category | Effort |
|---|---|---|---|
| TD-011 | Add retry logic to entrypoint scripts | Tech Debt | 0.5 days |
| TD-012 | Add migration tests against real PostgreSQL | Maintainability | 1 day |
| SR-007 | APScheduler PostgreSQL JobStore for horizontal scaling | Scalability | 2-3 days |
| SR-008 | Fix MinIO presigned URL hostname for production | Scalability | 1 day |
| SEC-008 | Validate CORS origins in production | Security | 0.5 days |
| SEC-005 | Constant-time login to prevent user enumeration | Security | 0.5 days |
| SEC-006 | Sanitize Pydantic validation errors in production | Security | 1 day |
| MR-006 | Create documentation index / MkDocs site | Maintainability | 1-2 days |
| MR-007 | Adopt Conventional Commits + CHANGELOG | Maintainability | 1 day |
| SEC-004 | Username input validation | Security | 0.5 days |
Summary Scorecard
| Category | High | Med | Low | Total | Resolved | Open |
|---|---|---|---|---|---|---|
| Technical Debt | 5 | 5 | 2 | 12 | 4 | 8 |
| Scalability Risks | 3 | 3 | 2 | 8 | 3 | 5 |
| Security Risks | 3 | 4 | 2 | 9 | 5 | 4 |
| Maintainability Risks | 2 | 3 | 2 | 7 | 1 | 6 |
| TOTAL | 13 | 15 | 8 | 36 | 13 | 23 |
Resolved: 13 of 36 items (36%)
Remaining estimated effort: ~7-9 weeks (down from ~12-15)