refactor(detection-rules): extract query/business logic to detection_rule_service, router is thin HTTP adapter

2026-02-19 17:39:31 +01:00
parent d305db8794
commit 560fc0c9f0
7 changed files with 5853 additions and 282 deletions
@@ -0,0 +1,394 @@
+# Aegis — Architecture Decision Records (ADR)
+
+> **Date:** February 11, 2026  
+> **Status:** All decisions are **Accepted** and currently in effect.
+
+---
+
+## Index
+
+| ADR | Title | Status |
+|-----|-------|--------|
+| [ADR-001](#adr-001-fastapi-as-backend-framework) | FastAPI as Backend Framework | Accepted |
+| [ADR-002](#adr-002-postgresql-with-jsonb-as-primary-database) | PostgreSQL with JSONB as Primary Database | Accepted |
+| [ADR-003](#adr-003-minio-for-evidence-storage) | MinIO for Evidence Storage | Accepted |
+| [ADR-004](#adr-004-docker-compose-for-deployment) | Docker Compose for Deployment | Accepted |
+| [ADR-005](#adr-005-modular-monolith-over-microservices) | Modular Monolith over Microservices | Accepted |
+| [ADR-006](#adr-006-apscheduler-in-process-over-external-job-system) | APScheduler In-Process over External Job System | Accepted |
+
+---
+
+## ADR-001: FastAPI as Backend Framework
+
+**Date:** Project inception  
+**Status:** Accepted
+
+### Context
+
+Aegis is an internal security platform for managing MITRE ATT&CK coverage through Red/Blue team validation workflows. The backend must:
+
+- Expose a REST API consumed by a React SPA (21 pages, 80+ endpoints).
+- Handle CRUD operations for 18+ domain entities with complex filtering and joins.
+- Support file uploads (evidence) and streaming downloads (CSV/JSON exports).
+- Integrate with external APIs (MITRE TAXII 2.0, GitHub REST, D3FEND REST).
+- Enforce RBAC authorization across 6 roles.
+- Be developed and maintained by a small team requiring fast iteration.
+- Run in a containerized environment with Python as the team's primary language.
+
+### Decision
+
+We chose **FastAPI** as the backend framework, served by **Uvicorn** (ASGI).
+
+Key factors:
+- **Automatic OpenAPI/Swagger** generation from type hints reduces documentation burden for 80+ endpoints.
+- **Pydantic integration** provides request/response validation with zero boilerplate, critical for a schema-heavy domain (test workflows, scoring payloads, compliance data).
+- **`Depends()` system** provides clean dependency injection for auth, DB sessions, and role checks without a third-party DI container.
+- **Async-capable** but allows synchronous route handlers, which matters because the ORM is SQLAlchemy in synchronous mode and the external data imports are blocking operations.
+- **Performance** is sufficient for an internal tool (< 100 concurrent users) without needing Go/Rust-level throughput.
+- **Python ecosystem** gives direct access to `taxii2-client`, `pySigma`, `boto3`, `PyYAML`, and `toml` — all required for the 8 external data source integrations.
+
+### Consequences
+
+**Positive:**
+- Swagger UI available in development (`/docs`) for rapid API exploration and testing.
+- Pydantic schemas act as living documentation for the API contract.
+- `Depends()` chain for `get_db` → `get_current_user` → `require_role()` is concise and composable.
+- `python-jose` + `passlib` integrate naturally for JWT/bcrypt auth.
+- SlowAPI integrates directly with FastAPI for rate limiting.
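
The `require_role()` link in that chain is essentially a dependency factory. A minimal, framework-free sketch of the idea follows; the `User` shape, the `Forbidden` exception, and the role names are assumptions made for illustration, not the project's actual definitions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class User:
    """Hypothetical stand-in for the authenticated user object."""
    username: str
    role: str


class Forbidden(Exception):
    """Raised when the caller's role is not allowed (would map to HTTP 403)."""


def require_role(*allowed: str) -> Callable[[User], User]:
    """Build a checker that returns the user only if their role is allowed."""
    allowed_roles = frozenset(allowed)

    def checker(user: User) -> User:
        if user.role not in allowed_roles:
            raise Forbidden(f"role {user.role!r} is not permitted")
        return user

    return checker


# Role names here are illustrative only.
admin_or_red = require_role("admin", "red_team")
```

In a FastAPI route, `checker` would typically declare `user: User = Depends(get_current_user)` and the route would declare `Depends(require_role(...))`, so the check runs before the handler body.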
+
+**Negative:**
+- The `Depends()` system encourages passing `db: Session` directly into route handlers, which has led to routers containing raw SQLAlchemy queries instead of delegating to a service/repository layer (see ADR analysis — 11 of 21 routers query the DB directly).
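+- Synchronous (`def`) route handlers run in FastAPI's worker threadpool rather than on the event loop, so a long operation occupies one worker thread for its full duration (MITRE sync ZIP downloads can take 30+ seconds); an `async def` handler making the same blocking calls would block the event loop. The Nginx proxy timeout of 300s keeps these requests from being cut off, but it does not free the thread.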
+- No built-in background task system beyond `BackgroundTasks` (which is request-scoped), requiring APScheduler for scheduled jobs (see ADR-006).
+
+**Risks:**
+- FastAPI's ease of putting logic in route handlers has contributed to "fat controllers" — this is a developer discipline issue, not a framework limitation.
+
+### Alternatives Considered
+
+| Alternative | Reason Rejected |
+|------------|-----------------|
+| **Django + DRF** | Heavier ORM opinions, admin panel unnecessary, slower startup. Django's ORM lacks SQLAlchemy's flexibility with JSONB and complex joins. |
+| **Flask + Flask-RESTful** | No built-in validation, no auto-generated OpenAPI, manual Swagger setup. Would require marshmallow or similar for schema validation. |
+| **Go (Gin/Echo)** | Team's primary expertise is Python. The 8 data source integrations rely heavily on Python libraries (pySigma, taxii2-client, PyYAML). |
+| **NestJS (Node.js)** | Would split the team across two runtimes. Python libraries for STIX/TAXII and Sigma rule parsing have no mature Node.js equivalents. |
+
+---
+
+## ADR-002: PostgreSQL with JSONB as Primary Database
+
+**Date:** Project inception  
+**Status:** Accepted
+
+### Context
+
+Aegis manages a complex relational domain: techniques have tests, tests belong to campaigns, threat actors map to techniques, compliance controls map to techniques, detection rules map to techniques and tests. This is a deeply relational model with 18+ tables and many-to-many relationships.
+
+However, several entities also carry semi-structured data that varies by source:
+- **Audit logs** — `details` field contains arbitrary action metadata (different structure per action type).
+- **Threat actors** — `aliases`, `target_sectors`, `target_regions`, `references` are variable-length arrays/objects from STIX 2.0 bundles.
+- **Detection rules** — `platforms` (array), `log_sources` (object with varying keys like `product`, `service`, `category`).
+- **Data sources** — `last_sync_stats` (object with import-specific counters), `config` (source-specific configuration).
+- **Techniques** — `platforms` (array of OS names from ATT&CK).
+- **Campaigns** — `tags` (user-defined array).
+
+This data is imported from external sources with varying schemas (STIX JSON, Sigma YAML, Elastic TOML) and must be stored without rigid column definitions.
+
+### Decision
+
+We chose **PostgreSQL 15** as the primary database, using its native **JSONB** column type for semi-structured fields alongside traditional relational columns for the core domain.
+
+The schema is managed by **Alembic** (18 migration versions) with **SQLAlchemy** ORM using `sqlalchemy.dialects.postgresql.JSONB`.
+
+### Consequences
+
+**Positive:**
+- Relational integrity enforced with foreign keys for the core domain (test → technique, campaign → test, evidence → test, etc.).
+- JSONB columns store variable-structure data without schema migrations when external sources change their format.
+- JSONB supports GIN indexing for efficient containment queries (`@>` operator) on arrays like `platforms` and `target_sectors`.
+- Single database to operate — no need for a separate document store.
+- PostgreSQL's mature ecosystem: `pg_dump` for backups, `pg_isready` for health checks, extensive monitoring tooling.
+- SQLAlchemy's `JSONB` type allows Python dict/list access with full query support.
+
+**Negative:**
+- JSONB fields bypass ORM-level validation — the schema for `details`, `config`, `references` etc. is only enforced by application code (Pydantic schemas on input), not by the database.
+- Complex queries mixing relational joins with JSONB containment can be harder to optimize and debug.
+- No GIN indexes are currently defined in migrations for JSONB columns, meaning array containment queries may perform full scans on large datasets.
+- JSONB fields in audit logs make structured querying across action types difficult (e.g., "find all audit entries where details.old_state = 'draft'").
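
The missing GIN indexes noted above could be added in a future migration. The helper below is a sketch that only builds the DDL string; the table and column names passed to it are assumptions (the ADR names entities, not tables), and the index naming scheme is hypothetical.

```python
import re

# Accept only plain lowercase identifiers so the DDL cannot be injected into.
_IDENT = re.compile(r"^[a-z_][a-z0-9_]*$")


def gin_index_ddl(table: str, column: str) -> str:
    """Return a CREATE INDEX statement for a GIN index on one JSONB column."""
    for name in (table, column):
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    index_name = f"ix_{table}_{column}_gin"  # hypothetical naming convention
    return (
        f"CREATE INDEX IF NOT EXISTS {index_name} "
        f"ON {table} USING gin ({column})"
    )
```

Inside an Alembic migration the result could be passed to `op.execute(...)`, for example `op.execute(gin_index_ddl("threat_actors", "target_sectors"))`, assuming those are the real table and column names.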
+
+**Risks:**
+- As JSONB usage grows, the boundary between "should be a column" and "should be JSONB" can blur. Currently well-contained to arrays and metadata fields.
+
+### Alternatives Considered
+
+| Alternative | Reason Rejected |
+|------------|-----------------|
+| **PostgreSQL without JSONB** | Would require separate junction tables for every array field (technique_platforms, actor_aliases, actor_sectors, etc.), adding 10+ tables for data that is always read as a whole array. |
+| **MongoDB** | The core domain is deeply relational (techniques ↔ tests ↔ campaigns ↔ threat actors). Modeling this in MongoDB would require denormalization, embedded documents, or manually maintained reference integrity, giving up relational integrity in exchange for document flexibility that JSONB already provides. |
+| **PostgreSQL + MongoDB (dual)** | Operational complexity of two database systems is unjustified for the current JSONB usage (~12 columns across 6 tables). |
+| **MySQL 8 with JSON** | PostgreSQL's JSONB columns can be indexed directly with GIN, which makes containment queries (`@>`) efficient. MySQL's JSON type can only be indexed indirectly, through generated columns, functional indexes, or multi-valued indexes. PostgreSQL also has superior support for UUID primary keys (native type vs BINARY(16)). |
+
+---
+
+## ADR-003: MinIO for Evidence Storage
+
+**Date:** Project inception  
+**Status:** Accepted
+
+### Context
+
+The Red/Blue team validation workflow requires both teams to upload evidence files (screenshots, log files, PCAPs, documents) to support their test findings. Requirements:
+
+- Files range from small screenshots (KB) to large PCAPs (hundreds of MB).
+- Files must be associated with specific tests and teams (red/blue).
+- Files must be downloadable by authorized users via the browser.
+- Storage must be independent from the application database (no BLOBs in PostgreSQL).
+- The platform is deployed on-premise via Docker Compose — cloud-native S3 is not available.
+- The upload/download API must be simple and well-supported in Python.
+
+### Decision
+
+We chose **MinIO** as an S3-compatible object storage system, accessed via **boto3** (AWS S3 SDK for Python).
+
+Implementation details:
+- A single `evidence` bucket is auto-created on backend startup (`ensure_bucket_exists()`).
+- Files are uploaded with `put_object()` using a generated UUID-based key.
+- Downloads use presigned URLs (`generate_presigned_url()`) with 1-hour expiration.
+- The MinIO client is a module-level singleton in `storage.py`.
+- Evidence metadata (filename, MIME type, size, team, test association) is stored in PostgreSQL; only the binary content lives in MinIO.
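
As an illustration of the UUID-based key mentioned above, the sketch below shows one way such a key could be generated. The `test_id/team/uuid.ext` layout and the function name are assumptions; the ADR does not specify the actual key format.

```python
import uuid
from pathlib import PurePosixPath


def evidence_object_key(test_id: str, team: str, filename: str) -> str:
    """Build an object key that never embeds the user-supplied filename."""
    if team not in ("red", "blue"):
        raise ValueError(f"unknown team: {team!r}")
    # Keep only a simple extension; the original filename is assumed to be
    # stored with the rest of the evidence metadata in PostgreSQL.
    suffix = PurePosixPath(filename).suffix.lower()
    if not suffix[1:].isalnum():
        suffix = ""
    return f"{test_id}/{team}/{uuid.uuid4().hex}{suffix}"
```

The resulting string would be passed as the `Key` argument of boto3's `put_object()`. Because the key never contains the original filename, path-like or otherwise hostile filenames cannot influence where the object is stored.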
+
+### Consequences
+
+**Positive:**
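+- The S3-compatible API means a migration to AWS S3, GCS (via its S3-compatible endpoint), or any other S3-compatible service should need only endpoint and credential configuration changes, not application code changes.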
+- boto3 is the most mature and well-documented S3 client library in Python.
+- Presigned URLs offload download bandwidth from the backend — the browser fetches directly from MinIO.
+- Binary data stays out of PostgreSQL, keeping the database lean and backups fast.
+- MinIO runs as a single Docker container with a persistent volume — simple to deploy and back up.
+- MinIO Console (port 9001) provides a web UI for administrators to inspect stored files.
+
+**Negative:**
+- Presigned URLs currently point to `minio:9000` (Docker internal hostname), which is not accessible from the browser in production without additional Nginx configuration or a public MinIO endpoint.
+- No file virus scanning or content validation before storage.
+- No lifecycle policies configured (no automatic deletion of old evidence).
+- The module-level singleton client means the MinIO connection configuration cannot be changed at runtime (acceptable for the current deployment model).
+
+**Risks:**
+- If the MinIO volume is lost or corrupted and has not been backed up, all evidence files are permanently lost. Evidence metadata in PostgreSQL would then reference non-existent files.
+
+### Alternatives Considered
+
+| Alternative | Reason Rejected |
+|------------|-----------------|
+| **PostgreSQL BYTEA/BLOB** | Storing binary files in the database bloats backups, degrades query performance, and makes streaming large files complex. PostgreSQL is not designed as a file store. |
+| **Local filesystem** | Not portable across container restarts without host volume mounts. No presigned URL support, requiring the backend to proxy all downloads. No built-in replication or management UI. |
+| **AWS S3** | Requires cloud account and internet connectivity. The platform is designed for on-premise deployment where external cloud services may not be permitted. |
+| **SeaweedFS** | Less mature ecosystem, smaller community. The S3-compatible layer is less complete than MinIO's. boto3 compatibility is not guaranteed. |
+
+---
+
+## ADR-004: Docker Compose for Deployment
+
+**Date:** Project inception  
+**Status:** Accepted
+
+### Context
+
+Aegis is a multi-component platform deployed on-premise within organizations' security environments:
+
+- 4 services: Frontend (Nginx), Backend (Uvicorn), PostgreSQL, MinIO.
+- Target environments range from a single server to small clusters.
+- Security teams typically have Docker available but may not have Kubernetes.
+- The platform must be installable by a security engineer (not necessarily a DevOps specialist).
+- Both development and production environments should use the same orchestration approach for consistency.
+
+### Decision
+
+We chose **Docker Compose** as the deployment and orchestration tool, with two compose files:
+
+- `docker-compose.yml` — Development: source volumes mounted, dev servers, exposed ports.
+- `docker-compose.prod.yml` — Production: multi-stage builds, Nginx serving static assets, only frontend port exposed, `SECRET_KEY` required.
+
+Supporting infrastructure:
+- `scripts/install.sh` — Interactive production installer that generates secrets, prompts for configuration, writes `.env`, and runs `docker compose up -d --build`.
+- `scripts/init.sh` — Development setup that waits for services, runs migrations, and seeds data.
+- All services are connected via an `aegis-network` bridge network.
+- Named volumes for PostgreSQL and MinIO data persistence.
+- Health checks on PostgreSQL (`pg_isready`) and backend (`/health`).
+- Service dependency ordering: backend waits for `postgres: service_healthy` and `minio: service_started`.
+
+### Consequences
+
+**Positive:**
+- Single-command deployment: `docker compose -f docker-compose.prod.yml up -d --build`.
+- The `install.sh` wizard makes production setup accessible to non-DevOps personnel.
+- Consistent environments between development and production (same containers, same network topology).
+- Named volumes survive container rebuilds — data persists across upgrades.
+- No external dependencies beyond Docker and Docker Compose.
+- Multi-stage Dockerfile for frontend produces a minimal Nginx image (~25MB) from a full Node.js build stage.
+- Non-root user (`appuser`, UID 1001) in backend Dockerfile follows container security best practices.
+
+**Negative:**
+- No built-in horizontal scaling — running multiple backend instances requires manual Nginx upstream configuration and a shared token blacklist (currently in-memory).
+- No rolling deployments: `docker compose up -d --build` causes brief downtime while containers are recreated from the rebuilt images.
+- No built-in secrets management — secrets are in `.env` files on the host filesystem.
+- No container orchestration beyond restart policies (`restart: always`).
+- No centralized logging — each container logs to its own stdout/stderr.
+
+**Risks:**
+- Single point of failure: if the host machine goes down, all services go down.
+- No automated backup strategy — `pg_dump` is documented but not automated.
+
+### Alternatives Considered
+
+| Alternative | Reason Rejected |
+|------------|-----------------|
+| **Kubernetes (k8s)** | Significantly higher operational complexity. Requires a cluster, kubectl expertise, Helm charts or manifests, ingress controllers, PVCs. Overkill for a single-server deployment targeting security teams. |
+| **Docker Swarm** | Adds orchestration complexity with minimal benefit over Compose for < 5 services. The project does not need multi-node scheduling or service mesh. Swarm's future is uncertain compared to Compose V2. |
+| **Bare metal / systemd** | Loses containerization benefits (isolation, reproducibility, dependency management). Would require manual installation of Python, Node.js, PostgreSQL, MinIO on each target system. |
+| **Ansible + Docker** | Adds a configuration management layer that is unnecessary for a 4-service application. Could be valuable in the future for multi-server deployments but is premature now. |
+
+---
+
+## ADR-005: Modular Monolith over Microservices
+
+**Date:** Project inception  
+**Status:** Accepted
+
+### Context
+
+Aegis has distinct functional domains that could theoretically be separate services:
+- **Test Workflow** — Red/Blue validation state machine, evidence management.
+- **Coverage Analytics** — Scoring engine, heatmaps, metrics, reports.
+- **Data Import** — 8 external source integrations (MITRE, Sigma, Elastic, CALDERA, etc.).
+- **Campaign Management** — Campaign lifecycle, scheduling, threat actor generation.
+- **Compliance** — Framework mappings, gap analysis, control tracking.
+- **User/Auth** — Authentication, RBAC, audit logging.
+
+However:
+- These domains share the same database and have tight data dependencies (e.g., scoring reads tests, techniques, detection rules, and D3FEND mappings in a single calculation).
+- The development team is small.
+- The deployment target is single-server Docker Compose.
+- Latency between services would complicate the scoring engine (which aggregates across 5+ tables).
+
+### Decision
+
+We chose a **modular monolith** architecture: a single deployable backend process organized into internal modules (routers, services, models) rather than separate microservices.
+
+Module boundaries:
+- **Routers** (21 files) — HTTP endpoint definitions grouped by domain.
+- **Services** (20 files) — Business logic grouped by capability (workflow, scoring, notifications, imports).
+- **Models** (18 files) — ORM entities grouped by domain concept.
+- **Schemas** (10 files) — Pydantic DTOs grouped by domain concept.
+
+All modules share a single database, a single process, and a single deployment artifact.
+
+### Consequences
+
+**Positive:**
+- No network overhead between domains — scoring can join 5+ tables in a single SQL query.
+- Single deployment artifact simplifies CI/CD, monitoring, and debugging.
+- Shared database means ACID transactions across domains (e.g., creating a test + logging the audit entry + sending a notification in one commit).
+- No service discovery, API gateways, circuit breakers, or distributed tracing needed.
+- Faster development iteration — change any module, rebuild one container.
+
+**Negative:**
+- All domains scale together — cannot scale the data import workers independently from the API.
+- A bug in one module (e.g., a memory leak in scoring) can crash the entire application.
+- Module boundaries are not enforced at the language level — routers currently import services and models freely across domains (e.g., `heatmap.py` imports 6 models from different domains).
+- The monolith has grown to 21 routers and 20 services without explicit boundary enforcement, leading to "fat controllers" and cross-cutting concerns.
+
+**Risks:**
+- Without explicit module boundaries (enforced by code structure or linting rules), the modular monolith can degrade into a traditional monolith where everything depends on everything.
+- The Clean Architecture refactor proposed in `ARCHITECTURAL_ANALYSIS.md` would restore module boundaries via the domain/application/infrastructure/presentation layers.
+
+### Alternatives Considered
+
+| Alternative | Reason Rejected |
+|------------|-----------------|
+| **Microservices** | The 8 data source integrations would each become a service, requiring inter-service communication for writing to the shared technique/rule tables. Scoring would need to call 3-4 services to gather data, adding latency and failure modes. Operational overhead (8+ containers, service mesh, distributed tracing) unjustified for a small team and single-server deployment. |
+| **Microservices with shared DB** | Anti-pattern. Multiple services sharing a database lose the main benefit of microservices (independent deployment and schema evolution) while keeping the operational complexity. |
+| **Modular monolith with enforced boundaries** | This is the recommended evolution (see ADR analysis). The current implementation has module structure but no boundary enforcement. Adding domain-layer interfaces (Protocol/ABC), a repository pattern, and import linting rules would achieve this without a microservices migration. |
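
The boundary enforcement described in the last row (a `Protocol` interface plus a repository) could look roughly like the sketch below. All class, field, and method names are hypothetical and do not describe the existing code; the in-memory repository stands in for a SQLAlchemy-backed one so the example runs without a database.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass(frozen=True)
class DetectionRule:
    """Hypothetical domain object, independent of the ORM model."""
    id: int
    title: str
    platforms: tuple[str, ...]


class DetectionRuleRepository(Protocol):
    """Port the service depends on; the ORM adapter would implement it."""

    def list_rules(self) -> Sequence[DetectionRule]: ...


class DetectionRuleService:
    """Business logic lives here; routers only translate HTTP to calls."""

    def __init__(self, repo: DetectionRuleRepository) -> None:
        self._repo = repo

    def rules_for_platform(self, platform: str) -> list[DetectionRule]:
        wanted = platform.lower()
        return [
            rule
            for rule in self._repo.list_rules()
            if any(p.lower() == wanted for p in rule.platforms)
        ]


class InMemoryRuleRepository:
    """Test double; a real adapter would run the query in SQL instead."""

    def __init__(self, rules: Sequence[DetectionRule]) -> None:
        self._rules = list(rules)

    def list_rules(self) -> Sequence[DetectionRule]:
        return list(self._rules)
```

Filtering in Python keeps the sketch short; a production repository would more likely push the platform filter into the query. The point is only that the service depends on the `Protocol`, not on `Session` or ORM models, which is a rule an import linter can then check.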
+
+---
+
+## ADR-006: APScheduler In-Process over External Job System
+
+**Date:** Project inception  
+**Status:** Accepted
+
+### Context
+
+Aegis requires periodic background tasks:
+
+| Task | Frequency | Duration | Description |
+|------|-----------|----------|-------------|
+| MITRE ATT&CK sync | Every 24 hours | 30-120 seconds | Download STIX/TAXII feed, upsert ~700 techniques |
+| Intel scan | Every 7 days | 10-60 seconds | Scan threat intelligence sources |
+| Notification cleanup | Every 24 hours | < 5 seconds | Delete read notifications older than 90 days |
+| Coverage snapshot | Weekly (Sunday 00:00) | 5-30 seconds | Capture point-in-time coverage state across all techniques |
+| Recurring campaigns | Every 24 hours | < 10 seconds | Check and spawn due recurring test campaigns |
+
+Requirements:
+- Jobs must access the same database as the API.
+- Jobs must not block API request handling.
+- No additional infrastructure should be required beyond what Docker Compose already provides.
+- Job failure should not crash the API server.
+- Jobs do not need distributed execution (single-server deployment).
+
+### Decision
+
+We chose **APScheduler** (`BackgroundScheduler`) running as an in-process thread within the FastAPI application.
+
+Implementation details:
+- The scheduler is started in FastAPI's `lifespan` handler at application startup and shut down on application exit.
+- Each job function creates its own `SessionLocal()` instance, independent from request-scoped sessions.
+- All jobs use try/except/finally to ensure sessions are closed even on failure.
+- Jobs are registered with `replace_existing=True` to handle server restarts cleanly.
+- The scheduler is a module-level singleton in `jobs/mitre_sync_job.py`.
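
The per-job session handling described above can be sketched as a decorator. This is an illustration rather than the project's code: `with_job_session`, the logger name, and `FakeSession` (a stand-in for `SessionLocal()` so the example runs without a database) are all assumptions.

```python
import logging
from functools import wraps
from typing import Any, Callable

logger = logging.getLogger("aegis.jobs")  # hypothetical logger name


class FakeSession:
    """Stand-in exposing the commit/rollback/close calls a Session has."""

    def __init__(self) -> None:
        self.events: list[str] = []

    def commit(self) -> None:
        self.events.append("commit")

    def rollback(self) -> None:
        self.events.append("rollback")

    def close(self) -> None:
        self.events.append("close")


def with_job_session(session_factory: Callable[[], Any]):
    """Give each job run its own session and never let errors escape."""

    def decorator(job: Callable[[Any], None]) -> Callable[[], bool]:
        @wraps(job)
        def runner() -> bool:
            session = session_factory()
            try:
                job(session)
                session.commit()
                return True
            except Exception:
                # Log and swallow so a failed job cannot take down the API.
                session.rollback()
                logger.exception("job %s failed", job.__name__)
                return False
            finally:
                # Runs on success and failure alike.
                session.close()

        return runner

    return decorator
```

The wrapped `runner` takes no arguments, so it could be handed directly to `scheduler.add_job(...)`; its boolean return value is only a convenience for tests.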
+
+### Consequences
+
+**Positive:**
+- Zero additional infrastructure — no message broker, no worker containers, no job database.
+- Jobs share the same Python process, so they can import services directly (`sync_mitre`, `scan_intel`, `create_snapshot`, etc.) without serialization or RPC.
+- Simple debugging — job logs appear in the same stdout as API logs.
+- Session isolation per job prevents interference with request-scoped transactions.
+- `replace_existing=True` prevents duplicate job registrations on hot reload.
+
+**Negative:**
+- **No persistence:** If the server crashes mid-job, the job state is lost. There is no retry mechanism — the job simply runs again at the next scheduled interval.
+- **No distributed execution:** Cannot run jobs on a separate worker node. If the API is under heavy load, jobs compete for the same CPU and memory.
+- **No dead letter queue:** Failed jobs are logged but not queued for retry. A failed MITRE sync silently waits 24 hours before trying again.
+- **No job history:** There is no record of when jobs last ran, how long they took, or whether they succeeded — only log lines.
+- **Single-instance constraint:** If multiple backend instances are running (horizontal scaling), each instance runs its own scheduler, causing duplicate job execution (double MITRE sync, double snapshots, etc.).
+- **No manual trigger via scheduler:** Admin-triggered syncs go through the API endpoints (`/api/v1/system/*`), bypassing the scheduler entirely. There are effectively two paths to the same operations.
+
+**Risks:**
+- The single-instance constraint is the most significant risk. If Aegis scales horizontally, APScheduler must be replaced or augmented with a distributed lock (e.g., PostgreSQL advisory locks or Redis-based locking).
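
A PostgreSQL advisory lock, as mentioned above, could guard each job roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, and `cursor` is assumed to be a DB-API cursor using the `%s` parameter style (as psycopg2 does) on a connection that stays open for the duration of the job.

```python
import hashlib
import struct
from typing import Callable


def advisory_lock_key(job_id: str) -> int:
    """Map a job id to a stable signed 64-bit key for advisory locks."""
    digest = hashlib.sha256(job_id.encode("utf-8")).digest()
    return struct.unpack(">q", digest[:8])[0]


def run_if_lock_acquired(cursor, job_id: str, job: Callable[[], None]) -> bool:
    """Run `job` only if this process wins the advisory lock for `job_id`."""
    key = advisory_lock_key(job_id)
    cursor.execute("SELECT pg_try_advisory_lock(%s)", (key,))
    if not cursor.fetchone()[0]:
        return False  # another instance currently holds the lock
    try:
        job()
        return True
    finally:
        # Session-level locks persist until released or the session ends.
        cursor.execute("SELECT pg_advisory_unlock(%s)", (key,))
```

Note that this only prevents overlapping runs. If a second instance fires after the first has already finished and released the lock, the job would still run twice, so a "last successful run" record would also be needed to suppress sequential duplicates.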
+
+### Alternatives Considered
+
+| Alternative | Reason Rejected |
+|------------|-----------------|
+| **Celery + Redis/RabbitMQ** | Requires an additional broker container (Redis or RabbitMQ), a separate worker process, and Celery configuration. Significant operational overhead for 5 periodic tasks that each run for < 2 minutes. Would be justified if job volume grows or horizontal scaling is needed. |
+| **Dramatiq + Redis** | Similar to Celery but lighter. Still requires a Redis container and a separate worker process. Same operational overhead concern. |
+| **Cron jobs (host-level)** | Would require the host to have cron configured and scripts that call API endpoints or run Python commands inside the container. Breaks the "single Docker Compose" deployment model. Not portable. |
+| **PostgreSQL `pg_cron`** | Runs inside the database, limited to SQL operations. Cannot execute Python logic (downloading ZIPs, parsing YAML, upserting with business rules). Would require stored procedures or external triggers. |
+| **Kubernetes CronJobs** | Requires Kubernetes. Not applicable to the Docker Compose deployment model (see ADR-004). |
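+| **APScheduler with JobStore (PostgreSQL)** | APScheduler supports persistent job stores (for example `SQLAlchemyJobStore` backed by PostgreSQL), so job definitions and next-run times survive restarts. In APScheduler 3.x a job store does not coordinate multiple scheduler processes, so on its own it does not solve the single-instance problem; it would need to be paired with a distributed lock (e.g., PostgreSQL advisory locks). This is still a viable evolution path: same library, minimal code change. **Recommended as the first upgrade when job persistence is needed, combined with a lock when horizontal scaling is needed.** |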
+
+---
+
+## ADR Evolution Path
+
+The following table summarizes when each decision should be revisited:
+
+| ADR | Revisit When | Likely Evolution |
+|-----|-------------|-----------------|
+| ADR-001 (FastAPI) | Stable — no change needed | Add structured logging, OpenTelemetry tracing |
+| ADR-002 (PostgreSQL + JSONB) | JSONB query performance degrades | Add GIN indexes on JSONB columns, evaluate moving high-query fields to dedicated columns |
+| ADR-003 (MinIO) | Cloud deployment required | Point the boto3 client at AWS S3 / GCS via endpoint and credential configuration (no application code change expected) |
+| ADR-004 (Docker Compose) | Multi-server deployment needed | Migrate to Kubernetes with Helm charts, or add Ansible playbooks |
+| ADR-005 (Modular Monolith) | Team grows > 5 developers, or domains need independent scaling | Enforce boundaries first (Clean Architecture refactor), then extract high-traffic domains as services if needed |
+| ADR-006 (APScheduler) | Horizontal scaling required, or jobs need retry/history | Add an APScheduler persistent job store plus a distributed lock (e.g., PostgreSQL advisory locks) first; migrate to Celery if job complexity grows significantly |