refactor(detection-rules): extract query/business logic to detection_rule_service, router is thin HTTP adapter

2026-02-19 17:39:31 +01:00
parent d305db8794
commit 560fc0c9f0
7 changed files with 5853 additions and 282 deletions

docs/ADR.md Normal file

@@ -0,0 +1,394 @@
# Aegis — Architecture Decision Records (ADR)
> **Date:** February 11, 2026
> **Status:** All decisions are **Accepted** and currently in effect.
---
## Index
| ADR | Title | Status |
|-----|-------|--------|
| [ADR-001](#adr-001-fastapi-as-backend-framework) | FastAPI as Backend Framework | Accepted |
| [ADR-002](#adr-002-postgresql-with-jsonb-as-primary-database) | PostgreSQL with JSONB as Primary Database | Accepted |
| [ADR-003](#adr-003-minio-for-evidence-storage) | MinIO for Evidence Storage | Accepted |
| [ADR-004](#adr-004-docker-compose-for-deployment) | Docker Compose for Deployment | Accepted |
| [ADR-005](#adr-005-modular-monolith-over-microservices) | Modular Monolith over Microservices | Accepted |
| [ADR-006](#adr-006-apscheduler-in-process-over-external-job-system) | APScheduler In-Process over External Job System | Accepted |
---
## ADR-001: FastAPI as Backend Framework
**Date:** Project inception
**Status:** Accepted
### Context
Aegis is an internal security platform for managing MITRE ATT&CK coverage through Red/Blue team validation workflows. The backend must:
- Expose a REST API consumed by a React SPA (21 pages, 80+ endpoints).
- Handle CRUD operations for 18+ domain entities with complex filtering and joins.
- Support file uploads (evidence) and streaming downloads (CSV/JSON exports).
- Integrate with external APIs (MITRE TAXII 2.0, GitHub REST, D3FEND REST).
- Enforce RBAC authorization across 6 roles.
- Be developed and maintained by a small team requiring fast iteration.
- Run in a containerized environment with Python as the team's primary language.
### Decision
We chose **FastAPI** as the backend framework, served by **Uvicorn** (ASGI).
Key factors:
- **Automatic OpenAPI/Swagger** generation from type hints reduces documentation burden for 80+ endpoints.
- **Pydantic integration** provides request/response validation with zero boilerplate, critical for a schema-heavy domain (test workflows, scoring payloads, compliance data).
- **`Depends()` system** provides clean dependency injection for auth, DB sessions, and role checks without a third-party DI container.
- **Async-capable** but allows synchronous route handlers, which matters because SQLAlchemy (sync) is the ORM and all external data imports are CPU/IO-bound synchronous operations.
- **Performance** is sufficient for an internal tool (< 100 concurrent users) without needing Go/Rust-level throughput.
- **Python ecosystem** gives direct access to `taxii2-client`, `pySigma`, `boto3`, `PyYAML`, and `toml` — all required for the 8 external data source integrations.
### Consequences
**Positive:**
- Swagger UI available in development (`/docs`) for rapid API exploration and testing.
- Pydantic schemas act as living documentation for the API contract.
- `Depends()` chain for `get_db` → `get_current_user` → `require_role()` is concise and composable.
- `python-jose` + `passlib` integrate naturally for JWT/bcrypt auth.
- SlowAPI integrates directly with FastAPI for rate limiting.
**Negative:**
- The `Depends()` system encourages passing `db: Session` directly into route handlers, which has led to routers containing raw SQLAlchemy queries instead of delegating to a service/repository layer (see ADR analysis — 11 of 21 routers query the DB directly).
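- Long synchronous operations tie up the server for their full duration (MITRE sync ZIP downloads can take 30+ seconds): in a plain `def` handler they occupy a threadpool worker, and the same call inside an `async def` handler would block the event loop. This is mitigated by an Nginx proxy timeout of 300s.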
- No built-in background task system beyond `BackgroundTasks` (which is request-scoped), requiring APScheduler for scheduled jobs (see ADR-006).
**Risks:**
- FastAPI's ease of putting logic in route handlers has contributed to "fat controllers" — this is a developer discipline issue, not a framework limitation.
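
The split between a thin HTTP adapter and a service layer can be sketched without any framework. This is a minimal illustration, not Aegis code: `DetectionRule`, `RuleRepository`, `InMemoryRuleRepository`, `DetectionRuleService`, and `list_rules_endpoint` are hypothetical names, and the in-memory repository stands in for a SQLAlchemy-backed one.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class DetectionRule:
    # Hypothetical, simplified entity; a real model would carry more fields.
    id: int
    name: str
    platforms: tuple


class RuleRepository(Protocol):
    # Anything that can return all rules, e.g. a database-backed class.
    def all(self) -> list: ...


class InMemoryRuleRepository:
    # Stand-in for a database-backed repository, used only for illustration.
    def __init__(self, rules: list) -> None:
        self._rules = list(rules)

    def all(self) -> list:
        return list(self._rules)


class DetectionRuleService:
    # Owns filtering and business rules; knows nothing about HTTP.
    def __init__(self, repo: RuleRepository) -> None:
        self._repo = repo

    def list_rules(self, platform: Optional[str] = None) -> list:
        rules = self._repo.all()
        if platform is None:
            return rules
        wanted = platform.lower()
        # Case-insensitive match against each rule's platform list.
        return [r for r in rules if wanted in (p.lower() for p in r.platforms)]


def list_rules_endpoint(service: DetectionRuleService, platform: Optional[str] = None) -> list:
    # Thin adapter: pass request parameters to the service and serialize the result.
    return [
        {"id": r.id, "name": r.name, "platforms": list(r.platforms)}
        for r in service.list_rules(platform)
    ]
```

In a FastAPI router the service would typically be supplied through `Depends()`, so the route handler never touches `db: Session` itself and the service can be unit-tested with a fake repository.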
### Alternatives Considered
| Alternative | Reason Rejected |
|------------|-----------------|
| **Django + DRF** | Heavier ORM opinions, admin panel unnecessary, slower startup. Django's ORM lacks SQLAlchemy's flexibility with JSONB and complex joins. |
| **Flask + Flask-RESTful** | No built-in validation, no auto-generated OpenAPI, manual Swagger setup. Would require marshmallow or similar for schema validation. |
| **Go (Gin/Echo)** | Team's primary expertise is Python. The 8 data source integrations rely heavily on Python libraries (pySigma, taxii2-client, PyYAML). |
| **NestJS (Node.js)** | Would split the team across two runtimes. Python libraries for STIX/TAXII and Sigma rule parsing have no mature Node.js equivalents. |
---
## ADR-002: PostgreSQL with JSONB as Primary Database
**Date:** Project inception
**Status:** Accepted
### Context
Aegis manages a complex relational domain: techniques have tests, tests belong to campaigns, threat actors map to techniques, compliance controls map to techniques, detection rules map to techniques and tests. This is a deeply relational model with 18+ tables and many-to-many relationships.
However, several entities also carry semi-structured data that varies by source:
- **Audit logs** — `details` field contains arbitrary action metadata (different structure per action type).
- **Threat actors** — `aliases`, `target_sectors`, `target_regions`, `references` are variable-length arrays/objects from STIX 2.0 bundles.
- **Detection rules** — `platforms` (array), `log_sources` (object with varying keys like `product`, `service`, `category`).
- **Data sources** — `last_sync_stats` (object with import-specific counters), `config` (source-specific configuration).
- **Techniques** — `platforms` (array of OS names from ATT&CK).
- **Campaigns** — `tags` (user-defined array).
This data is imported from external sources with varying schemas (STIX JSON, Sigma YAML, Elastic TOML) and must be stored without rigid column definitions.
### Decision
We chose **PostgreSQL 15** as the primary database, using its native **JSONB** column type for semi-structured fields alongside traditional relational columns for the core domain.
The schema is managed by **Alembic** (18 migration versions) with **SQLAlchemy** ORM using `sqlalchemy.dialects.postgresql.JSONB`.
### Consequences
**Positive:**
- Relational integrity enforced with foreign keys for the core domain (test → technique, campaign → test, evidence → test, etc.).
- JSONB columns store variable-structure data without schema migrations when external sources change their format.
- JSONB supports GIN indexing for efficient containment queries (`@>` operator) on arrays like `platforms` and `target_sectors`.
- Single database to operate — no need for a separate document store.
- PostgreSQL's mature ecosystem: `pg_dump` for backups, `pg_isready` for health checks, extensive monitoring tooling.
- SQLAlchemy's `JSONB` type allows Python dict/list access with full query support.
**Negative:**
- JSONB fields bypass ORM-level validation — the schema for `details`, `config`, `references` etc. is only enforced by application code (Pydantic schemas on input), not by the database.
- Complex queries mixing relational joins with JSONB containment can be harder to optimize and debug.
- No GIN indexes are currently defined in migrations for JSONB columns, meaning array containment queries may perform full scans on large datasets.
- JSONB fields in audit logs make structured querying across action types difficult (e.g., "find all audit entries where details.old_state = 'draft'").
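
The missing GIN indexes could be added with plain DDL. The helper below is a sketch under stated assumptions: the table names `techniques` and `threat_actors` and the index naming scheme are hypothetical, while the column names come from the Context section above. It only builds SQL strings; it does not connect to a database.

```python
import re

# Accept only simple lowercase identifiers so the f-string below cannot inject SQL.
_IDENT = re.compile(r"^[a-z_][a-z0-9_]*$")


def gin_index_ddl(table: str, column: str) -> str:
    """Return a CREATE INDEX statement adding a GIN index on a JSONB column."""
    for name in (table, column):
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return (
        f"CREATE INDEX IF NOT EXISTS ix_{table}_{column}_gin "
        f"ON {table} USING GIN ({column})"
    )


# Hypothetical table names; the columns are the JSONB arrays named in this ADR.
JSONB_ARRAY_COLUMNS = [
    ("techniques", "platforms"),
    ("threat_actors", "target_sectors"),
]
statements = [gin_index_ddl(table, column) for table, column in JSONB_ARRAY_COLUMNS]
```

In an Alembic migration each statement could be passed to `op.execute()`. With the default `jsonb_ops` operator class, such an index supports containment filters of the form `platforms @> '["Windows"]'::jsonb`.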
**Risks:**
- As JSONB usage grows, the boundary between "should be a column" and "should be JSONB" can blur. Currently well-contained to arrays and metadata fields.
### Alternatives Considered
| Alternative | Reason Rejected |
|------------|-----------------|
| **PostgreSQL without JSONB** | Would require separate junction tables for every array field (technique_platforms, actor_aliases, actor_sectors, etc.), adding 10+ tables for data that is always read as a whole array. |
| **MongoDB** | The core domain is deeply relational (techniques ↔ tests ↔ campaigns ↔ threat actors). Modeling this in MongoDB would require denormalization, embedded documents, or manual reference integrity — trading JSONB flexibility for relational integrity loss. |
| **PostgreSQL + MongoDB (dual)** | Operational complexity of two database systems is unjustified for the current JSONB usage (~12 columns across 6 tables). |
| **MySQL 8 with JSON** | PostgreSQL's JSONB columns can be GIN-indexed directly, which speeds up containment queries. MySQL's JSON type can only be indexed indirectly, through generated columns or functional and multi-valued indexes. PostgreSQL also has superior support for UUID primary keys (native type vs BINARY(16)). |
---
## ADR-003: MinIO for Evidence Storage
**Date:** Project inception
**Status:** Accepted
### Context
The Red/Blue team validation workflow requires both teams to upload evidence files (screenshots, log files, PCAPs, documents) to support their test findings. Requirements:
- Files range from small screenshots (KB) to large PCAPs (hundreds of MB).
- Files must be associated with specific tests and teams (red/blue).
- Files must be downloadable by authorized users via the browser.
- Storage must be independent from the application database (no BLOBs in PostgreSQL).
- The platform is deployed on-premise via Docker Compose — cloud-native S3 is not available.
- The upload/download API must be simple and well-supported in Python.
### Decision
We chose **MinIO** as an S3-compatible object storage system, accessed via **boto3** (AWS S3 SDK for Python).
Implementation details:
- A single `evidence` bucket is auto-created on backend startup (`ensure_bucket_exists()`).
- Files are uploaded with `put_object()` using a generated UUID-based key.
- Downloads use presigned URLs (`generate_presigned_url()`) with 1-hour expiration.
- The MinIO client is a module-level singleton in `storage.py`.
- Evidence metadata (filename, MIME type, size, team, test association) is stored in PostgreSQL; only the binary content lives in MinIO.
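
The upload and download flow above can be sketched as follows. This is an illustration, not the contents of `storage.py`: the key layout and the function names `build_evidence_key`, `upload_evidence`, and `presign_download` are hypothetical. The `client` argument is assumed to be a boto3 S3 client; only its `put_object()` and `generate_presigned_url()` methods are used.

```python
import uuid
from pathlib import PurePosixPath

BUCKET = "evidence"
URL_TTL_SECONDS = 3600  # 1-hour expiration, as described above


def build_evidence_key(test_id: str, team: str, filename: str) -> str:
    # Hypothetical key layout: <test_id>/<team>/<uuid><extension>.
    # The original filename is not part of the key; it belongs in the metadata row.
    if team not in ("red", "blue"):
        raise ValueError("team must be 'red' or 'blue'")
    ext = PurePosixPath(filename).suffix.lower()
    return f"{test_id}/{team}/{uuid.uuid4().hex}{ext}"


def upload_evidence(client, test_id: str, team: str, filename: str,
                    data: bytes, content_type: str) -> str:
    # Store the binary content in the object store and return the generated key.
    key = build_evidence_key(test_id, team, filename)
    client.put_object(Bucket=BUCKET, Key=key, Body=data, ContentType=content_type)
    return key


def presign_download(client, key: str) -> str:
    # Ask the S3-compatible client for a time-limited GET URL.
    return client.generate_presigned_url(
        "get_object",
        Params={"Bucket": BUCKET, "Key": key},
        ExpiresIn=URL_TTL_SECONDS,
    )
```

With boto3, `client` would be created once via `boto3.client("s3", endpoint_url=...)`. The host in a presigned URL is taken from that endpoint, so the endpoint used for presigning has to be one the browser can reach.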
### Consequences
**Positive:**
- S3-compatible API means a migration to AWS S3, GCS, or any other S3-compatible service should require only endpoint and credential configuration changes, not code changes.
- boto3 is the most mature and well-documented S3 client library in Python.
- Presigned URLs offload download bandwidth from the backend — the browser fetches directly from MinIO.
- Binary data stays out of PostgreSQL, keeping the database lean and backups fast.
- MinIO runs as a single Docker container with a persistent volume — simple to deploy and back up.
- MinIO Console (port 9001) provides a web UI for administrators to inspect stored files.
**Negative:**
- Presigned URLs currently point to `minio:9000` (Docker internal hostname), which is not accessible from the browser in production without additional Nginx configuration or a public MinIO endpoint.
- No file virus scanning or content validation before storage.
- No lifecycle policies configured (no automatic deletion of old evidence).
- The module-level singleton client means the MinIO connection configuration cannot be changed at runtime (acceptable for the current deployment model).
**Risks:**
- If MinIO container is lost and the volume is not backed up, all evidence files are permanently lost. Evidence metadata in PostgreSQL would reference non-existent files.
### Alternatives Considered
| Alternative | Reason Rejected |
|------------|-----------------|
| **PostgreSQL BYTEA/BLOB** | Storing binary files in the database bloats backups, degrades query performance, and makes streaming large files complex. PostgreSQL is not designed as a file store. |
| **Local filesystem** | Not portable across container restarts without host volume mounts. No presigned URL support, requiring the backend to proxy all downloads. No built-in replication or management UI. |
| **AWS S3** | Requires cloud account and internet connectivity. The platform is designed for on-premise deployment where external cloud services may not be permitted. |
| **SeaweedFS** | Less mature ecosystem, smaller community. The S3-compatible layer is less complete than MinIO's. boto3 compatibility is not guaranteed. |
---
## ADR-004: Docker Compose for Deployment
**Date:** Project inception
**Status:** Accepted
### Context
Aegis is a multi-component platform deployed on-premise within organizations' security environments:
- 4 services: Frontend (Nginx), Backend (Uvicorn), PostgreSQL, MinIO.
- Target environments range from a single server to small clusters.
- Security teams typically have Docker available but may not have Kubernetes.
- The platform must be installable by a security engineer (not necessarily a DevOps specialist).
- Both development and production environments should use the same orchestration approach for consistency.
### Decision
We chose **Docker Compose** as the deployment and orchestration tool, with two compose files:
- `docker-compose.yml` — Development: source volumes mounted, dev servers, exposed ports.
- `docker-compose.prod.yml` — Production: multi-stage builds, Nginx serving static assets, only frontend port exposed, `SECRET_KEY` required.
Supporting infrastructure:
- `scripts/install.sh` — Interactive production installer that generates secrets, prompts for configuration, writes `.env`, and runs `docker compose up -d --build`.
- `scripts/init.sh` — Development setup that waits for services, runs migrations, and seeds data.
- All services are connected via an `aegis-network` bridge network.
- Named volumes for PostgreSQL and MinIO data persistence.
- Health checks on PostgreSQL (`pg_isready`) and backend (`/health`).
- Service dependency ordering: backend waits for `postgres: service_healthy` and `minio: service_started`.
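
The "wait for services" step amounts to polling a health endpoint until it answers. The sketch below expresses that idea in Python for illustration only; it is not the contents of `scripts/init.sh`, and the function names, retry counts, and the example URL are assumptions.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    # True if the URL answers with a 2xx status; any network or HTTP error counts as "not ready".
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(probe: Callable[[], bool], attempts: int = 30,
                       delay: float = 2.0, sleep=time.sleep) -> bool:
    # Call probe() up to `attempts` times, sleeping `delay` seconds between tries.
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

A setup script could call `wait_until_healthy(lambda: http_ok("http://localhost:8000/health"))` before running migrations; the host and port here are placeholders for wherever the backend's `/health` endpoint is exposed.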
### Consequences
**Positive:**
- Single-command deployment: `docker compose -f docker-compose.prod.yml up -d --build`.
- The `install.sh` wizard makes production setup accessible to non-DevOps personnel.
- Consistent environments between development and production (same containers, same network topology).
- Named volumes survive container rebuilds — data persists across upgrades.
- No external dependencies beyond Docker and Docker Compose.
- Multi-stage Dockerfile for frontend produces a minimal Nginx image (~25MB) from a full Node.js build stage.
- Non-root user (`appuser`, UID 1001) in backend Dockerfile follows container security best practices.
**Negative:**
- No built-in horizontal scaling — running multiple backend instances requires manual Nginx upstream configuration and a shared token blacklist (currently in-memory).
- No rolling deployments — `docker compose up -d --build` causes brief downtime during image rebuilds.
- No built-in secrets management — secrets are in `.env` files on the host filesystem.
- No container orchestration beyond restart policies (`restart: always`).
- No centralized logging — each container logs to its own stdout/stderr.
**Risks:**
- Single point of failure: if the host machine goes down, all services go down.
- No automated backup strategy — `pg_dump` is documented but not automated.
### Alternatives Considered
| Alternative | Reason Rejected |
|------------|-----------------|
| **Kubernetes (k8s)** | Significantly higher operational complexity. Requires a cluster, kubectl expertise, Helm charts or manifests, ingress controllers, PVCs. Overkill for a single-server deployment targeting security teams. |
| **Docker Swarm** | Adds orchestration complexity with minimal benefit over Compose for < 5 services. The project does not need multi-node scheduling or service mesh. Swarm's future is uncertain compared to Compose V2. |
| **Bare metal / systemd** | Loses containerization benefits (isolation, reproducibility, dependency management). Would require manual installation of Python, Node.js, PostgreSQL, MinIO on each target system. |
| **Ansible + Docker** | Adds a configuration management layer that is unnecessary for a 4-service application. Could be valuable in the future for multi-server deployments but is premature now. |
---
## ADR-005: Modular Monolith over Microservices
**Date:** Project inception
**Status:** Accepted
### Context
Aegis has distinct functional domains that could theoretically be separate services:
- **Test Workflow** — Red/Blue validation state machine, evidence management.
- **Coverage Analytics** — Scoring engine, heatmaps, metrics, reports.
- **Data Import** — 8 external source integrations (MITRE, Sigma, Elastic, CALDERA, etc.).
- **Campaign Management** — Campaign lifecycle, scheduling, threat actor generation.
- **Compliance** — Framework mappings, gap analysis, control tracking.
- **User/Auth** — Authentication, RBAC, audit logging.
However:
- These domains share the same database and have tight data dependencies (e.g., scoring reads tests, techniques, detection rules, and D3FEND mappings in a single calculation).
- The development team is small.
- The deployment target is single-server Docker Compose.
- Latency between services would complicate the scoring engine (which aggregates across 5+ tables).
### Decision
We chose a **modular monolith** architecture: a single deployable backend process organized into internal modules (routers, services, models) rather than separate microservices.
Module boundaries:
- **Routers** (21 files) — HTTP endpoint definitions grouped by domain.
- **Services** (20 files) — Business logic grouped by capability (workflow, scoring, notifications, imports).
- **Models** (18 files) — ORM entities grouped by domain concept.
- **Schemas** (10 files) — Pydantic DTOs grouped by domain concept.
All modules share a single database, a single process, and a single deployment artifact.
### Consequences
**Positive:**
- No network overhead between domains — scoring can join 5+ tables in a single SQL query.
- Single deployment artifact simplifies CI/CD, monitoring, and debugging.
- Shared database means ACID transactions across domains (e.g., creating a test + logging the audit entry + sending a notification in one commit).
- No service discovery, API gateways, circuit breakers, or distributed tracing needed.
- Faster development iteration — change any module, rebuild one container.
**Negative:**
- All domains scale together — cannot scale the data import workers independently from the API.
- A bug in one module (e.g., a memory leak in scoring) can crash the entire application.
- Module boundaries are not enforced at the language level — routers currently import services and models freely across domains (e.g., `heatmap.py` imports 6 models from different domains).
- The monolith has grown to 21 routers and 20 services without explicit boundary enforcement, leading to "fat controllers" and cross-cutting concerns.
**Risks:**
- Without explicit module boundaries (enforced by code structure or linting rules), the modular monolith can degrade into a traditional monolith where everything depends on everything.
- The Clean Architecture refactor proposed in `ARCHITECTURAL_ANALYSIS.md` would restore module boundaries via the domain/application/infrastructure/presentation layers.
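
One lightweight way to enforce boundaries with linting is an import check run in CI. The sketch below uses only the standard library `ast` module. The rule it encodes (router modules must not import ORM models) and the package name `app.models` are assumptions for illustration, not the project's actual layout.

```python
import ast

# Hypothetical rule: router modules may import services and schemas, not ORM models.
FORBIDDEN_PREFIXES = ("app.models",)


def forbidden_imports(source: str, forbidden=FORBIDDEN_PREFIXES) -> list:
    """Return the forbidden module names imported by the given source code."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Relative imports (level > 0) are skipped in this simplified check.
            names = [node.module]
        else:
            continue
        for name in names:
            # Match the package itself or any submodule, but not lookalike names.
            if any(name == p or name.startswith(p + ".") for p in forbidden):
                found.append(name)
    return found
```

Running this over every file in the routers directory and failing the build on a non-empty result would turn the convention into a checked rule. Dedicated tools exist for the same purpose, so this is only meant to show the mechanism.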
### Alternatives Considered
| Alternative | Reason Rejected |
|------------|-----------------|
| **Microservices** | The 8 data source integrations would each become a service, requiring inter-service communication for writing to the shared technique/rule tables. Scoring would need to call 3-4 services to gather data, adding latency and failure modes. Operational overhead (8+ containers, service mesh, distributed tracing) unjustified for a small team and single-server deployment. |
| **Microservices with shared DB** | Anti-pattern. Multiple services sharing a database lose the main benefit of microservices (independent deployment and schema evolution) while keeping the operational complexity. |
| **Modular monolith with enforced boundaries** | This is the recommended evolution (see ADR analysis). The current implementation has module structure but no boundary enforcement. Adding domain-layer interfaces (Protocol/ABC), a repository pattern, and import linting rules would achieve this without a microservices migration. |
---
## ADR-006: APScheduler In-Process over External Job System
**Date:** Project inception
**Status:** Accepted
### Context
Aegis requires periodic background tasks:
| Task | Frequency | Duration | Description |
|------|-----------|----------|-------------|
| MITRE ATT&CK sync | Every 24 hours | 30-120 seconds | Download STIX/TAXII feed, upsert ~700 techniques |
| Intel scan | Every 7 days | 10-60 seconds | Scan threat intelligence sources |
| Notification cleanup | Every 24 hours | < 5 seconds | Delete read notifications older than 90 days |
| Coverage snapshot | Weekly (Sunday 00:00) | 5-30 seconds | Capture point-in-time coverage state across all techniques |
| Recurring campaigns | Every 24 hours | < 10 seconds | Check and spawn due recurring test campaigns |
Requirements:
- Jobs must access the same database as the API.
- Jobs must not block API request handling.
- No additional infrastructure should be required beyond what Docker Compose already provides.
- Job failure should not crash the API server.
- Jobs do not need distributed execution (single-server deployment).
### Decision
We chose **APScheduler** (`BackgroundScheduler`) running as an in-process thread within the FastAPI application.
Implementation details:
- The scheduler is started in FastAPI's `lifespan` handler at application startup and shut down on application exit.
- Each job function creates its own `SessionLocal()` instance, independent from request-scoped sessions.
- All jobs use try/except/finally to ensure sessions are closed even on failure.
- Jobs are registered with `replace_existing=True` to handle server restarts cleanly.
- The scheduler is a module-level singleton in `jobs/mitre_sync_job.py`.
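
The per-job session handling described above can be sketched as a small wrapper. This is an illustration under assumptions: `run_job` is a hypothetical helper, and `session_factory` stands in for `SessionLocal`. The session object only needs `commit()`, `rollback()`, and `close()`.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("aegis.jobs")  # hypothetical logger name


def run_job(name: str, session_factory: Callable[[], Any],
            work: Callable[[Any], None]) -> bool:
    """Run one scheduled job with its own session; never raise to the scheduler."""
    session = session_factory()
    try:
        work(session)
        session.commit()
        return True
    except Exception:
        # Roll back partial writes and log the traceback instead of propagating it.
        session.rollback()
        logger.exception("scheduled job %s failed", name)
        return False
    finally:
        # Always release the connection, on success and on failure.
        session.close()
```

Each scheduled function would then reduce to a call such as `run_job("mitre_sync", SessionLocal, sync_mitre)`, assuming the service function accepts a session as its only argument, and would be registered with `scheduler.add_job(..., id="mitre_sync", replace_existing=True)`.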
### Consequences
**Positive:**
- Zero additional infrastructure — no message broker, no worker containers, no job database.
- Jobs share the same Python process, so they can import services directly (`sync_mitre`, `scan_intel`, `create_snapshot`, etc.) without serialization or RPC.
- Simple debugging — job logs appear in the same stdout as API logs.
- Session isolation per job prevents interference with request-scoped transactions.
- `replace_existing=True` prevents duplicate job registrations on hot reload.
**Negative:**
- **No persistence:** If the server crashes mid-job, the job state is lost. There is no retry mechanism — the job simply runs again at the next scheduled interval.
- **No distributed execution:** Cannot run jobs on a separate worker node. If the API is under heavy load, jobs compete for the same CPU and memory.
- **No dead letter queue:** Failed jobs are logged but not queued for retry. A failed MITRE sync silently waits 24 hours before trying again.
- **No job history:** There is no record of when jobs last ran, how long they took, or whether they succeeded — only log lines.
- **Single-instance constraint:** If multiple backend instances are running (horizontal scaling), each instance runs its own scheduler, causing duplicate job execution (double MITRE sync, double snapshots, etc.).
- **No manual trigger via scheduler:** Admin-triggered syncs go through the API endpoints (`/api/v1/system/*`), bypassing the scheduler entirely. There are effectively two paths to the same operations.
**Risks:**
- The single-instance constraint is the most significant risk. If Aegis scales horizontally, APScheduler must be replaced or augmented with a distributed lock (e.g., PostgreSQL advisory locks or Redis-based locking).
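
A PostgreSQL advisory lock of the kind mentioned above could look like the following sketch. It assumes a DB-API connection that uses the `%s` parameter style (as psycopg2 does). `advisory_lock_key` and `run_if_leader` are hypothetical helper names. `pg_try_advisory_lock` and `pg_advisory_unlock` are the PostgreSQL functions for session-level advisory locks.

```python
import hashlib
import struct


def advisory_lock_key(job_name: str) -> int:
    """Map a job name to a stable signed 64-bit key, the type advisory locks expect."""
    digest = hashlib.sha256(job_name.encode("utf-8")).digest()
    return struct.unpack(">q", digest[:8])[0]


def run_if_leader(conn, job_name: str, job) -> bool:
    """Run job() only if this instance obtains the lock; return whether it ran."""
    key = advisory_lock_key(job_name)
    cur = conn.cursor()
    try:
        # Non-blocking attempt: returns true only for the first session to ask.
        cur.execute("SELECT pg_try_advisory_lock(%s)", (key,))
        if not cur.fetchone()[0]:
            return False  # another backend instance currently holds the lock
        try:
            job()
            return True
        finally:
            # Release explicitly; session-level locks otherwise last until disconnect.
            cur.execute("SELECT pg_advisory_unlock(%s)", (key,))
    finally:
        cur.close()
```

This prevents overlapping runs only. If the schedulers on two instances fire at noticeably different times, the second one would acquire the lock after the first has finished, so a "last successful run" check would be needed as well.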
### Alternatives Considered
| Alternative | Reason Rejected |
|------------|-----------------|
| **Celery + Redis/RabbitMQ** | Requires an additional broker container (Redis or RabbitMQ), a separate worker process, and Celery configuration. Significant operational overhead for 5 periodic tasks that each run for < 2 minutes. Would be justified if job volume grows or horizontal scaling is needed. |
| **Dramatiq + Redis** | Similar to Celery but lighter. Still requires a Redis container and a separate worker process. Same operational overhead concern. |
| **Cron jobs (host-level)** | Would require the host to have cron configured and scripts that call API endpoints or run Python commands inside the container. Breaks the "single Docker Compose" deployment model. Not portable. |
| **PostgreSQL `pg_cron`** | Runs inside the database, limited to SQL operations. Cannot execute Python logic (downloading ZIPs, parsing YAML, upserting with business rules). Would require stored procedures or external triggers. |
| **Kubernetes CronJobs** | Requires Kubernetes. Not applicable to the Docker Compose deployment model (see ADR-004). |
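| **APScheduler with JobStore (PostgreSQL)** | APScheduler supports persistent job stores (for example `SQLAlchemyJobStore` backed by PostgreSQL), which keep job definitions and next-run times across restarts. In APScheduler 3.x a job store is not meant to be shared between several running schedulers, so on its own it would not prevent duplicate execution; it would need to be paired with a distributed lock (see Risks above). This remains a viable evolution path: same library, minimal code change. **Recommended as the first upgrade when horizontal scaling is needed.** |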
---
## ADR Evolution Path
The following table summarizes when each decision should be revisited:
| ADR | Revisit When | Likely Evolution |
|-----|-------------|-----------------|
| ADR-001 (FastAPI) | Stable — no change needed | Add structured logging, OpenTelemetry tracing |
| ADR-002 (PostgreSQL + JSONB) | JSONB query performance degrades | Add GIN indexes on JSONB columns, evaluate moving high-query fields to dedicated columns |
| ADR-003 (MinIO) | Cloud deployment required | Swap boto3 endpoint to AWS S3 / GCS (zero code change) |
| ADR-004 (Docker Compose) | Multi-server deployment needed | Migrate to Kubernetes with Helm charts, or add Ansible playbooks |
| ADR-005 (Modular Monolith) | Team grows > 5 developers, or domains need independent scaling | Enforce boundaries first (Clean Architecture refactor), then extract high-traffic domains as services if needed |
| ADR-006 (APScheduler) | Horizontal scaling required, or jobs need retry/history | Add APScheduler PostgreSQL JobStore first; migrate to Celery if job complexity grows significantly |