Kitos 560fc0c9f0 refactor(detection-rules): extract query/business logic to detection_rule_service, router is thin HTTP adapter

2026-02-19 17:39:31 +01:00

Aegis — Architecture Decision Records (ADR)

Date: February 11, 2026
Status: All decisions are Accepted and currently in effect.

Index

ADR	Title	Status
ADR-001	FastAPI as Backend Framework	Accepted
ADR-002	PostgreSQL with JSONB as Primary Database	Accepted
ADR-003	MinIO for Evidence Storage	Accepted
ADR-004	Docker Compose for Deployment	Accepted
ADR-005	Modular Monolith over Microservices	Accepted
ADR-006	APScheduler In-Process over External Job System	Accepted

ADR-001: FastAPI as Backend Framework

Date: Project inception
Status: Accepted

Context

Aegis is an internal security platform for managing MITRE ATT&CK coverage through Red/Blue team validation workflows. The backend must:

Expose a REST API consumed by a React SPA (21 pages, 80+ endpoints).
Handle CRUD operations for 18+ domain entities with complex filtering and joins.
Support file uploads (evidence) and streaming downloads (CSV/JSON exports).
Integrate with external APIs (MITRE TAXII 2.0, GitHub REST, D3FEND REST).
Enforce RBAC authorization across 6 roles.
Be developed and maintained by a small team requiring fast iteration.
Run in a containerized environment with Python as the team's primary language.

Decision

We chose FastAPI as the backend framework, served by Uvicorn (ASGI).

Key factors:

Automatic OpenAPI/Swagger generation from type hints reduces documentation burden for 80+ endpoints.
Pydantic integration provides request/response validation with zero boilerplate, critical for a schema-heavy domain (test workflows, scoring payloads, compliance data).
Depends() system provides clean dependency injection for auth, DB sessions, and role checks without a third-party DI container.
Async-capable but allows synchronous route handlers, which matters because SQLAlchemy (sync) is the ORM and all external data imports are CPU/IO-bound synchronous operations.
Performance is sufficient for an internal tool (< 100 concurrent users) without needing Go/Rust-level throughput.
Python ecosystem gives direct access to taxii2-client, pySigma, boto3, PyYAML, and toml — all required for the 8 external data source integrations.

Consequences

Positive:

Swagger UI available in development (/docs) for rapid API exploration and testing.
Pydantic schemas act as living documentation for the API contract.
Depends() chain for get_db → get_current_user → require_role() is concise and composable.
python-jose + passlib integrate naturally for JWT/bcrypt auth.
SlowAPI integrates directly with FastAPI for rate limiting.

Negative:

The Depends() system encourages passing db: Session directly into route handlers, which has led to routers containing raw SQLAlchemy queries instead of delegating to a service/repository layer (see ADR analysis — 11 of 21 routers query the DB directly).
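Long-running synchronous operations tie up a worker for their full duration (MITRE sync ZIP downloads can take 30+ seconds): FastAPI runs plain def handlers in a threadpool, so they occupy a thread rather than the event loop, while the same blocking call inside an async def handler would stall the event loop. Mitigated by an Nginx proxy timeout of 300s.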
No built-in background task system beyond BackgroundTasks (which is request-scoped), requiring APScheduler for scheduled jobs (see ADR-006).

Risks:

FastAPI's ease of putting logic in route handlers has contributed to "fat controllers" — this is a developer discipline issue, not a framework limitation.

Alternatives Considered

Alternative	Reason Rejected
Django + DRF	Heavier ORM opinions, admin panel unnecessary, slower startup. Django's ORM lacks SQLAlchemy's flexibility with JSONB and complex joins.
Flask + Flask-RESTful	No built-in validation, no auto-generated OpenAPI, manual Swagger setup. Would require marshmallow or similar for schema validation.
Go (Gin/Echo)	Team's primary expertise is Python. The 8 data source integrations rely heavily on Python libraries (pySigma, taxii2-client, PyYAML).
NestJS (Node.js)	Would split the team across two runtimes. Python libraries for STIX/TAXII and Sigma rule parsing have no mature Node.js equivalents.

ADR-002: PostgreSQL with JSONB as Primary Database

Date: Project inception
Status: Accepted

Context

Aegis manages a complex relational domain: techniques have tests, tests belong to campaigns, threat actors map to techniques, compliance controls map to techniques, detection rules map to techniques and tests. This is a deeply relational model with 18+ tables and many-to-many relationships.

However, several entities also carry semi-structured data that varies by source:

Audit logs — details field contains arbitrary action metadata (different structure per action type).
Threat actors — aliases, target_sectors, target_regions, references are variable-length arrays/objects from STIX 2.0 bundles.
Detection rules — platforms (array), log_sources (object with varying keys like product, service, category).
Data sources — last_sync_stats (object with import-specific counters), config (source-specific configuration).
Techniques — platforms (array of OS names from ATT&CK).
Campaigns — tags (user-defined array).

This data is imported from external sources with varying schemas (STIX JSON, Sigma YAML, Elastic TOML) and must be stored without rigid column definitions.

Decision

We chose PostgreSQL 15 as the primary database, using its native JSONB column type for semi-structured fields alongside traditional relational columns for the core domain.

The schema is managed by Alembic (18 migration versions) with SQLAlchemy ORM using sqlalchemy.dialects.postgresql.JSONB.

Consequences

Positive:

Relational integrity enforced with foreign keys for the core domain (test → technique, campaign → test, evidence → test, etc.).
JSONB columns store variable-structure data without schema migrations when external sources change their format.
JSONB supports GIN indexing for efficient containment queries (@> operator) on arrays like platforms and target_sectors.
Single database to operate — no need for a separate document store.
PostgreSQL's mature ecosystem: pg_dump for backups, pg_isready for health checks, extensive monitoring tooling.
SQLAlchemy's JSONB type allows Python dict/list access with full query support.
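
For the flat arrays mentioned above (platforms, target_sectors), the @> containment operator behaves like a subset test in which element order and duplicates are ignored. The snippet below is a minimal pure-Python model of that behaviour, meant for intuition only. It does not cover nested objects, and the contains helper and the sample values are illustrative, not part of the Aegis codebase.

```python
from typing import Iterable


def contains(stored: Iterable[str], wanted: Iterable[str]) -> bool:
    """Model of `stored @> wanted` for flat JSONB arrays of strings.

    True when every element of `wanted` also appears in `stored`;
    order and duplicates do not matter.
    """
    return set(wanted) <= set(stored)


# Roughly what `WHERE platforms @> '["Windows"]'` decides for one row.
platforms = ["Windows", "Linux", "macOS"]
print(contains(platforms, ["Windows"]))             # True
print(contains(platforms, ["Linux", "Windows"]))    # True (order ignored)
print(contains(platforms, ["Windows", "Android"]))  # False
```

Without a GIN index, PostgreSQL has to evaluate this test against every row, which is why the missing indexes noted under Negative matter for large tables.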

Negative:

JSONB fields bypass ORM-level validation — the schema for details, config, references etc. is only enforced by application code (Pydantic schemas on input), not by the database.
Complex queries mixing relational joins with JSONB containment can be harder to optimize and debug.
No GIN indexes are currently defined in migrations for JSONB columns, meaning array containment queries may perform full scans on large datasets.
JSONB fields in audit logs make structured querying across action types difficult (e.g., "find all audit entries where details.old_state = 'draft'").

Risks:

As JSONB usage grows, the boundary between "should be a column" and "should be JSONB" can blur. Currently, JSONB usage is well contained to arrays and metadata fields.

Alternatives Considered

Alternative	Reason Rejected
PostgreSQL without JSONB	Would require separate junction tables for every array field (technique_platforms, actor_aliases, actor_sectors, etc.), adding 10+ tables for data that is always read as a whole array.
MongoDB	The core domain is deeply relational (techniques ↔ tests ↔ campaigns ↔ threat actors). Modeling this in MongoDB would require denormalization, embedded documents, or manual reference integrity — trading JSONB flexibility for relational integrity loss.
PostgreSQL + MongoDB (dual)	Operational complexity of two database systems is unjustified for the current JSONB usage (~12 columns across 6 tables).
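MySQL 8 with JSON	PostgreSQL's JSONB can be indexed as a whole with GIN, which speeds up containment queries (@>) directly. MySQL's JSON type is also stored in a binary format, but indexing it relies on generated columns, functional indexes, or multi-valued indexes defined per path. PostgreSQL also has superior support for UUID primary keys (native type vs BINARY(16)).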

ADR-003: MinIO for Evidence Storage

Date: Project inception
Status: Accepted

Context

The Red/Blue team validation workflow requires both teams to upload evidence files (screenshots, log files, PCAPs, documents) to support their test findings. Requirements:

Files range from small screenshots (KB) to large PCAPs (hundreds of MB).
Files must be associated with specific tests and teams (red/blue).
Files must be downloadable by authorized users via the browser.
Storage must be independent from the application database (no BLOBs in PostgreSQL).
The platform is deployed on-premise via Docker Compose — cloud-native S3 is not available.
The upload/download API must be simple and well-supported in Python.

Decision

We chose MinIO as an S3-compatible object storage system, accessed via boto3 (AWS S3 SDK for Python).

Implementation details:

A single evidence bucket is auto-created on backend startup (ensure_bucket_exists()).
Files are uploaded with put_object() using a generated UUID-based key.
Downloads use presigned URLs (generate_presigned_url()) with 1-hour expiration.
The MinIO client is a module-level singleton in storage.py.
Evidence metadata (filename, MIME type, size, team, test association) is stored in PostgreSQL; only the binary content lives in MinIO.
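
The upload and download flow described above can be sketched as follows. This is a minimal illustration, not the contents of storage.py: the bucket name, the EvidenceRecord fields, and the helper names are assumptions. The s3 argument stands for a boto3 S3 client; only its put_object() and generate_presigned_url() methods are used, so the sketch itself has no third-party imports.

```python
import uuid
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Any

BUCKET = "evidence"        # assumed bucket name
URL_TTL_SECONDS = 3600     # 1-hour expiration, as stated above


@dataclass
class EvidenceRecord:
    """Metadata row kept in PostgreSQL (field names are assumptions)."""
    object_key: str
    filename: str
    mime_type: str
    size: int
    team: str
    test_id: str


def make_object_key(filename: str) -> str:
    """UUID-based key that keeps only the lowercased file extension."""
    suffix = PurePosixPath(filename).suffix.lower()
    return f"{uuid.uuid4().hex}{suffix}"


def store_evidence(s3: Any, data: bytes, filename: str, mime_type: str,
                   team: str, test_id: str) -> EvidenceRecord:
    """Upload the bytes to object storage and return the metadata to persist."""
    key = make_object_key(filename)
    s3.put_object(Bucket=BUCKET, Key=key, Body=data, ContentType=mime_type)
    return EvidenceRecord(key, filename, mime_type, len(data), team, test_id)


def download_url(s3: Any, record: EvidenceRecord) -> str:
    """Presigned GET URL for the stored object, valid for one hour."""
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": BUCKET, "Key": record.object_key},
        ExpiresIn=URL_TTL_SECONDS,
    )
```

Deriving the key from a UUID rather than the user-supplied filename avoids collisions and keeps untrusted names out of the object path; the original filename survives only as metadata.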

Consequences

Positive:

S3-compatible API means that migrating to AWS S3, GCS, or any S3-compatible service should need only configuration changes (endpoint and credentials), not code changes.
boto3 is the most mature and well-documented S3 client library in Python.
Presigned URLs offload download bandwidth from the backend — the browser fetches directly from MinIO.
Binary data stays out of PostgreSQL, keeping the database lean and backups fast.
MinIO runs as a single Docker container with a persistent volume — simple to deploy and back up.
MinIO Console (port 9001) provides a web UI for administrators to inspect stored files.

Negative:

Presigned URLs currently point to minio:9000 (Docker internal hostname), which is not accessible from the browser in production without additional Nginx configuration or a public MinIO endpoint.
No file virus scanning or content validation before storage.
No lifecycle policies configured (no automatic deletion of old evidence).
The module-level singleton client means the MinIO connection configuration cannot be changed at runtime (acceptable for the current deployment model).

Risks:

If the MinIO container is lost and the volume is not backed up, all evidence files are permanently lost. Evidence metadata in PostgreSQL would then reference non-existent files.

Alternatives Considered

Alternative	Reason Rejected
PostgreSQL BYTEA/BLOB	Storing binary files in the database bloats backups, degrades query performance, and makes streaming large files complex. PostgreSQL is not designed as a file store.
Local filesystem	Not portable across container restarts without host volume mounts. No presigned URL support, requiring the backend to proxy all downloads. No built-in replication or management UI.
AWS S3	Requires cloud account and internet connectivity. The platform is designed for on-premise deployment where external cloud services may not be permitted.
SeaweedFS	Less mature ecosystem, smaller community. The S3-compatible layer is less complete than MinIO's. boto3 compatibility is not guaranteed.

ADR-004: Docker Compose for Deployment

Date: Project inception
Status: Accepted

Context

Aegis is a multi-component platform deployed on-premise within organizations' security environments:

4 services: Frontend (Nginx), Backend (Uvicorn), PostgreSQL, MinIO.
Target environments range from a single server to small clusters.
Security teams typically have Docker available but may not have Kubernetes.
The platform must be installable by a security engineer (not necessarily a DevOps specialist).
Both development and production environments should use the same orchestration approach for consistency.

Decision

We chose Docker Compose as the deployment and orchestration tool, with two compose files:

docker-compose.yml — Development: source volumes mounted, dev servers, exposed ports.
docker-compose.prod.yml — Production: multi-stage builds, Nginx serving static assets, only frontend port exposed, SECRET_KEY required.

Supporting infrastructure:

scripts/install.sh — Interactive production installer that generates secrets, prompts for configuration, writes .env, and runs docker compose up -d --build.
scripts/init.sh — Development setup that waits for services, runs migrations, and seeds data.
All services are connected via an aegis-network bridge network.
Named volumes for PostgreSQL and MinIO data persistence.
Health checks on PostgreSQL (pg_isready) and backend (/health).
Service dependency ordering: backend waits for postgres: service_healthy and minio: service_started.

Consequences

Positive:

Single-command deployment: docker compose -f docker-compose.prod.yml up -d --build.
The install.sh wizard makes production setup accessible to non-DevOps personnel.
Consistent environments between development and production (same containers, same network topology).
Named volumes survive container rebuilds — data persists across upgrades.
No external dependencies beyond Docker and Docker Compose.
Multi-stage Dockerfile for frontend produces a minimal Nginx image (~25MB) from a full Node.js build stage.
Non-root user (appuser, UID 1001) in backend Dockerfile follows container security best practices.

Negative:

No built-in horizontal scaling — running multiple backend instances requires manual Nginx upstream configuration and a shared token blacklist (currently in-memory).
No rolling deployments — docker compose up -d --build causes brief downtime during image rebuilds.
No built-in secrets management — secrets are in .env files on the host filesystem.
No container orchestration beyond restart policies (restart: always).
No centralized logging — each container logs to its own stdout/stderr.

Risks:

Single point of failure: if the host machine goes down, all services go down.
No automated backup strategy — pg_dump is documented but not automated.

Alternatives Considered

Alternative	Reason Rejected
Kubernetes (k8s)	Significantly higher operational complexity. Requires a cluster, kubectl expertise, Helm charts or manifests, ingress controllers, PVCs. Overkill for a single-server deployment targeting security teams.
Docker Swarm	Adds orchestration complexity with minimal benefit over Compose for < 5 services. The project does not need multi-node scheduling or service mesh. Swarm's future is uncertain compared to Compose V2.
Bare metal / systemd	Loses containerization benefits (isolation, reproducibility, dependency management). Would require manual installation of Python, Node.js, PostgreSQL, MinIO on each target system.
Ansible + Docker	Adds a configuration management layer that is unnecessary for a 4-service application. Could be valuable in the future for multi-server deployments but is premature now.

ADR-005: Modular Monolith over Microservices

Date: Project inception
Status: Accepted

Context

Aegis has distinct functional domains that could theoretically be separate services:

Test Workflow — Red/Blue validation state machine, evidence management.
Coverage Analytics — Scoring engine, heatmaps, metrics, reports.
Data Import — 8 external source integrations (MITRE, Sigma, Elastic, CALDERA, etc.).
Campaign Management — Campaign lifecycle, scheduling, threat actor generation.
Compliance — Framework mappings, gap analysis, control tracking.
User/Auth — Authentication, RBAC, audit logging.

However:

These domains share the same database and have tight data dependencies (e.g., scoring reads tests, techniques, detection rules, and D3FEND mappings in a single calculation).
The development team is small.
The deployment target is single-server Docker Compose.
Latency between services would complicate the scoring engine (which aggregates across 5+ tables).

Decision

We chose a modular monolith architecture: a single deployable backend process organized into internal modules (routers, services, models) rather than separate microservices.

Module boundaries:

Routers (21 files) — HTTP endpoint definitions grouped by domain.
Services (20 files) — Business logic grouped by capability (workflow, scoring, notifications, imports).
Models (18 files) — ORM entities grouped by domain concept.
Schemas (10 files) — Pydantic DTOs grouped by domain concept.

All modules share a single database, a single process, and a single deployment artifact.

Consequences

Positive:

No network overhead between domains — scoring can join 5+ tables in a single SQL query.
Single deployment artifact simplifies CI/CD, monitoring, and debugging.
Shared database means ACID transactions across domains (e.g., creating a test + logging the audit entry + sending a notification in one commit).
No service discovery, API gateways, circuit breakers, or distributed tracing needed.
Faster development iteration — change any module, rebuild one container.

Negative:

All domains scale together — cannot scale the data import workers independently from the API.
A bug in one module (e.g., a memory leak in scoring) can crash the entire application.
Module boundaries are not enforced at the language level — routers currently import services and models freely across domains (e.g., heatmap.py imports 6 models from different domains).
The monolith has grown to 21 routers and 20 services without explicit boundary enforcement, leading to "fat controllers" and cross-cutting concerns.

Risks:

Without explicit module boundaries (enforced by code structure or linting rules), the modular monolith can degrade into a traditional monolith where everything depends on everything.
The Clean Architecture refactor proposed in ARCHITECTURAL_ANALYSIS.md would restore module boundaries via the domain/application/infrastructure/presentation layers.
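
One way such a boundary could look in code is sketched below: the service depends on a typing.Protocol, an adapter implements it, and the HTTP-facing function only translates results. All class and function names here are hypothetical and are not taken from ARCHITECTURAL_ANALYSIS.md. The in-memory repository stands in for an adapter that would wrap a SQLAlchemy session.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class DetectionRule:
    """Domain object; carries no ORM or HTTP types."""
    rule_id: str
    technique_id: str
    enabled: bool


class DetectionRuleRepository(Protocol):
    """Port the service depends on; infrastructure supplies the adapter."""
    def list_for_technique(self, technique_id: str) -> list[DetectionRule]: ...


class DetectionRuleService:
    """Business logic lives here, not in the router."""
    def __init__(self, repo: DetectionRuleRepository) -> None:
        self._repo = repo

    def enabled_rules(self, technique_id: str) -> list[DetectionRule]:
        rules = self._repo.list_for_technique(technique_id)
        return [r for r in rules if r.enabled]


class InMemoryRuleRepository:
    """Stand-in adapter; a real one would query the database."""
    def __init__(self, rules: list[DetectionRule]) -> None:
        self._rules = rules

    def list_for_technique(self, technique_id: str) -> list[DetectionRule]:
        return [r for r in self._rules if r.technique_id == technique_id]


def list_enabled_rules_handler(service: DetectionRuleService,
                               technique_id: str) -> list[dict]:
    """Thin adapter: turn the service result into a response payload."""
    return [{"id": r.rule_id, "technique": r.technique_id}
            for r in service.enabled_rules(technique_id)]
```

Because the service only sees the protocol, an import-linting rule can forbid routers from importing ORM models directly while the service remains testable without a database.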

Alternatives Considered

Alternative	Reason Rejected
Microservices	The 8 data source integrations would each become a service, requiring inter-service communication for writing to the shared technique/rule tables. Scoring would need to call 3-4 services to gather data, adding latency and failure modes. Operational overhead (8+ containers, service mesh, distributed tracing) unjustified for a small team and single-server deployment.
Microservices with shared DB	Anti-pattern. Multiple services sharing a database lose the main benefit of microservices (independent deployment and schema evolution) while keeping the operational complexity.
Modular monolith with enforced boundaries	This is the recommended evolution (see ADR analysis). The current implementation has module structure but no boundary enforcement. Adding domain-layer interfaces (Protocol/ABC), a repository pattern, and import linting rules would achieve this without a microservices migration.

ADR-006: APScheduler In-Process over External Job System

Date: Project inception
Status: Accepted

Context

Aegis requires periodic background tasks:

Task	Frequency	Duration	Description
MITRE ATT&CK sync	Every 24 hours	30-120 seconds	Download STIX/TAXII feed, upsert ~700 techniques
Intel scan	Every 7 days	10-60 seconds	Scan threat intelligence sources
Notification cleanup	Every 24 hours	< 5 seconds	Delete read notifications older than 90 days
Coverage snapshot	Weekly (Sunday 00:00)	5-30 seconds	Capture point-in-time coverage state across all techniques
Recurring campaigns	Every 24 hours	< 10 seconds	Check and spawn due recurring test campaigns

Requirements:

Jobs must access the same database as the API.
Jobs must not block API request handling.
No additional infrastructure should be required beyond what Docker Compose already provides.
Job failure should not crash the API server.
Jobs do not need distributed execution (single-server deployment).

Decision

We chose APScheduler (BackgroundScheduler) running as an in-process thread within the FastAPI application.

Implementation details:

The scheduler is started during FastAPI's lifespan startup event and shut down on application exit.
Each job function creates its own SessionLocal() instance, independent from request-scoped sessions.
All jobs use try/except/finally to ensure sessions are closed even on failure.
Jobs are registered with replace_existing=True to handle server restarts cleanly.
The scheduler is a module-level singleton in jobs/mitre_sync_job.py.
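
The per-job session handling described above might be factored into a small wrapper like the one below. This is a sketch under assumptions: run_job, the logger name, and the commit/rollback placement are illustrative rather than the actual code in jobs/mitre_sync_job.py. The session_factory argument plays the role of SessionLocal, and the session only needs commit(), rollback(), and close().

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("aegis.jobs")  # logger name is an assumption


def run_job(name: str, session_factory: Callable[[], Any],
            work: Callable[[Any], None]) -> bool:
    """Run one scheduled job with its own database session.

    Errors raised by `work` or by the commit are logged and not re-raised,
    so a failing job does not propagate into the scheduler thread.
    Returns True on success and False on failure.
    """
    session = session_factory()   # e.g. SessionLocal() in the real app
    try:
        work(session)
        session.commit()
        return True
    except Exception:
        session.rollback()
        logger.exception("job %s failed", name)
        return False
    finally:
        session.close()           # always release the connection
```

A job registered with the scheduler would then call run_job with its own name and work function, keeping the try/except/finally logic in one place instead of repeating it in every job.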

Consequences

Positive:

Zero additional infrastructure — no message broker, no worker containers, no job database.
Jobs share the same Python process, so they can import services directly (sync_mitre, scan_intel, create_snapshot, etc.) without serialization or RPC.
Simple debugging — job logs appear in the same stdout as API logs.
Session isolation per job prevents interference with request-scoped transactions.
replace_existing=True prevents duplicate job registrations on hot reload.

Negative:

No persistence: If the server crashes mid-job, the job state is lost. There is no retry mechanism — the job simply runs again at the next scheduled interval.
No distributed execution: Cannot run jobs on a separate worker node. If the API is under heavy load, jobs compete for the same CPU and memory.
No dead letter queue: Failed jobs are logged but not queued for retry. A failed MITRE sync silently waits 24 hours before trying again.
No job history: There is no record of when jobs last ran, how long they took, or whether they succeeded — only log lines.
Single-instance constraint: If multiple backend instances are running (horizontal scaling), each instance runs its own scheduler, causing duplicate job execution (double MITRE sync, double snapshots, etc.).
No manual trigger via scheduler: Admin-triggered syncs go through the API endpoints (/api/v1/system/*), bypassing the scheduler entirely. There are effectively two paths to the same operations.

Risks:

The single-instance constraint is the most significant risk. If Aegis scales horizontally, APScheduler must be replaced or augmented with a distributed lock (e.g., PostgreSQL advisory locks or Redis-based locking).
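
A PostgreSQL advisory lock guard could look roughly like the sketch below. It assumes a DB-API cursor using the %s parameter style (as psycopg does) on a connection that stays open for the whole job, because session-level advisory locks belong to the connection. The helper names advisory_key and single_runner are hypothetical. pg_try_advisory_lock and pg_advisory_unlock are the standard PostgreSQL functions.

```python
import hashlib
from contextlib import contextmanager
from typing import Any, Iterator


def advisory_key(job_name: str) -> int:
    """Map a job name to a stable signed 64-bit key (PostgreSQL bigint)."""
    digest = hashlib.sha256(job_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


@contextmanager
def single_runner(cursor: Any, job_name: str) -> Iterator[bool]:
    """Yield True if this process holds the lock for job_name, else False.

    The lock is non-blocking: instances that lose the race get False
    immediately and are expected to skip the job.
    """
    key = advisory_key(job_name)
    cursor.execute("SELECT pg_try_advisory_lock(%s)", (key,))
    acquired = bool(cursor.fetchone()[0])
    try:
        yield acquired
    finally:
        if acquired:
            cursor.execute("SELECT pg_advisory_unlock(%s)", (key,))


# Intended use inside a job (cur and sync_mitre come from the application):
#
#     with single_runner(cur, "mitre_sync") as leader:
#         if leader:
#             sync_mitre()
```

Hashing the job name keeps the key stable across instances and restarts, so every backend competing for the same job asks for the same lock.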

Alternatives Considered

Alternative	Reason Rejected
Celery + Redis/RabbitMQ	Requires an additional broker container (Redis or RabbitMQ), a separate worker process, and Celery configuration. Significant operational overhead for 5 periodic tasks that each run for < 2 minutes. Would be justified if job volume grows or horizontal scaling is needed.
Dramatiq + Redis	Similar to Celery but lighter. Still requires a Redis container and a separate worker process. Same operational overhead concern.
Cron jobs (host-level)	Would require the host to have cron configured and scripts that call API endpoints or run Python commands inside the container. Breaks the "single Docker Compose" deployment model. Not portable.
PostgreSQL pg_cron	Runs inside the database, limited to SQL operations. Cannot execute Python logic (downloading ZIPs, parsing YAML, upserting with business rules). Would require stored procedures or external triggers.
Kubernetes CronJobs	Requires Kubernetes. Not applicable to the Docker Compose deployment model (see ADR-004).
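APScheduler with JobStore (PostgreSQL)	APScheduler supports persistent job stores (e.g., SQLAlchemyJobStore), which would keep job definitions and next-run times across restarts. In APScheduler 3.x, however, a job store shared by several scheduler processes does not coordinate them, so on its own it would not solve the single-instance problem; it would need to be combined with a distributed lock (see Risks) or a single dedicated scheduler process. This remains a viable evolution path: same library, minimal code change. Recommended as the first upgrade when horizontal scaling is needed.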

ADR Evolution Path

The following table summarizes when each decision should be revisited:

ADR	Revisit When	Likely Evolution
ADR-001 (FastAPI)	Stable — no change needed	Add structured logging, OpenTelemetry tracing
ADR-002 (PostgreSQL + JSONB)	JSONB query performance degrades	Add GIN indexes on JSONB columns, evaluate moving high-query fields to dedicated columns
ADR-003 (MinIO)	Cloud deployment required	Swap boto3 endpoint to AWS S3 / GCS (zero code change)
ADR-004 (Docker Compose)	Multi-server deployment needed	Migrate to Kubernetes with Helm charts, or add Ansible playbooks
ADR-005 (Modular Monolith)	Team grows > 5 developers, or domains need independent scaling	Enforce boundaries first (Clean Architecture refactor), then extract high-traffic domains as services if needed
ADR-006 (APScheduler)	Horizontal scaling required, or jobs need retry/history	Add APScheduler PostgreSQL JobStore first; migrate to Celery if job complexity grows significantly
