Aegis — Architecture Decision Records (ADR)
Date: February 11, 2026
Status: All decisions are Accepted and currently in effect.
Index
| ADR | Title | Status |
|---|---|---|
| ADR-001 | FastAPI as Backend Framework | Accepted |
| ADR-002 | PostgreSQL with JSONB as Primary Database | Accepted |
| ADR-003 | MinIO for Evidence Storage | Accepted |
| ADR-004 | Docker Compose for Deployment | Accepted |
| ADR-005 | Modular Monolith over Microservices | Accepted |
| ADR-006 | APScheduler In-Process over External Job System | Accepted |
ADR-001: FastAPI as Backend Framework
Date: Project inception
Status: Accepted
Context
Aegis is an internal security platform for managing MITRE ATT&CK coverage through Red/Blue team validation workflows. The backend must:
- Expose a REST API consumed by a React SPA (21 pages, 80+ endpoints).
- Handle CRUD operations for 18+ domain entities with complex filtering and joins.
- Support file uploads (evidence) and streaming downloads (CSV/JSON exports).
- Integrate with external APIs (MITRE TAXII 2.0, GitHub REST, D3FEND REST).
- Enforce RBAC authorization across 6 roles.
- Be developed and maintained by a small team requiring fast iteration.
- Run in a containerized environment with Python as the team's primary language.
Decision
We chose FastAPI as the backend framework, served by Uvicorn (ASGI).
Key factors:
- Automatic OpenAPI/Swagger generation from type hints reduces documentation burden for 80+ endpoints.
- Pydantic integration provides request/response validation with zero boilerplate, critical for a schema-heavy domain (test workflows, scoring payloads, compliance data).
- The `Depends()` system provides clean dependency injection for auth, DB sessions, and role checks without a third-party DI container.
- Async-capable but allows synchronous route handlers, which matters because SQLAlchemy (sync) is the ORM and all external data imports are CPU/IO-bound synchronous operations.
- Performance is sufficient for an internal tool (< 100 concurrent users) without needing Go/Rust-level throughput.
- Python ecosystem gives direct access to `taxii2-client`, `pySigma`, `boto3`, `PyYAML`, and `toml`, all required for the 8 external data source integrations.
Consequences
Positive:
- Swagger UI available in development (`/docs`) for rapid API exploration and testing.
- Pydantic schemas act as living documentation for the API contract.
- The `Depends()` chain for `get_db` → `get_current_user` → `require_role()` is concise and composable.
- `python-jose` + `passlib` integrate naturally for JWT/bcrypt auth.
- SlowAPI integrates directly with FastAPI for rate limiting.
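The composability of the role check can be illustrated without the framework. The following is a minimal sketch, not the project's code: the `User` class and the role names are hypothetical, and in FastAPI the `user` argument of the inner function would be supplied by `Depends(get_current_user)` rather than passed by hand.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class User:
    # Hypothetical user model; the real one is not shown in this document.
    username: str
    role: str


def require_role(*allowed: str) -> Callable[[User], User]:
    """Build a checker that accepts only users holding one of `allowed`."""
    allowed_set = frozenset(allowed)

    def checker(user: User) -> User:
        # With FastAPI, `user` would come from Depends(get_current_user).
        if user.role not in allowed_set:
            raise PermissionError(f"role {user.role!r} is not permitted")
        return user

    return checker


# Hypothetical role names, chosen only for illustration.
admin_or_lead = require_role("admin", "red_lead")
```

Because `require_role()` returns a plain callable, each endpoint can declare its own allowed roles while reusing the same upstream user lookup.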
Negative:
- The `Depends()` system encourages passing `db: Session` directly into route handlers, which has led to routers containing raw SQLAlchemy queries instead of delegating to a service/repository layer (see ADR analysis: 11 of 21 routers query the DB directly).
- Synchronous route handlers run in FastAPI's threadpool rather than on the event loop, so a long operation (MITRE sync ZIP downloads can take 30+ seconds) holds a worker thread and keeps the HTTP request open for its full duration. This is mitigated by an Nginx proxy timeout of 300s.
- No built-in background task system beyond `BackgroundTasks` (which is request-scoped), requiring APScheduler for scheduled jobs (see ADR-006).
Risks:
- FastAPI's ease of putting logic in route handlers has contributed to "fat controllers" — this is a developer discipline issue, not a framework limitation.
Alternatives Considered
| Alternative | Reason Rejected |
|---|---|
| Django + DRF | Heavier ORM opinions, admin panel unnecessary, slower startup. Django's ORM lacks SQLAlchemy's flexibility with JSONB and complex joins. |
| Flask + Flask-RESTful | No built-in validation, no auto-generated OpenAPI, manual Swagger setup. Would require marshmallow or similar for schema validation. |
| Go (Gin/Echo) | Team's primary expertise is Python. The 8 data source integrations rely heavily on Python libraries (pySigma, taxii2-client, PyYAML). |
| NestJS (Node.js) | Would split the team across two runtimes. Python libraries for STIX/TAXII and Sigma rule parsing have no mature Node.js equivalents. |
ADR-002: PostgreSQL with JSONB as Primary Database
Date: Project inception
Status: Accepted
Context
Aegis manages a complex relational domain: techniques have tests, tests belong to campaigns, threat actors map to techniques, compliance controls map to techniques, detection rules map to techniques and tests. This is a deeply relational model with 18+ tables and many-to-many relationships.
However, several entities also carry semi-structured data that varies by source:
- Audit logs: the `details` field contains arbitrary action metadata (different structure per action type).
- Threat actors: `aliases`, `target_sectors`, `target_regions`, `references` are variable-length arrays/objects from STIX 2.0 bundles.
- Detection rules: `platforms` (array), `log_sources` (object with varying keys like `product`, `service`, `category`).
- Data sources: `last_sync_stats` (object with import-specific counters), `config` (source-specific configuration).
- Techniques: `platforms` (array of OS names from ATT&CK).
- Campaigns: `tags` (user-defined array).
This data is imported from external sources with varying schemas (STIX JSON, Sigma YAML, Elastic TOML) and must be stored without rigid column definitions.
Decision
We chose PostgreSQL 15 as the primary database, using its native JSONB column type for semi-structured fields alongside traditional relational columns for the core domain.
The schema is managed by Alembic (18 migration versions) with SQLAlchemy ORM using sqlalchemy.dialects.postgresql.JSONB.
Consequences
Positive:
- Relational integrity enforced with foreign keys for the core domain (test → technique, campaign → test, evidence → test, etc.).
- JSONB columns store variable-structure data without schema migrations when external sources change their format.
- JSONB supports GIN indexing for efficient containment queries (`@>` operator) on arrays like `platforms` and `target_sectors`.
- Single database to operate, with no need for a separate document store.
- PostgreSQL's mature ecosystem: `pg_dump` for backups, `pg_isready` for health checks, extensive monitoring tooling.
- SQLAlchemy's `JSONB` type allows Python dict/list access with full query support.
Negative:
- JSONB fields bypass ORM-level validation: the schema for `details`, `config`, `references`, etc. is only enforced by application code (Pydantic schemas on input), not by the database.
- Complex queries mixing relational joins with JSONB containment can be harder to optimize and debug.
- No GIN indexes are currently defined in migrations for JSONB columns, meaning array containment queries may perform full scans on large datasets.
- JSONB fields in audit logs make structured querying across action types difficult (e.g., "find all audit entries where details.old_state = 'draft'").
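The missing GIN indexes could be added in a follow-up migration. The sketch below only generates the DDL strings. The helper `gin_index_ddl` and the table names are assumptions for illustration, and only the column names come from the Context section above.

```python
import re

# Plain lowercase identifiers only, since names are interpolated into SQL.
_IDENT = re.compile(r"[a-z_][a-z0-9_]*")


def gin_index_ddl(table: str, column: str) -> str:
    """Return a CREATE INDEX statement for a GIN index on one JSONB column."""
    for name in (table, column):
        if not _IDENT.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return (
        f"CREATE INDEX IF NOT EXISTS ix_{table}_{column}_gin "
        f"ON {table} USING GIN ({column})"
    )


# Hypothetical table names; adjust to the actual schema.
CANDIDATES = [("techniques", "platforms"), ("threat_actors", "target_sectors")]
STATEMENTS = [gin_index_ddl(table, column) for table, column in CANDIDATES]
```

Inside an Alembic migration, each statement could be passed to `op.execute()`. The default JSONB operator class used by a plain `USING GIN (column)` index supports the `@>` containment operator.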
Risks:
- As JSONB usage grows, the boundary between "should be a column" and "should be JSONB" can blur. Currently well-contained to arrays and metadata fields.
Alternatives Considered
| Alternative | Reason Rejected |
|---|---|
| PostgreSQL without JSONB | Would require separate junction tables for every array field (technique_platforms, actor_aliases, actor_sectors, etc.), adding 10+ tables for data that is always read as a whole array. |
| MongoDB | The core domain is deeply relational (techniques ↔ tests ↔ campaigns ↔ threat actors). Modeling this in MongoDB would require denormalization, embedded documents, or manual reference integrity — trading JSONB flexibility for relational integrity loss. |
| PostgreSQL + MongoDB (dual) | Operational complexity of two database systems is unjustified for the current JSONB usage (~12 columns across 6 tables). |
| MySQL 8 with JSON | PostgreSQL's JSONB columns can be GIN-indexed directly, which makes containment queries efficient. MySQL's JSON type cannot be indexed directly and relies on generated columns, functional indexes, or multi-valued indexes. PostgreSQL also has superior support for UUID primary keys (native type vs BINARY(16)). |
ADR-003: MinIO for Evidence Storage
Date: Project inception
Status: Accepted
Context
The Red/Blue team validation workflow requires both teams to upload evidence files (screenshots, log files, PCAPs, documents) to support their test findings. Requirements:
- Files range from small screenshots (KB) to large PCAPs (hundreds of MB).
- Files must be associated with specific tests and teams (red/blue).
- Files must be downloadable by authorized users via the browser.
- Storage must be independent from the application database (no BLOBs in PostgreSQL).
- The platform is deployed on-premise via Docker Compose — cloud-native S3 is not available.
- The upload/download API must be simple and well-supported in Python.
Decision
We chose MinIO as an S3-compatible object storage system, accessed via boto3 (AWS S3 SDK for Python).
Implementation details:
- A single `evidence` bucket is auto-created on backend startup (`ensure_bucket_exists()`).
- Files are uploaded with `put_object()` using a generated UUID-based key.
- Downloads use presigned URLs (`generate_presigned_url()`) with 1-hour expiration.
- The MinIO client is a module-level singleton in `storage.py`.
- Evidence metadata (filename, MIME type, size, team, test association) is stored in PostgreSQL; only the binary content lives in MinIO.
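The UUID-based key generation might look like the sketch below. The function name `build_evidence_key` and the `test_id/team/uuid` key layout are assumptions; the source only states that keys are UUID-based and that the original filename is kept as metadata in PostgreSQL.

```python
import uuid
from pathlib import PurePosixPath


def build_evidence_key(test_id: str, team: str, filename: str) -> str:
    """Return an object key that never reuses the client-supplied name."""
    if team not in ("red", "blue"):
        raise ValueError(f"unknown team: {team!r}")
    # Keep only the extension so path segments in the filename cannot leak in.
    suffix = PurePosixPath(filename.replace("\\", "/")).suffix.lower()
    return f"{test_id}/{team}/{uuid.uuid4().hex}{suffix}"
```

The resulting string would be passed as the `Key` argument of `put_object()`. Keeping the extension is optional, since the MIME type is already stored alongside the other metadata.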
Consequences
Positive:
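- S3-compatible API means a migration to AWS S3, GCS (through its S3-compatible interoperability endpoint), or any other S3-compatible service should need only endpoint and credential configuration changes, not code changes.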
- boto3 is the most mature and well-documented S3 client library in Python.
- Presigned URLs offload download bandwidth from the backend — the browser fetches directly from MinIO.
- Binary data stays out of PostgreSQL, keeping the database lean and backups fast.
- MinIO runs as a single Docker container with a persistent volume — simple to deploy and back up.
- MinIO Console (port 9001) provides a web UI for administrators to inspect stored files.
Negative:
- Presigned URLs currently point to `minio:9000` (Docker internal hostname), which is not accessible from the browser in production without additional Nginx configuration or a public MinIO endpoint.
- No file virus scanning or content validation before storage.
- No lifecycle policies configured (no automatic deletion of old evidence).
- The module-level singleton client means the MinIO connection configuration cannot be changed at runtime (acceptable for the current deployment model).
Risks:
- If MinIO container is lost and the volume is not backed up, all evidence files are permanently lost. Evidence metadata in PostgreSQL would reference non-existent files.
Alternatives Considered
| Alternative | Reason Rejected |
|---|---|
| PostgreSQL BYTEA/BLOB | Storing binary files in the database bloats backups, degrades query performance, and makes streaming large files complex. PostgreSQL is not designed as a file store. |
| Local filesystem | Not portable across container restarts without host volume mounts. No presigned URL support, requiring the backend to proxy all downloads. No built-in replication or management UI. |
| AWS S3 | Requires cloud account and internet connectivity. The platform is designed for on-premise deployment where external cloud services may not be permitted. |
| SeaweedFS | Less mature ecosystem, smaller community. The S3-compatible layer is less complete than MinIO's. boto3 compatibility is not guaranteed. |
ADR-004: Docker Compose for Deployment
Date: Project inception
Status: Accepted
Context
Aegis is a multi-component platform deployed on-premise within organizations' security environments:
- 4 services: Frontend (Nginx), Backend (Uvicorn), PostgreSQL, MinIO.
- Target environments range from a single server to small clusters.
- Security teams typically have Docker available but may not have Kubernetes.
- The platform must be installable by a security engineer (not necessarily a DevOps specialist).
- Both development and production environments should use the same orchestration approach for consistency.
Decision
We chose Docker Compose as the deployment and orchestration tool, with two compose files:
- `docker-compose.yml` (Development): source volumes mounted, dev servers, exposed ports.
- `docker-compose.prod.yml` (Production): multi-stage builds, Nginx serving static assets, only frontend port exposed, `SECRET_KEY` required.
Supporting infrastructure:
- `scripts/install.sh`: Interactive production installer that generates secrets, prompts for configuration, writes `.env`, and runs `docker compose up -d --build`.
- `scripts/init.sh`: Development setup that waits for services, runs migrations, and seeds data.
- All services connected via an `aegis-network` bridge network.
- Named volumes for PostgreSQL and MinIO data persistence.
- Health checks on PostgreSQL (`pg_isready`) and backend (`/health`).
- Service dependency ordering: backend waits for `postgres: service_healthy` and `minio: service_started`.
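The "wait for services" step amounts to polling a health probe until it succeeds or a deadline passes. The sketch below shows that pattern in Python. It is not the content of `scripts/init.sh`, and all function names are hypothetical.

```python
import time
import urllib.request
from typing import Callable


def http_probe(url: str) -> Callable[[], bool]:
    """Build a probe that reports True when `url` answers with HTTP 200."""
    def probe() -> bool:
        with urllib.request.urlopen(url, timeout=3) as resp:
            return resp.status == 200
    return probe


def wait_until_healthy(
    probe: Callable[[], bool],
    timeout: float = 60.0,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            if probe():
                return True
        except OSError:
            # Connection refused, timeouts, and urllib errors all derive
            # from OSError while the service is still starting.
            pass
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

A caller might use `wait_until_healthy(http_probe("http://localhost:8000/health"))`, where the host and port are assumptions about the local setup.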
Consequences
Positive:
- Single-command deployment: `docker compose -f docker-compose.prod.yml up -d --build`.
- The `install.sh` wizard makes production setup accessible to non-DevOps personnel.
- Consistent environments between development and production (same containers, same network topology).
- Named volumes survive container rebuilds — data persists across upgrades.
- No external dependencies beyond Docker and Docker Compose.
- Multi-stage Dockerfile for frontend produces a minimal Nginx image (~25MB) from a full Node.js build stage.
- Non-root user (`appuser`, UID 1001) in backend Dockerfile follows container security best practices.
Negative:
- No built-in horizontal scaling — running multiple backend instances requires manual Nginx upstream configuration and a shared token blacklist (currently in-memory).
- No rolling deployments: `docker compose up -d --build` causes brief downtime during image rebuilds.
- No built-in secrets management: secrets are in `.env` files on the host filesystem.
- No container orchestration beyond restart policies (`restart: always`).
- No centralized logging: each container logs to its own stdout/stderr.
Risks:
- Single point of failure: if the host machine goes down, all services go down.
- No automated backup strategy: `pg_dump` is documented but not automated.
Alternatives Considered
| Alternative | Reason Rejected |
|---|---|
| Kubernetes (k8s) | Significantly higher operational complexity. Requires a cluster, kubectl expertise, Helm charts or manifests, ingress controllers, PVCs. Overkill for a single-server deployment targeting security teams. |
| Docker Swarm | Adds orchestration complexity with minimal benefit over Compose for < 5 services. The project does not need multi-node scheduling or service mesh. Swarm's future is uncertain compared to Compose V2. |
| Bare metal / systemd | Loses containerization benefits (isolation, reproducibility, dependency management). Would require manual installation of Python, Node.js, PostgreSQL, MinIO on each target system. |
| Ansible + Docker | Adds a configuration management layer that is unnecessary for a 4-service application. Could be valuable in the future for multi-server deployments but is premature now. |
ADR-005: Modular Monolith over Microservices
Date: Project inception
Status: Accepted
Context
Aegis has distinct functional domains that could theoretically be separate services:
- Test Workflow — Red/Blue validation state machine, evidence management.
- Coverage Analytics — Scoring engine, heatmaps, metrics, reports.
- Data Import — 8 external source integrations (MITRE, Sigma, Elastic, CALDERA, etc.).
- Campaign Management — Campaign lifecycle, scheduling, threat actor generation.
- Compliance — Framework mappings, gap analysis, control tracking.
- User/Auth — Authentication, RBAC, audit logging.
However:
- These domains share the same database and have tight data dependencies (e.g., scoring reads tests, techniques, detection rules, and D3FEND mappings in a single calculation).
- The development team is small.
- The deployment target is single-server Docker Compose.
- Latency between services would complicate the scoring engine (which aggregates across 5+ tables).
Decision
We chose a modular monolith architecture: a single deployable backend process organized into internal modules (routers, services, models) rather than separate microservices.
Module boundaries:
- Routers (21 files) — HTTP endpoint definitions grouped by domain.
- Services (20 files) — Business logic grouped by capability (workflow, scoring, notifications, imports).
- Models (18 files) — ORM entities grouped by domain concept.
- Schemas (10 files) — Pydantic DTOs grouped by domain concept.
All modules share a single database, a single process, and a single deployment artifact.
Consequences
Positive:
- No network overhead between domains — scoring can join 5+ tables in a single SQL query.
- Single deployment artifact simplifies CI/CD, monitoring, and debugging.
- Shared database means ACID transactions across domains (e.g., creating a test + logging the audit entry + sending a notification in one commit).
- No service discovery, API gateways, circuit breakers, or distributed tracing needed.
- Faster development iteration — change any module, rebuild one container.
Negative:
- All domains scale together — cannot scale the data import workers independently from the API.
- A bug in one module (e.g., a memory leak in scoring) can crash the entire application.
- Module boundaries are not enforced at the language level: routers currently import services and models freely across domains (e.g., `heatmap.py` imports 6 models from different domains).
- The monolith has grown to 21 routers and 20 services without explicit boundary enforcement, leading to "fat controllers" and cross-cutting concerns.
Risks:
- Without explicit module boundaries (enforced by code structure or linting rules), the modular monolith can degrade into a traditional monolith where everything depends on everything.
- The Clean Architecture refactor proposed in `ARCHITECTURAL_ANALYSIS.md` would restore module boundaries via the domain/application/infrastructure/presentation layers.
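Boundary enforcement through linting can start small. The sketch below uses the standard `ast` module to flag absolute imports of forbidden packages in a source file. The function name and the package names in the usage note are hypothetical, and relative imports are not resolved.

```python
import ast


def forbidden_imports(source: str, forbidden_prefixes: tuple[str, ...]) -> list[str]:
    """Return imported module names in `source` under a forbidden package."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            # Match the package itself or any submodule, but not lookalikes
            # such as "app.modelsx".
            if any(name == p or name.startswith(p + ".") for p in forbidden_prefixes):
                found.append(name)
    return sorted(found)
```

Run over every router file in CI with a prefix such as `("app.models",)`, a non-empty result would fail the build. Dedicated tools exist for this purpose, and this sketch only shows the idea.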
Alternatives Considered
| Alternative | Reason Rejected |
|---|---|
| Microservices | The 8 data source integrations would each become a service, requiring inter-service communication for writing to the shared technique/rule tables. Scoring would need to call 3-4 services to gather data, adding latency and failure modes. Operational overhead (8+ containers, service mesh, distributed tracing) unjustified for a small team and single-server deployment. |
| Microservices with shared DB | Anti-pattern. Multiple services sharing a database lose the main benefit of microservices (independent deployment and schema evolution) while keeping the operational complexity. |
| Modular monolith with enforced boundaries | This is the recommended evolution (see ADR analysis). The current implementation has module structure but no boundary enforcement. Adding domain-layer interfaces (Protocol/ABC), a repository pattern, and import linting rules would achieve this without a microservices migration. |
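A domain-layer interface with a repository behind it could take the following shape. This is a minimal sketch under stated assumptions: the `Technique` fields, the `TechniqueRepository` protocol, and the `rename_technique` use case are all hypothetical and stand in for whatever the refactor would define.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class Technique:
    # Hypothetical domain entity with only two fields for brevity.
    technique_id: str
    name: str


class TechniqueRepository(Protocol):
    """Interface the domain layer depends on; no ORM types appear here."""
    def get(self, technique_id: str) -> Optional[Technique]: ...
    def save(self, technique: Technique) -> None: ...


class InMemoryTechniqueRepository:
    """Test double; a SQLAlchemy-backed class would expose the same methods."""
    def __init__(self) -> None:
        self._rows: dict[str, Technique] = {}

    def get(self, technique_id: str) -> Optional[Technique]:
        return self._rows.get(technique_id)

    def save(self, technique: Technique) -> None:
        self._rows[technique.technique_id] = technique


def rename_technique(repo: TechniqueRepository, technique_id: str, name: str) -> Technique:
    """Use case that touches storage only through the repository interface."""
    technique = repo.get(technique_id)
    if technique is None:
        raise LookupError(technique_id)
    technique.name = name
    repo.save(technique)
    return technique
```

With this shape a router would call `rename_technique()` instead of issuing queries itself, and the use case can be unit-tested against the in-memory repository without a database.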
ADR-006: APScheduler In-Process over External Job System
Date: Project inception
Status: Accepted
Context
Aegis requires periodic background tasks:
| Task | Frequency | Duration | Description |
|---|---|---|---|
| MITRE ATT&CK sync | Every 24 hours | 30-120 seconds | Download STIX/TAXII feed, upsert ~700 techniques |
| Intel scan | Every 7 days | 10-60 seconds | Scan threat intelligence sources |
| Notification cleanup | Every 24 hours | < 5 seconds | Delete read notifications older than 90 days |
| Coverage snapshot | Weekly (Sunday 00:00) | 5-30 seconds | Capture point-in-time coverage state across all techniques |
| Recurring campaigns | Every 24 hours | < 10 seconds | Check and spawn due recurring test campaigns |
Requirements:
- Jobs must access the same database as the API.
- Jobs must not block API request handling.
- No additional infrastructure should be required beyond what Docker Compose already provides.
- Job failure should not crash the API server.
- Jobs do not need distributed execution (single-server deployment).
Decision
We chose APScheduler (BackgroundScheduler) running as an in-process thread within the FastAPI application.
Implementation details:
- The scheduler is started during FastAPI's `lifespan` startup event and shut down on application exit.
- Each job function creates its own `SessionLocal()` instance, independent from request-scoped sessions.
- All jobs use try/except/finally to ensure sessions are closed even on failure.
- Jobs are registered with `replace_existing=True` to handle server restarts cleanly.
- The scheduler is a module-level singleton in `jobs/mitre_sync_job.py`.
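The per-job session handling described above follows a common pattern, sketched here with a generic wrapper. The `run_job` helper and the `SessionLike` protocol are assumptions for illustration. In the real code each job function manages its own `SessionLocal()` instance, and whether it commits or rolls back explicitly is not stated in this document.

```python
import logging
from typing import Callable, Protocol

logger = logging.getLogger("jobs")


class SessionLike(Protocol):
    """The subset of a SQLAlchemy session that the wrapper relies on."""
    def commit(self) -> None: ...
    def rollback(self) -> None: ...
    def close(self) -> None: ...


def run_job(
    name: str,
    session_factory: Callable[[], SessionLike],
    work: Callable[[SessionLike], None],
) -> bool:
    """Run `work` with a fresh session; never let an exception escape."""
    session = session_factory()
    try:
        work(session)
        session.commit()
        return True
    except Exception:
        # A failed job is rolled back and logged, not re-raised, so the
        # scheduler thread and the API process keep running.
        session.rollback()
        logger.exception("job %s failed", name)
        return False
    finally:
        session.close()
```

The boolean return value is a convenience for callers that want to record success or failure. It is not something the scheduler itself consumes.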
Consequences
Positive:
- Zero additional infrastructure — no message broker, no worker containers, no job database.
- Jobs share the same Python process, so they can import services directly (`sync_mitre`, `scan_intel`, `create_snapshot`, etc.) without serialization or RPC.
- Simple debugging: job logs appear in the same stdout as API logs.
- Session isolation per job prevents interference with request-scoped transactions.
- `replace_existing=True` prevents duplicate job registrations on hot reload.
Negative:
- No persistence: If the server crashes mid-job, the job state is lost. There is no retry mechanism — the job simply runs again at the next scheduled interval.
- No distributed execution: Cannot run jobs on a separate worker node. If the API is under heavy load, jobs compete for the same CPU and memory.
- No dead letter queue: Failed jobs are logged but not queued for retry. A failed MITRE sync silently waits 24 hours before trying again.
- No job history: There is no record of when jobs last ran, how long they took, or whether they succeeded — only log lines.
- Single-instance constraint: If multiple backend instances are running (horizontal scaling), each instance runs its own scheduler, causing duplicate job execution (double MITRE sync, double snapshots, etc.).
- No manual trigger via scheduler: Admin-triggered syncs go through the API endpoints (`/api/v1/system/*`), bypassing the scheduler entirely. There are effectively two paths to the same operations.
Risks:
- The single-instance constraint is the most significant risk. If Aegis scales horizontally, APScheduler must be replaced or augmented with a distributed lock (e.g., PostgreSQL advisory locks or Redis-based locking).
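For the PostgreSQL advisory lock option, each job needs a stable integer key, because `pg_try_advisory_lock` takes a `bigint`. The sketch below derives one from the job name. The helper name, the job names, and the `%(key)s` placeholder style (psycopg-style parameters) are assumptions.

```python
import hashlib
import struct


def advisory_lock_key(job_name: str) -> int:
    """Map a job name to a stable signed 64-bit key for an advisory lock."""
    digest = hashlib.sha256(job_name.encode("utf-8")).digest()
    # ">q" reads the first 8 bytes as a big-endian signed 64-bit integer,
    # which fits PostgreSQL's bigint argument.
    return struct.unpack(">q", digest[:8])[0]


# Statements a job would run before and after its work.
TRY_LOCK_SQL = "SELECT pg_try_advisory_lock(%(key)s)"
UNLOCK_SQL = "SELECT pg_advisory_unlock(%(key)s)"
```

Each backend instance would run `TRY_LOCK_SQL` before a job and skip the run when the result is false. Session-level advisory locks belong to the database connection, so the lock and unlock statements must be issued on the same connection.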
Alternatives Considered
| Alternative | Reason Rejected |
|---|---|
| Celery + Redis/RabbitMQ | Requires an additional broker container (Redis or RabbitMQ), a separate worker process, and Celery configuration. Significant operational overhead for 5 periodic tasks that each run for < 2 minutes. Would be justified if job volume grows or horizontal scaling is needed. |
| Dramatiq + Redis | Similar to Celery but lighter. Still requires a Redis container and a separate worker process. Same operational overhead concern. |
| Cron jobs (host-level) | Would require the host to have cron configured and scripts that call API endpoints or run Python commands inside the container. Breaks the "single Docker Compose" deployment model. Not portable. |
| PostgreSQL pg_cron | Runs inside the database, limited to SQL operations. Cannot execute Python logic (downloading ZIPs, parsing YAML, upserting with business rules). Would require stored procedures or external triggers. |
| Kubernetes CronJobs | Requires Kubernetes. Not applicable to the Docker Compose deployment model (see ADR-004). |
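| APScheduler with JobStore (PostgreSQL) | APScheduler supports persistent job stores (e.g., `SQLAlchemyJobStore`) that keep job definitions and next run times across restarts. In APScheduler 3.x, however, a job store must not be shared between multiple scheduler processes, so persistence alone does not solve the single-instance problem; it would need to be paired with a distributed lock (see Risks). This remains a viable evolution path: same library, minimal code change. Recommended as the first upgrade when horizontal scaling is needed. |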
ADR Evolution Path
The following table summarizes when each decision should be revisited:
| ADR | Revisit When | Likely Evolution |
|---|---|---|
| ADR-001 (FastAPI) | Stable — no change needed | Add structured logging, OpenTelemetry tracing |
| ADR-002 (PostgreSQL + JSONB) | JSONB query performance degrades | Add GIN indexes on JSONB columns, evaluate moving high-query fields to dedicated columns |
| ADR-003 (MinIO) | Cloud deployment required | Swap boto3 endpoint and credentials to AWS S3 / GCS (configuration change only) |
| ADR-004 (Docker Compose) | Multi-server deployment needed | Migrate to Kubernetes with Helm charts, or add Ansible playbooks |
| ADR-005 (Modular Monolith) | Team grows > 5 developers, or domains need independent scaling | Enforce boundaries first (Clean Architecture refactor), then extract high-traffic domains as services if needed |
| ADR-006 (APScheduler) | Horizontal scaling required, or jobs need retry/history | Add an APScheduler PostgreSQL JobStore plus a distributed lock first; migrate to Celery if job complexity grows significantly |