# Aegis: Architecture Decision Records (ADR)

> **Date:** February 11, 2026
> **Status:** All decisions are **Accepted** and currently in effect.

---

## Index

| ADR | Title | Status |
|-----|-------|--------|
| [ADR-001](#adr-001-fastapi-as-backend-framework) | FastAPI as Backend Framework | Accepted |
| [ADR-002](#adr-002-postgresql-with-jsonb-as-primary-database) | PostgreSQL with JSONB as Primary Database | Accepted |
| [ADR-003](#adr-003-minio-for-evidence-storage) | MinIO for Evidence Storage | Accepted |
| [ADR-004](#adr-004-docker-compose-for-deployment) | Docker Compose for Deployment | Accepted |
| [ADR-005](#adr-005-modular-monolith-over-microservices) | Modular Monolith over Microservices | Accepted |
| [ADR-006](#adr-006-apscheduler-in-process-over-external-job-system) | APScheduler In-Process over External Job System | Accepted |

---

## ADR-001: FastAPI as Backend Framework

**Date:** Project inception
**Status:** Accepted

### Context

Aegis is an internal security platform for managing MITRE ATT&CK coverage through Red/Blue team validation workflows. The backend must:

- Expose a REST API consumed by a React SPA (21 pages, 80+ endpoints).
- Handle CRUD operations for 18+ domain entities with complex filtering and joins.
- Support file uploads (evidence) and streaming downloads (CSV/JSON exports).
- Integrate with external APIs (MITRE TAXII 2.0, GitHub REST, D3FEND REST).
- Enforce RBAC authorization across 6 roles.
- Be developed and maintained by a small team requiring fast iteration.
- Run in a containerized environment with Python as the team's primary language.

### Decision

We chose **FastAPI** as the backend framework, served by **Uvicorn** (ASGI). Key factors:

- **Automatic OpenAPI/Swagger** generation from type hints reduces documentation burden for 80+ endpoints.
- **Pydantic integration** provides request/response validation with zero boilerplate, critical for a schema-heavy domain (test workflows, scoring payloads, compliance data).
- **`Depends()` system** provides clean dependency injection for auth, DB sessions, and role checks without a third-party DI container.
- **Async-capable** but allows synchronous route handlers, which matters because SQLAlchemy (sync) is the ORM and all external data imports are synchronous CPU- or I/O-bound operations.
- **Performance** is sufficient for an internal tool (< 100 concurrent users) without needing Go/Rust-level throughput.
- **Python ecosystem** gives direct access to `taxii2-client`, `pySigma`, `boto3`, `PyYAML`, and `toml`, all required for the 8 external data source integrations.

### Consequences

**Positive:**

- Swagger UI available in development (`/docs`) for rapid API exploration and testing.
- Pydantic schemas act as living documentation for the API contract.
- The `Depends()` chain `get_db` → `get_current_user` → `require_role()` is concise and composable.
- `python-jose` + `passlib` integrate naturally for JWT/bcrypt auth.
- SlowAPI integrates directly with FastAPI for rate limiting.

**Negative:**

- The `Depends()` system encourages passing `db: Session` directly into route handlers, which has led to routers containing raw SQLAlchemy queries instead of delegating to a service/repository layer (see ADR analysis: 11 of 21 routers query the DB directly).
- Long-running synchronous route handlers tie up a worker thread for their full duration (FastAPI runs plain `def` handlers in a threadpool; the same blocking call inside an `async def` handler would block the event loop). MITRE sync ZIP downloads can take 30+ seconds, mitigated by an Nginx proxy timeout of 300s.
- No built-in background task system beyond `BackgroundTasks` (which is request-scoped), requiring APScheduler for scheduled jobs (see ADR-006).

**Risks:**

- FastAPI's ease of putting logic in route handlers has contributed to "fat controllers". This is a developer discipline issue, not a framework limitation.

### Alternatives Considered

| Alternative | Reason Rejected |
|------------|-----------------|
| **Django + DRF** | Heavier ORM opinions, admin panel unnecessary, slower startup. Django's ORM lacks SQLAlchemy's flexibility with JSONB and complex joins. |
| **Flask + Flask-RESTful** | No built-in validation, no auto-generated OpenAPI, manual Swagger setup. Would require marshmallow or similar for schema validation. |
| **Go (Gin/Echo)** | Team's primary expertise is Python. The 8 data source integrations rely heavily on Python libraries (pySigma, taxii2-client, PyYAML). |
| **NestJS (Node.js)** | Would split the team across two runtimes. Python libraries for STIX/TAXII and Sigma rule parsing have no mature Node.js equivalents. |

---

## ADR-002: PostgreSQL with JSONB as Primary Database

**Date:** Project inception
**Status:** Accepted

### Context

Aegis manages a complex relational domain: techniques have tests, tests belong to campaigns, threat actors map to techniques, compliance controls map to techniques, detection rules map to techniques and tests. This is a deeply relational model with 18+ tables and many-to-many relationships.

However, several entities also carry semi-structured data that varies by source:

- **Audit logs**: the `details` field contains arbitrary action metadata (different structure per action type).
- **Threat actors**: `aliases`, `target_sectors`, `target_regions`, `references` are variable-length arrays/objects from STIX 2.0 bundles.
- **Detection rules**: `platforms` (array), `log_sources` (object with varying keys like `product`, `service`, `category`).
- **Data sources**: `last_sync_stats` (object with import-specific counters), `config` (source-specific configuration).
- **Techniques**: `platforms` (array of OS names from ATT&CK).
- **Campaigns**: `tags` (user-defined array).

This data is imported from external sources with varying schemas (STIX JSON, Sigma YAML, Elastic TOML) and must be stored without rigid column definitions.
### Decision

We chose **PostgreSQL 15** as the primary database, using its native **JSONB** column type for semi-structured fields alongside traditional relational columns for the core domain.

The schema is managed by **Alembic** (18 migration versions) with **SQLAlchemy** ORM using `sqlalchemy.dialects.postgresql.JSONB`.

### Consequences

**Positive:**

- Relational integrity enforced with foreign keys for the core domain (test → technique, campaign → test, evidence → test, etc.).
- JSONB columns store variable-structure data without schema migrations when external sources change their format.
- JSONB supports GIN indexing for efficient containment queries (`@>` operator) on arrays like `platforms` and `target_sectors`.
- Single database to operate, with no need for a separate document store.
- PostgreSQL's mature ecosystem: `pg_dump` for backups, `pg_isready` for health checks, extensive monitoring tooling.
- SQLAlchemy's `JSONB` type allows Python dict/list access with full query support.

**Negative:**

- JSONB fields bypass ORM-level validation: the schema for `details`, `config`, `references` etc. is only enforced by application code (Pydantic schemas on input), not by the database.
- Complex queries mixing relational joins with JSONB containment can be harder to optimize and debug.
- No GIN indexes are currently defined in migrations for JSONB columns, meaning array containment queries may perform full scans on large datasets.
- JSONB fields in audit logs make structured querying across action types difficult (e.g., "find all audit entries where details.old_state = 'draft'").

**Risks:**

- As JSONB usage grows, the boundary between "should be a column" and "should be JSONB" can blur. Currently well-contained to arrays and metadata fields.
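
The missing GIN indexes noted above could be added in a follow-up migration. The sketch below only builds the SQL statements and makes several assumptions: the table names `techniques` and `threat_actors` are guesses for illustration (the source names the columns `platforms` and `target_sectors`, not their tables), and both helper functions are hypothetical.

```python
from __future__ import annotations

import json

# Hypothetical (table, column) pairs. The table names are assumed for
# illustration and are not taken from the actual Aegis schema.
JSONB_ARRAY_COLUMNS = [
    ("techniques", "platforms"),
    ("threat_actors", "target_sectors"),
]


def gin_index_ddl(table: str, column: str) -> str:
    """Build a CREATE INDEX statement for a GIN index on one JSONB column."""
    index_name = f"ix_{table}_{column}_gin"
    # The default jsonb_ops operator class supports the @> containment operator.
    return f"CREATE INDEX IF NOT EXISTS {index_name} ON {table} USING GIN ({column})"


def containment_query(table: str, column: str, values: list[str]) -> tuple[str, tuple[str]]:
    """Return a parameterized query for rows whose JSONB array contains all values.

    Identifiers are interpolated directly, so they must come from a fixed
    allow-list such as JSONB_ARRAY_COLUMNS, never from user input.
    """
    sql = f"SELECT id FROM {table} WHERE {column} @> %s::jsonb"
    return sql, (json.dumps(values),)


statements = [gin_index_ddl(table, column) for table, column in JSONB_ARRAY_COLUMNS]
```

In an Alembic migration, each statement could be passed to `op.execute()` inside `upgrade()`, with a matching `DROP INDEX` in `downgrade()`. Whether the planner actually uses the index should be confirmed with `EXPLAIN` on realistic data volumes.
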
### Alternatives Considered

| Alternative | Reason Rejected |
|------------|-----------------|
| **PostgreSQL without JSONB** | Would require separate junction tables for every array field (technique_platforms, actor_aliases, actor_sectors, etc.), adding 10+ tables for data that is always read as a whole array. |
| **MongoDB** | The core domain is deeply relational (techniques ↔ tests ↔ campaigns ↔ threat actors). Modeling this in MongoDB would require denormalization, embedded documents, or manual reference integrity, giving up relational integrity in exchange for document flexibility. |
| **PostgreSQL + MongoDB (dual)** | Operational complexity of two database systems is unjustified for the current JSONB usage (~12 columns across 6 tables). |
| **MySQL 8 with JSON** | PostgreSQL's JSONB can be GIN-indexed directly for containment queries (`@>`), whereas MySQL's JSON type relies on generated-column, functional, or multi-valued indexes. PostgreSQL also has superior support for UUID primary keys (native type vs BINARY(16)). |

---

## ADR-003: MinIO for Evidence Storage

**Date:** Project inception
**Status:** Accepted

### Context

The Red/Blue team validation workflow requires both teams to upload evidence files (screenshots, log files, PCAPs, documents) to support their test findings. Requirements:

- Files range from small screenshots (KB) to large PCAPs (hundreds of MB).
- Files must be associated with specific tests and teams (red/blue).
- Files must be downloadable by authorized users via the browser.
- Storage must be independent from the application database (no BLOBs in PostgreSQL).
- The platform is deployed on-premise via Docker Compose, so cloud-native S3 is not available.
- The upload/download API must be simple and well-supported in Python.

### Decision

We chose **MinIO** as an S3-compatible object storage system, accessed via **boto3** (AWS S3 SDK for Python).

Implementation details:

- A single `evidence` bucket is auto-created on backend startup (`ensure_bucket_exists()`).
- Files are uploaded with `put_object()` using a generated UUID-based key.
- Downloads use presigned URLs (`generate_presigned_url()`) with 1-hour expiration.
- The MinIO client is a module-level singleton in `storage.py`.
- Evidence metadata (filename, MIME type, size, team, test association) is stored in PostgreSQL; only the binary content lives in MinIO.

### Consequences

**Positive:**

- S3-compatible API means zero code changes if migrating to AWS S3, GCS, or any S3-compatible service.
- boto3 is the most mature and well-documented S3 client library in Python.
- Presigned URLs offload download bandwidth from the backend: the browser fetches directly from MinIO.
- Binary data stays out of PostgreSQL, keeping the database lean and backups fast.
- MinIO runs as a single Docker container with a persistent volume, so it is simple to deploy and back up.
- MinIO Console (port 9001) provides a web UI for administrators to inspect stored files.

**Negative:**

- Presigned URLs currently point to `minio:9000` (Docker internal hostname), which is not accessible from the browser in production without additional Nginx configuration or a public MinIO endpoint.
- No file virus scanning or content validation before storage.
- No lifecycle policies configured (no automatic deletion of old evidence).
- The module-level singleton client means the MinIO connection configuration cannot be changed at runtime (acceptable for the current deployment model).

**Risks:**

- If the MinIO container is lost and the volume is not backed up, all evidence files are permanently lost. Evidence metadata in PostgreSQL would reference non-existent files.

### Alternatives Considered

| Alternative | Reason Rejected |
|------------|-----------------|
| **PostgreSQL BYTEA/BLOB** | Storing binary files in the database bloats backups, degrades query performance, and makes streaming large files complex. PostgreSQL is not designed as a file store. |
| **Local filesystem** | Not portable across container restarts without host volume mounts. No presigned URL support, requiring the backend to proxy all downloads. No built-in replication or management UI. |
| **AWS S3** | Requires cloud account and internet connectivity. The platform is designed for on-premise deployment where external cloud services may not be permitted. |
| **SeaweedFS** | Less mature ecosystem, smaller community. The S3-compatible layer is less complete than MinIO's. boto3 compatibility is not guaranteed. |

---

## ADR-004: Docker Compose for Deployment

**Date:** Project inception
**Status:** Accepted

### Context

Aegis is a multi-component platform deployed on-premise within organizations' security environments:

- 4 services: Frontend (Nginx), Backend (Uvicorn), PostgreSQL, MinIO.
- Target environments range from a single server to small clusters.
- Security teams typically have Docker available but may not have Kubernetes.
- The platform must be installable by a security engineer (not necessarily a DevOps specialist).
- Both development and production environments should use the same orchestration approach for consistency.

### Decision

We chose **Docker Compose** as the deployment and orchestration tool, with two compose files:

- `docker-compose.yml` (development): source volumes mounted, dev servers, exposed ports.
- `docker-compose.prod.yml` (production): multi-stage builds, Nginx serving static assets, only frontend port exposed, `SECRET_KEY` required.

Supporting infrastructure:

- `scripts/install.sh`: interactive production installer that generates secrets, prompts for configuration, writes `.env`, and runs `docker compose up -d --build`.
- `scripts/init.sh`: development setup that waits for services, runs migrations, and seeds data.
- All services connected via an `aegis-network` bridge network.
- Named volumes for PostgreSQL and MinIO data persistence.
- Health checks on PostgreSQL (`pg_isready`) and backend (`/health`).
- Service dependency ordering: backend waits for `postgres: service_healthy` and `minio: service_started`.

### Consequences

**Positive:**

- Single-command deployment: `docker compose -f docker-compose.prod.yml up -d --build`.
- The `install.sh` wizard makes production setup accessible to non-DevOps personnel.
- Consistent environments between development and production (same containers, same network topology).
- Named volumes survive container rebuilds, so data persists across upgrades.
- No external dependencies beyond Docker and Docker Compose.
- Multi-stage Dockerfile for frontend produces a minimal Nginx image (~25MB) from a full Node.js build stage.
- Non-root user (`appuser`, UID 1001) in backend Dockerfile follows container security best practices.

**Negative:**

- No built-in horizontal scaling: running multiple backend instances requires manual Nginx upstream configuration and a shared token blacklist (currently in-memory).
- No rolling deployments: `docker compose up -d --build` causes brief downtime during image rebuilds.
- No built-in secrets management: secrets are in `.env` files on the host filesystem.
- No container orchestration beyond restart policies (`restart: always`).
- No centralized logging: each container logs to its own stdout/stderr.

**Risks:**

- Single point of failure: if the host machine goes down, all services go down.
- No automated backup strategy: `pg_dump` is documented but not automated.

### Alternatives Considered

| Alternative | Reason Rejected |
|------------|-----------------|
| **Kubernetes (k8s)** | Significantly higher operational complexity. Requires a cluster, kubectl expertise, Helm charts or manifests, ingress controllers, PVCs. Overkill for a single-server deployment targeting security teams. |
| **Docker Swarm** | Adds orchestration complexity with minimal benefit over Compose for < 5 services. The project does not need multi-node scheduling or service mesh. Swarm's future is uncertain compared to Compose V2. |
| **Bare metal / systemd** | Loses containerization benefits (isolation, reproducibility, dependency management). Would require manual installation of Python, Node.js, PostgreSQL, MinIO on each target system. |
| **Ansible + Docker** | Adds a configuration management layer that is unnecessary for a 4-service application. Could be valuable in the future for multi-server deployments but is premature now. |

---

## ADR-005: Modular Monolith over Microservices

**Date:** Project inception
**Status:** Accepted

### Context

Aegis has distinct functional domains that could theoretically be separate services:

- **Test Workflow**: Red/Blue validation state machine, evidence management.
- **Coverage Analytics**: scoring engine, heatmaps, metrics, reports.
- **Data Import**: 8 external source integrations (MITRE, Sigma, Elastic, CALDERA, etc.).
- **Campaign Management**: campaign lifecycle, scheduling, threat actor generation.
- **Compliance**: framework mappings, gap analysis, control tracking.
- **User/Auth**: authentication, RBAC, audit logging.

However:

- These domains share the same database and have tight data dependencies (e.g., scoring reads tests, techniques, detection rules, and D3FEND mappings in a single calculation).
- The development team is small.
- The deployment target is single-server Docker Compose.
- Latency between services would complicate the scoring engine (which aggregates across 5+ tables).

### Decision

We chose a **modular monolith** architecture: a single deployable backend process organized into internal modules (routers, services, models) rather than separate microservices.

Module boundaries:

- **Routers** (21 files): HTTP endpoint definitions grouped by domain.
- **Services** (20 files): business logic grouped by capability (workflow, scoring, notifications, imports).
- **Models** (18 files): ORM entities grouped by domain concept.
- **Schemas** (10 files): Pydantic DTOs grouped by domain concept.
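
The layering listed above is a convention; Python itself does not enforce it. As an illustration of how such a convention could be checked mechanically, here is a minimal sketch using the standard-library `ast` module. The package names (`app.models`, `app.services`) and the rule itself (router modules should not import ORM models directly) are assumptions for illustration, not the project's actual layout or policy.

```python
from __future__ import annotations

import ast

# Hypothetical rule: router modules should reach the database through services,
# so direct imports from the ORM package are flagged. Package names are assumed.
FORBIDDEN_PREFIXES = ("app.models",)


def forbidden_imports(source: str, forbidden: tuple[str, ...] = FORBIDDEN_PREFIXES) -> list[str]:
    """Return imported module names in `source` that fall under a forbidden prefix.

    Limitations of this sketch: relative imports are skipped, and
    `from app import models` is not detected (only the module path is checked).
    """
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        for name in names:
            if any(name == prefix or name.startswith(prefix + ".") for prefix in forbidden):
                found.append(name)
    return found


# Example router source with one allowed and two flagged imports.
router_source = """
from app.services.scoring import compute_score
from app.models.technique import Technique
import app.models.test
"""

violations = forbidden_imports(router_source)
# violations == ["app.models.technique", "app.models.test"]
```

A check like this could run in CI over every file in the routers package. A dedicated tool such as `import-linter` covers the same ground with declarative contracts and may be a better fit than a hand-rolled script.
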

All modules share a single database, a single process, and a single deployment artifact.

### Consequences

**Positive:**

- No network overhead between domains: scoring can join 5+ tables in a single SQL query.
- Single deployment artifact simplifies CI/CD, monitoring, and debugging.
- Shared database means ACID transactions across domains (e.g., creating a test + logging the audit entry + sending a notification in one commit).
- No service discovery, API gateways, circuit breakers, or distributed tracing needed.
- Faster development iteration: change any module, rebuild one container.

**Negative:**

- All domains scale together: the data import workers cannot be scaled independently from the API.
- A bug in one module (e.g., a memory leak in scoring) can crash the entire application.
- Module boundaries are not enforced at the language level: routers currently import services and models freely across domains (e.g., `heatmap.py` imports 6 models from different domains).
- The monolith has grown to 21 routers and 20 services without explicit boundary enforcement, leading to "fat controllers" and cross-cutting concerns.

**Risks:**

- Without explicit module boundaries (enforced by code structure or linting rules), the modular monolith can degrade into a traditional monolith where everything depends on everything.
- The Clean Architecture refactor proposed in `ARCHITECTURAL_ANALYSIS.md` would restore module boundaries via the domain/application/infrastructure/presentation layers.

### Alternatives Considered

| Alternative | Reason Rejected |
|------------|-----------------|
| **Microservices** | The 8 data source integrations would each become a service, requiring inter-service communication for writing to the shared technique/rule tables. Scoring would need to call 3-4 services to gather data, adding latency and failure modes. Operational overhead (8+ containers, service mesh, distributed tracing) unjustified for a small team and single-server deployment. |
| **Microservices with shared DB** | Anti-pattern. Multiple services sharing a database lose the main benefit of microservices (independent deployment and schema evolution) while keeping the operational complexity. |
| **Modular monolith with enforced boundaries** | This is the recommended evolution (see ADR analysis). The current implementation has module structure but no boundary enforcement. Adding domain-layer interfaces (Protocol/ABC), a repository pattern, and import linting rules would achieve this without a microservices migration. |

---

## ADR-006: APScheduler In-Process over External Job System

**Date:** Project inception
**Status:** Accepted

### Context

Aegis requires periodic background tasks:

| Task | Frequency | Duration | Description |
|------|-----------|----------|-------------|
| MITRE ATT&CK sync | Every 24 hours | 30-120 seconds | Download STIX/TAXII feed, upsert ~700 techniques |
| Intel scan | Every 7 days | 10-60 seconds | Scan threat intelligence sources |
| Notification cleanup | Every 24 hours | < 5 seconds | Delete read notifications older than 90 days |
| Coverage snapshot | Weekly (Sunday 00:00) | 5-30 seconds | Capture point-in-time coverage state across all techniques |
| Recurring campaigns | Every 24 hours | < 10 seconds | Check and spawn due recurring test campaigns |

Requirements:

- Jobs must access the same database as the API.
- Jobs must not block API request handling.
- No additional infrastructure should be required beyond what Docker Compose already provides.
- Job failure should not crash the API server.
- Jobs do not need distributed execution (single-server deployment).

### Decision

We chose **APScheduler** (`BackgroundScheduler`) running as an in-process thread within the FastAPI application.

Implementation details:

- The scheduler is started during FastAPI's `lifespan` startup event and shut down on application exit.
- Each job function creates its own `SessionLocal()` instance, independent from request-scoped sessions.
- All jobs use try/except/finally to ensure sessions are closed even on failure.
- Jobs are registered with `replace_existing=True` to handle server restarts cleanly.
- The scheduler is a module-level singleton in `jobs/mitre_sync_job.py`.

### Consequences

**Positive:**

- Zero additional infrastructure: no message broker, no worker containers, no job database.
- Jobs share the same Python process, so they can import services directly (`sync_mitre`, `scan_intel`, `create_snapshot`, etc.) without serialization or RPC.
- Simple debugging: job logs appear in the same stdout as API logs.
- Session isolation per job prevents interference with request-scoped transactions.
- `replace_existing=True` prevents duplicate job registrations on hot reload.

**Negative:**

- **No persistence:** If the server crashes mid-job, the job state is lost. There is no retry mechanism; the job simply runs again at the next scheduled interval.
- **No distributed execution:** Cannot run jobs on a separate worker node. If the API is under heavy load, jobs compete for the same CPU and memory.
- **No dead letter queue:** Failed jobs are logged but not queued for retry. A failed MITRE sync silently waits 24 hours before trying again.
- **No job history:** There is no record of when jobs last ran, how long they took, or whether they succeeded; only log lines exist.
- **Single-instance constraint:** If multiple backend instances are running (horizontal scaling), each instance runs its own scheduler, causing duplicate job execution (double MITRE sync, double snapshots, etc.).
- **No manual trigger via scheduler:** Admin-triggered syncs go through the API endpoints (`/api/v1/system/*`), bypassing the scheduler entirely. There are effectively two paths to the same operations.

**Risks:**

- The single-instance constraint is the most significant risk. If Aegis scales horizontally, APScheduler must be replaced or augmented with a distributed lock (e.g., PostgreSQL advisory locks or Redis-based locking).

### Alternatives Considered

| Alternative | Reason Rejected |
|------------|-----------------|
| **Celery + Redis/RabbitMQ** | Requires an additional broker container (Redis or RabbitMQ), a separate worker process, and Celery configuration. Significant operational overhead for 5 periodic tasks that each run for about 2 minutes or less. Would be justified if job volume grows or horizontal scaling is needed. |
| **Dramatiq + Redis** | Similar to Celery but lighter. Still requires a Redis container and a separate worker process. Same operational overhead concern. |
| **Cron jobs (host-level)** | Would require the host to have cron configured and scripts that call API endpoints or run Python commands inside the container. Breaks the "single Docker Compose" deployment model. Not portable. |
| **PostgreSQL `pg_cron`** | Runs inside the database, limited to SQL operations. Cannot execute Python logic (downloading ZIPs, parsing YAML, upserting with business rules). Would require stored procedures or external triggers. |
| **Kubernetes CronJobs** | Requires Kubernetes. Not applicable to the Docker Compose deployment model (see ADR-004). |
| **APScheduler with JobStore (PostgreSQL)** | APScheduler supports persistent job stores, which would add job persistence with the same library and minimal code change. In APScheduler 3.x, however, a job store shared by several scheduler processes does not by itself prevent duplicate execution, so it would need to be paired with a distributed lock or a single designated scheduler instance. This remains a viable evolution path. **Recommended as the first upgrade when horizontal scaling is needed.** |

---

## ADR Evolution Path

The following table summarizes when each decision should be revisited:

| ADR | Revisit When | Likely Evolution |
|-----|-------------|-----------------|
| ADR-001 (FastAPI) | Stable; no change needed | Add structured logging, OpenTelemetry tracing |
| ADR-002 (PostgreSQL + JSONB) | JSONB query performance degrades | Add GIN indexes on JSONB columns, evaluate moving high-query fields to dedicated columns |
| ADR-003 (MinIO) | Cloud deployment required | Swap boto3 endpoint to AWS S3 / GCS (zero code change) |
| ADR-004 (Docker Compose) | Multi-server deployment needed | Migrate to Kubernetes with Helm charts, or add Ansible playbooks |
| ADR-005 (Modular Monolith) | Team grows > 5 developers, or domains need independent scaling | Enforce boundaries first (Clean Architecture refactor), then extract high-traffic domains as services if needed |
| ADR-006 (APScheduler) | Horizontal scaling required, or jobs need retry/history | Add APScheduler PostgreSQL JobStore first (with a distributed lock for multi-instance setups); migrate to Celery if job complexity grows significantly |
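
The ADR-006 row turns on avoiding duplicate job runs once more than one backend instance exists, and the ADR-006 risks name PostgreSQL advisory locks as one option. The following is a minimal sketch of that idea, not the project's implementation. The helper names `advisory_lock_key` and `single_runner` are hypothetical, and `conn` is assumed to be a DB-API connection to PostgreSQL (for example from psycopg2) dedicated to the lock.

```python
import hashlib
from contextlib import contextmanager


def advisory_lock_key(job_name: str) -> int:
    """Map a job name to a stable signed 64-bit key for pg_try_advisory_lock(bigint)."""
    digest = hashlib.sha256(job_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


@contextmanager
def single_runner(conn, job_name: str):
    """Yield True if this process holds the lock for `job_name`, otherwise False.

    The lock is session-level, so `conn` must stay open for the whole job and
    should not be the connection the job uses for its own transactions.
    """
    key = advisory_lock_key(job_name)
    cur = conn.cursor()
    # Non-blocking attempt: returns false immediately if another session holds the lock.
    cur.execute("SELECT pg_try_advisory_lock(%s)", (key,))
    acquired = bool(cur.fetchone()[0])
    try:
        yield acquired
    finally:
        if acquired:
            cur.execute("SELECT pg_advisory_unlock(%s)", (key,))
        cur.close()


# Hypothetical usage inside a scheduled job:
#
#     with single_runner(lock_conn, "mitre_sync") as is_leader:
#         if is_leader:
#             sync_mitre(...)   # arguments depend on the real service signature
```

With this guard, every instance still fires the job on schedule, but only the instance that obtains the lock does the work; the others skip that run. It does not add retries or job history, which remain separate concerns.
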