Documentation

Network Topology

All infrastructure lives in a single VPC (10.0.0.0/16) with two subnets. Only the load balancer and the WireGuard gateway have public internet access. All other nodes are "air-gapped" in the sense that they have no internet route: the only external service they can reach is S3, through a VPC endpoint.

Internet
    │
    ├─── openresty-lb  10.0.1.10  (EIP)   ← HTTPS 443, HTTP 80
    │    Public Subnet 10.0.1.0/24
    └─── wireguard     10.0.1.11  (EIP)   ← UDP 51820 (WireGuard)

    Private Subnet 10.0.2.0/24  [air-gapped, no internet]
    ├─── app1          10.0.2.10
    ├─── app2          10.0.2.11
    ├─── management    10.0.2.20
    ├─── db1           10.0.2.30  + EBS data volume
    └─── db2           10.0.2.31  + EBS data volume

    VPN Subnet 10.100.0.0/16  (WireGuard clients)
    └─── your machine  10.100.0.x  ← full private access after VPN connect
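
To make the address plan concrete, here is a minimal Python sketch that maps an IP address to the zone it belongs to. The CIDR ranges come from the diagram above; the function name `zone_of` and the zone labels are illustrative only and are not part of the platform.

```python
import ipaddress

# CIDR ranges copied from the topology diagram above.
ZONES = {
    "public": ipaddress.ip_network("10.0.1.0/24"),
    "private": ipaddress.ip_network("10.0.2.0/24"),
    "vpn": ipaddress.ip_network("10.100.0.0/16"),
}


def zone_of(address: str) -> str:
    """Return the zone name for an address, or "external" if none matches."""
    ip = ipaddress.ip_address(address)
    for name, network in ZONES.items():
        if ip in network:
            return name
    return "external"
```

For example, `zone_of("10.0.2.30")` places db1 in the private zone, while a VPN client such as 10.100.0.7 falls in the vpn zone.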

Request Flow

A request from the internet to your application follows this path:

Browser / API client
    │
    ▼
openresty-lb  :443 / :80
    │  nginx handles TLS termination (certbot-managed Let's Encrypt cert)
    │  Lua WAF runs at edge — rate limiting, IP blocking, request filtering
    │
    ├── /api/* ──────────────────────────────────────────────────────────┐
    │   nginx proxies to HAProxy                                          │
    │       ▼                                                             │
    │   management:5001  (HAProxy)                                        │
    │       │  auto-configured from service discovery                     │
    │       ├── app1:8000  (FastAPI / uvicorn)                           │
    │       └── app2:8000  (FastAPI / uvicorn)                           │
    │                                                                     │
    └── /  static assets served directly by nginx ◄─────────────────────┘

    app1 / app2  connect to:
    ├── management:5432  (HAProxy → PgBouncer → PostgreSQL primary)
    ├── management:5433  (HAProxy → PgBouncer → PostgreSQL replica)
    └── management redis  (session cache, queues, service discovery)

Instance Roles

Instance      IP            Services
openresty-lb  10.0.1.10     OpenResty (nginx + Lua), certbot (Let's Encrypt TLS), auth_service
wireguard     10.0.1.11     WireGuard server (VPN gateway for your team)
app1, app2    10.0.2.10–11  FastAPI (uvicorn), Redis (per-node queues), deployment subscriber
management    10.0.2.20     HAProxy, Redis (management), service discovery, log search API, ops dashboard
db1           10.0.2.30     PostgreSQL 17 (primary), PgBouncer, WAL-G backup
db2           10.0.2.31     PostgreSQL 17 (replica), PgBouncer, WAL-G restore test

Boot Sequence

Each instance boots from a pre-baked AMI. The AMI contains Python 3.13, all services, and the elements library — but no secrets. Secrets are pulled from S3 on every boot.

1. EC2 launches from AMI
2. user-data sets: HOSTNAME, INSTANCE_ROLE, STATIC_IP, /etc/hosts
3. infrastructure-ready.service starts:
   a. Downloads instance-config.env from S3
   b. Downloads prod-credentials.env from S3
   c. Sources both into /etc/environment
4. Role-appropriate services start (from service-ports.json)
5. instance_discovery_service.py registers node in management Redis
6. HAProxy config regenerates automatically (haproxy_config_generator.py)
7. deployment_subscriber.py starts watching Redis for app deployments
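
Step 3 of the sequence above merges two env files into one environment. The following Python sketch shows one way that merge could work. It assumes that both files use plain KEY=VALUE lines, with optional quotes and an optional `export ` prefix, and that the credentials file wins on conflicts. Neither assumption is confirmed by the platform, and the function names are hypothetical.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks, comments, and malformed lines."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip().removeprefix("export ").strip()
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        values[key] = value
    return values


def merge_env(*sources: str) -> dict[str, str]:
    """Merge env file contents in order; later sources override earlier ones."""
    merged: dict[str, str] = {}
    for text in sources:
        merged.update(parse_env(text))
    return merged
```

Calling `merge_env(instance_config_text, credentials_text)` would yield a single dictionary that could then be written out as one file.
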
Note: auth_service and payment_service use wait-for-db.sh as their ExecStartPre. The script polls management:5432 for up to 150 seconds before uvicorn starts, which handles the inevitable boot race between app and DB nodes.
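
The contents of wait-for-db.sh are not shown here. As a rough illustration of the polling logic it describes, the Python sketch below retries a TCP connection until it succeeds or a deadline passes. The function name `wait_for_port` and the 2-second retry interval are assumptions; only the 150-second limit and the management:5432 target come from the text above.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 150.0,
                  interval: float = 2.0) -> bool:
    """Poll host:port until a TCP connect succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            pass  # refused, unreachable, or timed out: retry until the deadline
        if time.monotonic() >= deadline:
            return False
        time.sleep(min(interval, max(0.0, deadline - time.monotonic())))


# Example (not executed here): wait_for_port("management", 5432)
```

A successful TCP connect only shows that the port is open. A stricter check would also confirm that PostgreSQL accepts queries.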

Database Layer

app1/app2
    │
    ▼
management:5432  ──► db1:6432 (PgBouncer) ──► PostgreSQL primary
management:5433  ──► db2:6432 (PgBouncer) ──► PostgreSQL replica
                              ▲
                    streaming replication
                              │
                         db1 → db2

HAProxy routes write traffic (:5432) to the primary and read traffic (:5433) to the replica. PgBouncer on each DB node handles connection pooling: your app maintains a small pool to HAProxy, and PgBouncer multiplexes many app connections onto a few PostgreSQL connections.

WAL-G runs on db1, archiving WAL segments to S3 continuously. db2 runs automated restore tests on a schedule to verify backups are valid.
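
From the application's side, the read/write split comes down to choosing a port. The Python sketch below shows one way an app might select its connection string. The host and ports come from the diagram above; the `Endpoint` class, the `endpoint_for` helper, and the database and user names in the usage note are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    host: str
    port: int

    def dsn(self, dbname: str, user: str) -> str:
        """Build a libpq-style connection URI for this endpoint."""
        return f"postgresql://{user}@{self.host}:{self.port}/{dbname}"


PRIMARY = Endpoint("management", 5432)  # HAProxy -> PgBouncer -> primary (writes)
REPLICA = Endpoint("management", 5433)  # HAProxy -> PgBouncer -> replica (reads)


def endpoint_for(readonly: bool) -> Endpoint:
    """Route read-only work to the replica and everything else to the primary."""
    return REPLICA if readonly else PRIMARY
```

For instance, `endpoint_for(readonly=True).dsn("appdb", "app")` returns `postgresql://app@management:5433/appdb`. If replication is asynchronous, which is PostgreSQL's default for streaming replication, a read issued immediately after a write may not see that write on the replica, so read-your-own-writes paths are safer on the primary.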

Service Discovery

instance_discovery_service.py runs on all nodes. Every 30 seconds it writes health and topology data to management Redis. haproxy_config_generator.py watches that data and regenerates the HAProxy config when the cluster topology changes — new nodes, failed nodes, role changes all update routing automatically.
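
The internals of haproxy_config_generator.py are not documented here, but the idea can be sketched as a pure function that turns node records into an HAProxy backend section. Everything about the record layout (`hostname`, `ip`, `role`, `healthy`, `last_seen`), the backend name, and the 90-second staleness threshold is an assumption made for illustration. Only the 30-second heartbeat and port 8000 come from this page.

```python
from typing import Iterable, Mapping

STALE_AFTER = 90.0  # seconds; hypothetical: three missed 30-second heartbeats


def render_backend(nodes: Iterable[Mapping], now: float,
                   name: str = "app_backend", port: int = 8000) -> str:
    """Render an HAProxy backend listing only healthy, recently seen app nodes."""
    lines = [f"backend {name}", "    balance roundrobin"]
    for node in sorted(nodes, key=lambda n: n["hostname"]):
        fresh = now - node["last_seen"] <= STALE_AFTER
        if node["role"] == "app" and node["healthy"] and fresh:
            lines.append(
                f"    server {node['hostname']} {node['ip']}:{port} check"
            )
    return "\n".join(lines) + "\n"
```

A generator built this way can compare the rendered text with the previous version and reload HAProxy only when the text differs, which avoids needless reloads on every heartbeat.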

Deployment Pipeline

XeroOps uses a wheel-based deployment system. No GitHub Actions, no container registry.

# On your dev machine
cd your-service/
./build-and-deploy.sh

# build-and-deploy.sh does:
1. python -m build  →  dist/your_service-1.0.0-py3-none-any.whl
2. aws s3 cp dist/*.whl s3://uploads-{account}/wheels/
3. redis PUBLISH deployment-channel '{"service": "your_service", "version": "1.0.0"}'

# On all app nodes simultaneously:
deployment_subscriber.py receives the message
→  aws s3 cp s3://uploads-{account}/wheels/your_service-*.whl .
→  pip install --force-reinstall your_service-*.whl
→  systemctl restart your_service

A deployment reaches all nodes in parallel in under 30 seconds.
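
The subscriber side of this pipeline can be sketched in Python as a function that turns one published message into the three commands listed above. This is not the actual deployment_subscriber.py. The function name `plan_deployment`, the input validation, and the choice to build the exact wheel filename from the version (rather than use a wildcard) are assumptions.

```python
import json
import re

SAFE_NAME = re.compile(r"^[A-Za-z0-9_]+$")
SAFE_VERSION = re.compile(r"^[A-Za-z0-9.]+$")


def plan_deployment(message: str, account: str) -> list[list[str]]:
    """Translate a deployment message into download, install, restart commands."""
    payload = json.loads(message)
    service, version = payload["service"], payload["version"]
    # Reject anything that could smuggle extra arguments or path segments.
    if not SAFE_NAME.match(service) or not SAFE_VERSION.match(version):
        raise ValueError("unexpected characters in service or version")
    wheel = f"{service}-{version}-py3-none-any.whl"
    return [
        ["aws", "s3", "cp", f"s3://uploads-{account}/wheels/{wheel}", wheel],
        ["pip", "install", "--force-reinstall", wheel],
        ["systemctl", "restart", service],
    ]
```

Returning argument lists instead of shell strings lets the caller pass each command to `subprocess.run(cmd, check=True)` without a shell, so a failed download stops the sequence before the service is restarted.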

TLS — certbot + Let's Encrypt

TLS termination is handled by nginx on openresty-lb. certbot is pre-installed in the AMI and a systemd timer runs certificate renewal automatically. After deploying your cluster, point your domain to the openresty-lb EIP and run sudo certbot --nginx -d yourdomain.com once to obtain the initial certificate.

Lua WAF

The Web Application Firewall runs as Lua code inside OpenResty, so it adds no per-rule or per-request charges. AWS WAF, by comparison, bills a monthly fee per web ACL and per rule, plus a fee per million requests.

The WAF handles rate limiting, IP blocking, and request filtering at the nginx layer — before requests ever reach your FastAPI services. Rules are defined in Lua and fully customizable. Protected routes require a valid session token; public routes (login page, static assets) are whitelisted.
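
The production WAF is written in Lua, and its rules are not reproduced here. To illustrate the rate-limiting idea only, the Python sketch below implements a fixed-window counter per client IP. The choice of algorithm, the class name, and the limits in the usage note are all assumptions, since this page does not say which strategy the WAF uses.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """Allow at most `limit` requests per client in each `window` seconds."""

    def __init__(self, limit: int, window: float) -> None:
        self.limit = limit
        self.window = window
        self._counts: dict[tuple[str, int], int] = defaultdict(int)

    def allow(self, client_ip: str, now: float) -> bool:
        # Requests in the same window share a bucket keyed by (ip, window index).
        bucket = (client_ip, int(now // self.window))
        self._counts[bucket] += 1
        return self._counts[bucket] <= self.limit
```

With `FixedWindowLimiter(limit=2, window=60)`, a third request from the same IP inside one minute is rejected, and the count resets when the next window begins. This sketch never evicts old buckets. A real limiter would expire them, for example by storing counters in a shared store with a TTL.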

Note: Why not AWS WAF? At scale, AWS WAF costs add up fast. The Lua WAF runs in-process with nginx, with sub-millisecond overhead, no per-request billing, and full customization in code.

Static Files on S3

All four S3 buckets are named with your AWS account ID as a suffix — globally unique, no collision risk across customers:

Bucket             Purpose
uploads-{account}  Application file storage + S3 config files on first boot
pgdump-{account}   PostgreSQL logical backups (pg_dump)
walg-{account}     WAL-G continuous WAL archiving for point-in-time recovery
logs-{account}     JSON log files from all nodes, uploaded by cron
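
The naming scheme above is simple enough to express as a small helper. This Python sketch derives the four bucket names from an account ID. The function name `bucket_names` is hypothetical, and the check relies on AWS account IDs being 12-digit numbers.

```python
BUCKET_PREFIXES = ("uploads", "pgdump", "walg", "logs")


def bucket_names(account_id: str) -> dict[str, str]:
    """Map each bucket prefix to its full name for the given AWS account ID."""
    if not (account_id.isascii() and account_id.isdigit() and len(account_id) == 12):
        raise ValueError("AWS account IDs are 12-digit numbers")
    return {prefix: f"{prefix}-{account_id}" for prefix in BUCKET_PREFIXES}
```

For account 123456789012, the WAL archive bucket would be `walg-123456789012`.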