Architecture
How XeroOps infrastructure is organized — network topology, instance roles, traffic flow, boot sequence, and deployment pipeline.
Network Topology
All infrastructure lives in a single VPC (10.0.0.0/16) with two subnets. Only the load balancer and WireGuard gateway have public internet access. All other nodes are air-gapped: they have no internet route, and the only external service they reach is S3, through a VPC endpoint.
Internet
│
├─── openresty-lb 10.0.1.10 (EIP) ← HTTPS 443, HTTP 80
│ Public Subnet 10.0.1.0/24
└─── wireguard 10.0.1.11 (EIP) ← UDP 51820 (WireGuard)
Private Subnet 10.0.2.0/24 [air-gapped, no internet]
├─── app1 10.0.2.10
├─── app2 10.0.2.11
├─── management 10.0.2.20
├─── db1 10.0.2.30 + EBS data volume
└─── db2 10.0.2.31 + EBS data volume
VPN Subnet 10.100.0.0/16 (WireGuard clients)
└─── your machine 10.100.0.x ← full private access after VPN connect
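To make the address plan concrete, the following is a minimal Python sketch that maps an IP address to the subnet it belongs to. The CIDR ranges are copied from the diagram above; the `classify` helper and the subnet labels are illustrative names, not part of XeroOps.

```python
import ipaddress

# CIDR ranges taken from the topology diagram; the labels are illustrative.
SUBNETS = {
    "public": ipaddress.ip_network("10.0.1.0/24"),
    "private": ipaddress.ip_network("10.0.2.0/24"),
    "vpn": ipaddress.ip_network("10.100.0.0/16"),
}


def classify(ip: str) -> str:
    """Return the label of the subnet containing ip, or 'unknown'."""
    addr = ipaddress.ip_address(ip)
    for name, network in SUBNETS.items():
        if addr in network:
            return name
    return "unknown"


if __name__ == "__main__":
    for host, ip in [("openresty-lb", "10.0.1.10"), ("db1", "10.0.2.30")]:
        print(f"{host}: {classify(ip)}")
```

A check like this can be handy in scripts that need to decide whether a peer is a VPN client or a cluster node.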
Request Flow
A request from the internet to your application follows this path:
Browser / API client
│
▼
openresty-lb :443 / :80
│ nginx handles TLS termination (certbot-managed Let's Encrypt cert)
│ Lua WAF runs at edge — rate limiting, IP blocking, request filtering
│
├── /api/* ──────────────────────────────────────────────────────────┐
│ nginx proxies to HAProxy │
│ ▼ │
│ management:5001 (HAProxy) │
│ │ auto-configured from service discovery │
│ ├── app1:8000 (FastAPI / uvicorn) │
│ └── app2:8000 (FastAPI / uvicorn) │
│ │
└── / static assets served directly by nginx ◄─────────────────────┘
app1 / app2 connect to:
├── management:5432 (HAProxy → PgBouncer → PostgreSQL primary)
├── management:5433 (HAProxy → PgBouncer → PostgreSQL replica)
└── management redis (session cache, queues, service discovery)
Instance Roles
| Instance | IP | Services |
|---|---|---|
| openresty-lb | 10.0.1.10 | OpenResty (nginx + Lua), certbot (Let's Encrypt TLS), auth_service |
| wireguard | 10.0.1.11 | WireGuard server — VPN gateway for your team |
| app1, app2 | 10.0.2.10–11 | FastAPI (uvicorn), Redis (per-node queues), deployment subscriber |
| management | 10.0.2.20 | HAProxy, Redis (management), service discovery, log search API, ops dashboard |
| db1 | 10.0.2.30 | PostgreSQL 17 (primary), PgBouncer, WAL-G backup |
| db2 | 10.0.2.31 | PostgreSQL 17 (replica), PgBouncer, WAL-G restore test |
Boot Sequence
Each instance boots from a pre-baked AMI. The AMI contains Python 3.13, all services, and the elements library — but no secrets. Secrets are pulled from S3 on every boot.
1. EC2 launches from AMI
2. user-data sets: HOSTNAME, INSTANCE_ROLE, STATIC_IP, /etc/hosts
3. infrastructure-ready.service starts:
a. Downloads instance-config.env from S3
b. Downloads prod-credentials.env from S3
c. Sources both into /etc/environment
4. Role-appropriate services start (from service-ports.json)
5. instance_discovery_service.py registers node in management Redis
6. HAProxy config regenerates automatically (haproxy_config_generator.py)
7. deployment_subscriber.py starts watching Redis for app deployments
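Step 3 combines two downloaded env files into one environment. The sketch below shows one way such a merge could work in Python. It assumes both files use plain `KEY=VALUE` lines and that the file listed later wins on duplicate keys; neither detail is confirmed by this page, and `parse_env` and `merge_env` are hypothetical helpers.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    env: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not an assignment, ignore it
        # Drop surrounding quotes; this does not handle escapes.
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def merge_env(*sources: str) -> dict[str, str]:
    """Merge env files in order; later files override earlier ones (assumed)."""
    merged: dict[str, str] = {}
    for text in sources:
        merged.update(parse_env(text))
    return merged


if __name__ == "__main__":
    config = "INSTANCE_ROLE=app\nLOG_LEVEL=info\n"
    credentials = "# secrets\nDB_PASSWORD='example'\n"
    print(merge_env(config, credentials))
```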
The app services on app1 and app2 run wait-for-db.sh as their ExecStartPre: they poll management:5432 for up to 150 seconds before starting uvicorn, which handles the inevitable boot race between app and DB nodes.
Database Layer
app1/app2
│
▼
management:5432 ──► db1:6432 (PgBouncer) ──► PostgreSQL primary
management:5433 ──► db2:6432 (PgBouncer) ──► PostgreSQL replica
▲
streaming replication
│
db1 → db2
HAProxy routes write traffic (:5432) to the primary and read traffic (:5433) to the replica. PgBouncer on each DB node handles connection pooling: your app maintains a small pool to HAProxy, and PgBouncer multiplexes many app connections onto a few PostgreSQL connections.
WAL-G runs on db1, archiving WAL segments to S3 continuously. db2 runs automated restore tests on a schedule to verify backups are valid.
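From the application's side, the write/read split described above comes down to choosing a port. The following Python sketch picks an endpoint per statement. The ports come from this page; the routing rule (plain `SELECT` goes to the replica, everything else to the primary), the `endpoint_for` and `dsn` helpers, and the example database and user names are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    host: str
    port: int


WRITE = Endpoint("management", 5432)  # HAProxy -> PgBouncer -> primary
READ = Endpoint("management", 5433)   # HAProxy -> PgBouncer -> replica


def endpoint_for(sql: str) -> Endpoint:
    """Send statements that start with SELECT to the replica, the rest to the primary."""
    stripped = sql.strip()
    first_word = stripped.split(None, 1)[0].upper() if stripped else ""
    return READ if first_word == "SELECT" else WRITE


def dsn(endpoint: Endpoint, dbname: str, user: str) -> str:
    """Build a PostgreSQL connection URI for the chosen endpoint."""
    return f"postgresql://{user}@{endpoint.host}:{endpoint.port}/{dbname}"


if __name__ == "__main__":
    print(dsn(endpoint_for("SELECT * FROM users"), "appdb", "app"))
    print(dsn(endpoint_for("UPDATE users SET name = 'x'"), "appdb", "app"))
```

Because replication is asynchronous streaming, a read that must see a write made moments earlier may be safer on the write endpoint. The same applies to locking reads such as `SELECT ... FOR UPDATE`, which this simple rule would misroute.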
Service Discovery
instance_discovery_service.py runs on all nodes. Every 30 seconds it writes health and topology data to management Redis. haproxy_config_generator.py watches that data and regenerates the HAProxy config when the cluster topology changes: new nodes, failed nodes, and role changes all update routing automatically.
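The heartbeat-and-regenerate loop can be sketched in a few lines of Python. This is not the actual code of instance_discovery_service.py or haproxy_config_generator.py. A plain dict stands in for management Redis, the record layout and the three-missed-heartbeats staleness rule are assumptions, and only the 30-second interval and the app port 8000 come from this page.

```python
import time

HEARTBEAT_INTERVAL = 30                 # seconds, from the text above
STALE_AFTER = 3 * HEARTBEAT_INTERVAL    # assumption: three missed beats = dead


def heartbeat(registry: dict, node: str, role: str, ip: str, now=None) -> None:
    """Record that a node is alive (the dict stands in for Redis)."""
    seen = time.time() if now is None else now
    registry[node] = {"role": role, "ip": ip, "seen": seen}


def live_backends(registry: dict, role: str, now=None) -> list:
    """Return sorted (name, ip) pairs for nodes of a role seen recently."""
    now = time.time() if now is None else now
    return sorted(
        (name, rec["ip"])
        for name, rec in registry.items()
        if rec["role"] == role and now - rec["seen"] <= STALE_AFTER
    )


def render_backend(backends: list, port: int = 8000) -> str:
    """Render an HAProxy backend section for the live nodes."""
    lines = ["backend app_nodes"]
    lines += [f"    server {name} {ip}:{port} check" for name, ip in backends]
    return "\n".join(lines)


if __name__ == "__main__":
    reg: dict = {}
    heartbeat(reg, "app1", "app", "10.0.2.10")
    heartbeat(reg, "app2", "app", "10.0.2.11")
    print(render_backend(live_backends(reg, "app")))
```

A generator built this way would rewrite and reload the config only when the rendered text differs from the previous run, so steady-state heartbeats cause no reloads.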
Deployment Pipeline
XeroOps uses a wheel-based deployment system. No GitHub Actions, no container registry.
# On your dev machine
cd your-service/
./build-and-deploy.sh
# build-and-deploy.sh does:
1. python -m build → dist/your_service-1.0.0-py3-none-any.whl
2. aws s3 cp dist/*.whl s3://uploads-{account}/wheels/
3. redis-cli PUBLISH deployment-channel '{"service": "your_service", "version": "1.0.0"}'
# On all app nodes simultaneously:
deployment_subscriber.py receives the message
→ aws s3 cp s3://uploads-{account}/wheels/your_service-1.0.0-py3-none-any.whl .
→ pip install --force-reinstall your_service-1.0.0-py3-none-any.whl
→ systemctl restart your_service
A deployment reaches all nodes in parallel in under 30 seconds.
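The subscriber side of this pipeline can be illustrated by turning one published message into the three commands listed above. The sketch below is not the real deployment_subscriber.py. It only builds the command lists without running them, it assumes pure-Python wheels tagged `py3-none-any` as in the build step, and the `plan_deployment` helper and its validation rules are hypothetical.

```python
import json
import re

# Conservative patterns so message fields cannot smuggle in paths or flags.
NAME_RE = re.compile(r"[A-Za-z0-9_]+")
VERSION_RE = re.compile(r"[A-Za-z0-9.+!]+")


def plan_deployment(message: str, bucket: str) -> list[list[str]]:
    """Translate a deployment message into download, install, restart commands."""
    payload = json.loads(message)
    service = str(payload["service"])
    version = str(payload["version"])
    if not NAME_RE.fullmatch(service) or not VERSION_RE.fullmatch(version):
        raise ValueError("unexpected characters in service or version")
    wheel = f"{service}-{version}-py3-none-any.whl"
    return [
        ["aws", "s3", "cp", f"s3://{bucket}/wheels/{wheel}", wheel],
        ["pip", "install", "--force-reinstall", wheel],
        ["systemctl", "restart", service],
    ]


if __name__ == "__main__":
    msg = '{"service": "your_service", "version": "1.0.0"}'
    for command in plan_deployment(msg, "uploads-123456789012"):
        print(" ".join(command))
```

Building argument lists rather than shell strings means a subscriber could pass each one to `subprocess.run` without invoking a shell, so no part of the message is interpreted as shell syntax.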
TLS — certbot + Let's Encrypt
TLS termination is handled by nginx on openresty-lb. certbot is pre-installed in the AMI and a systemd timer runs certificate renewal automatically. After deploying your cluster, point your domain to the openresty-lb EIP and run sudo certbot --nginx -d yourdomain.com once to obtain the initial certificate.
Lua WAF
The Web Application Firewall runs as Lua code inside OpenResty on the existing load balancer, so it adds no separate charge, unlike AWS WAF, which bills monthly per web ACL and per rule plus a fee per million requests.
The WAF handles rate limiting, IP blocking, and request filtering at the nginx layer — before requests ever reach your FastAPI services. Rules are defined in Lua and fully customizable. Protected routes require a valid session token; public routes (login page, static assets) are whitelisted.
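The real rules are written in Lua and run inside OpenResty. The Python sketch below only illustrates the decision order such a filter might follow: IP block list first, then rate limit, then the public whitelist, then the session check. The fixed-window limiter, the path prefixes, the ordering, and all names here are assumptions, not the shipped rules.

```python
from collections import defaultdict

PUBLIC_PREFIXES = ("/login", "/static/")  # hypothetical whitelist


class FixedWindowLimiter:
    """Allow at most `limit` requests per client within each time window."""

    def __init__(self, limit: int, window_seconds: int) -> None:
        self.limit = limit
        self.window = window_seconds
        self.counts: dict[tuple[str, int], int] = defaultdict(int)

    def allow(self, client_ip: str, now: float) -> bool:
        # Requests are counted per (client, window index) pair.
        key = (client_ip, int(now // self.window))
        self.counts[key] += 1
        return self.counts[key] <= self.limit


def decide(path: str, client_ip: str, has_session: bool,
           limiter: FixedWindowLimiter, blocked: set, now: float) -> str:
    """Return 'block', 'throttle', 'unauthorized', or 'allow' for a request."""
    if client_ip in blocked:
        return "block"
    if not limiter.allow(client_ip, now):
        return "throttle"
    if path.startswith(PUBLIC_PREFIXES):
        return "allow"
    return "allow" if has_session else "unauthorized"


if __name__ == "__main__":
    limiter = FixedWindowLimiter(limit=2, window_seconds=60)
    for t in (0.0, 1.0, 2.0):
        print(t, decide("/api/items", "203.0.113.7", True, limiter, set(), t))
```

This toy limiter never evicts old windows, so its memory grows over time; a long-running filter would need to expire old counters.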
Static Files on S3
All four S3 buckets are named with your AWS account ID as a suffix — globally unique, no collision risk across customers:
| Bucket | Purpose |
|---|---|
| uploads-{account} | Application file storage + config files downloaded from S3 at boot |
| pgdump-{account} | PostgreSQL logical backups (pg_dump) |
| walg-{account} | WAL-G continuous WAL archiving for point-in-time recovery |
| logs-{account} | JSON log files from all nodes, uploaded by cron |
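The naming scheme in the table can be expressed as a small Python helper, which may be useful in tooling that needs to locate the buckets. The prefixes come from the table and AWS account IDs are 12 digits; the `bucket_names` function itself is a hypothetical illustration.

```python
BUCKET_PREFIXES = ("uploads", "pgdump", "walg", "logs")  # from the table above


def bucket_names(account_id: str) -> dict[str, str]:
    """Map each bucket prefix to its full name for the given AWS account ID."""
    if not (account_id.isascii() and account_id.isdigit() and len(account_id) == 12):
        raise ValueError("AWS account IDs are 12 digits")
    return {prefix: f"{prefix}-{account_id}" for prefix in BUCKET_PREFIXES}


if __name__ == "__main__":
    for purpose, name in bucket_names("123456789012").items():
        print(f"{purpose}: s3://{name}")
```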