XeroOps Docs — Architecture

Network Topology

All infrastructure lives in a single VPC (10.0.0.0/16) with two subnets. Only the load balancer and the WireGuard gateway have public internet access. All other nodes are air-gapped: they have no route to the internet, and the only external service they reach is S3, through a VPC endpoint.

Internet
    │
    ├─── openresty-lb  10.0.1.10  (EIP)   ← HTTPS 443, HTTP 80
    │    Public Subnet 10.0.1.0/24
    └─── wireguard     10.0.1.11  (EIP)   ← UDP 51820 (WireGuard)

    Private Subnet 10.0.2.0/24  [air-gapped, no internet]
    ├─── app1          10.0.2.10
    ├─── app2          10.0.2.11
    ├─── management    10.0.2.20
    ├─── db1           10.0.2.30  + EBS data volume
    └─── db2           10.0.2.31  + EBS data volume

    VPN Subnet 10.100.0.0/16  (WireGuard clients)
    └─── your machine  10.100.0.x  ← full private access after VPN connect
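
The address plan above can be checked mechanically. The following is a minimal sketch using Python's standard ipaddress module; the classify helper and the zone labels are hypothetical names introduced for illustration and are not part of XeroOps.

```python
import ipaddress

# Address ranges copied from the topology diagram above.
ZONES = {
    "public": ipaddress.ip_network("10.0.1.0/24"),
    "private": ipaddress.ip_network("10.0.2.0/24"),
    "vpn": ipaddress.ip_network("10.100.0.0/16"),
}


def classify(ip: str) -> str:
    """Return the zone name for an address, or "unknown" if none matches."""
    addr = ipaddress.ip_address(ip)
    for name, network in ZONES.items():
        if addr in network:
            return name
    return "unknown"
```

For example, classify("10.0.2.30") returns "private" (db1) and classify("10.0.1.11") returns "public" (the WireGuard gateway). Note that the VPN client range sits outside the VPC CIDR, so it is an address pool for WireGuard peers rather than a third VPC subnet.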

Request Flow

A request from the internet to your application follows this path:

Browser / API client
    │
    ▼
openresty-lb  :443 / :80
    │  nginx handles TLS termination (certbot-managed Let's Encrypt cert)
    │  Lua WAF runs at edge — rate limiting, IP blocking, request filtering
    │
    ├── /api/* ──────────────────────────────────────────────────────────┐
    │   nginx proxies to HAProxy                                          │
    │       ▼                                                             │
    │   management:5001  (HAProxy)                                        │
    │       │  auto-configured from service discovery                     │
    │       ├── app1:8000  (FastAPI / uvicorn)                           │
    │       └── app2:8000  (FastAPI / uvicorn)                           │
    │                                                                     │
    └── /  static assets served directly by nginx ◄─────────────────────┘

    app1 / app2  connect to:
    ├── management:5432  (HAProxy → PgBouncer → PostgreSQL primary)
    ├── management:5433  (HAProxy → PgBouncer → PostgreSQL replica)
    └── management redis  (session cache, queues, service discovery)
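
The path split in the diagram can be written down as a tiny routing function. This is a sketch only: the real routing lives in the nginx configuration on openresty-lb, and upstream_for and the "static" label are hypothetical names.

```python
def upstream_for(path: str) -> str:
    """Pick the upstream for a request path, mirroring the diagram above."""
    if path.startswith("/api/"):
        # Proxied to HAProxy, which balances across app1:8000 and app2:8000.
        return "management:5001"
    # Everything else is served directly by nginx as a static asset.
    return "static"
```

A path such as /apiary does not match the /api/ prefix and is therefore treated as a static asset in this sketch.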

Instance Roles

Instance	IP	Services
openresty-lb	10.0.1.10	OpenResty (nginx + Lua), certbot (Let's Encrypt TLS), auth_service
wireguard	10.0.1.11	WireGuard server — VPN gateway for your team
app1, app2	10.0.2.10–11	FastAPI (uvicorn), Redis (per-node queues), deployment subscriber
management	10.0.2.20	HAProxy, Redis (management), service discovery, log search API, ops dashboard
db1	10.0.2.30	PostgreSQL 17 (primary), PgBouncer, WAL-G backup
db2	10.0.2.31	PostgreSQL 17 (replica), PgBouncer, WAL-G restore test

Boot Sequence

Each instance boots from a pre-baked AMI. The AMI contains Python 3.13, all services, and the elements library — but no secrets. Secrets are pulled from S3 on every boot.

1. EC2 launches from AMI
2. user-data sets: HOSTNAME, INSTANCE_ROLE, STATIC_IP, /etc/hosts
3. infrastructure-ready.service starts:
   a. Downloads instance-config.env from S3
   b. Downloads prod-credentials.env from S3
   c. Sources both into /etc/environment
4. Role-appropriate services start (from service-ports.json)
5. Instance discovery service registers the node in management Redis
6. HAProxy config regenerates automatically
7. Deployment subscriber starts watching Redis for app deployments
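
Step 3 can be pictured as parsing two env files and merging them. The sketch below assumes both files use plain KEY=VALUE lines and that the credentials file wins on conflicts; neither detail is stated in the source, and the key names in the usage example are made up.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding double quotes from the value.
        values[key.strip()] = value.strip().strip('"')
    return values


def merge_env(*sources: str) -> dict[str, str]:
    """Merge env files in order; later files override earlier ones (assumed)."""
    merged: dict[str, str] = {}
    for text in sources:
        merged.update(parse_env(text))
    return merged
```

Calling merge_env(instance_config_text, prod_credentials_text) yields one dictionary that could then be written out in the format /etc/environment expects.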

ℹ️

auth_service and payment_service use wait-for-db.sh as their ExecStartPre — they poll management:5432 for up to 150 seconds before starting uvicorn, handling the inevitable boot race between app and DB nodes.
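
The polling idea behind wait-for-db.sh can be sketched in Python. This is not the script's actual contents: wait_for and tcp_open are hypothetical helpers, the 2-second interval is an assumption, and a successful TCP connect only shows that the port is open, not that PostgreSQL is ready to serve queries.

```python
import socket
import time
from collections.abc import Callable


def tcp_open(host: str, port: int, connect_timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=connect_timeout):
            return True
    except OSError:
        return False


def wait_for(check: Callable[[], bool], timeout: float = 150.0,
             interval: float = 2.0,
             sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call check() until it returns True or the timeout has elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

A caller would use it as wait_for(lambda: tcp_open("management", 5432)) and refuse to start the service if the result is False.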

Database Layer

app1/app2
    │
    ▼
management:5432  ──► db1:6432 (PgBouncer) ──► PostgreSQL primary
management:5433  ──► db2:6432 (PgBouncer) ──► PostgreSQL replica
                              ▲
                    streaming replication
                              │
                         db1 → db2

HAProxy routes write traffic (:5432) to the primary and read traffic (:5433) to the replica. PgBouncer on each DB node handles connection pooling: your app maintains a small pool to HAProxy, and PgBouncer multiplexes many app connections into a few PostgreSQL connections.
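
From the application's side, the read/write split comes down to choosing a port. Below is a minimal sketch; the dsn helper, the database name, and the user name are hypothetical, and only the host and the two ports come from the text above.

```python
WRITE_PORT = 5432  # HAProxy -> PgBouncer on db1 -> PostgreSQL primary
READ_PORT = 5433   # HAProxy -> PgBouncer on db2 -> PostgreSQL replica


def dsn(database: str, user: str, *, readonly: bool = False,
        host: str = "management") -> str:
    """Build a PostgreSQL connection URI for the write or read endpoint."""
    port = READ_PORT if readonly else WRITE_PORT
    return f"postgresql://{user}@{host}:{port}/{database}"
```

Because the replica is fed by streaming replication and can lag behind the primary, a read that must observe a write made moments earlier is safer on the write endpoint.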

WAL-G runs on db1, archiving WAL segments to S3 continuously. db2 runs automated restore tests on a schedule to verify backups are valid.

A monitoring job runs on both db nodes every few minutes, checking PostgreSQL availability, replication lag, and backup freshness (both WAL-G and pg_dump). Results are written to management Redis and surfaced on the ops dashboard — repeated failures (not single blips) trigger an alert.
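
The "repeated failures, not single blips" rule can be modeled as a counter of consecutive failed checks. The sketch below is an assumption about how such a rule might look: FailureTracker is a hypothetical name and the threshold of 3 is invented, since the source does not give a number.

```python
from dataclasses import dataclass


@dataclass
class FailureTracker:
    threshold: int = 3    # assumed; the source only says "repeated failures"
    consecutive: int = 0  # failed checks since the last success

    def record(self, ok: bool) -> bool:
        """Record one check result; return True when an alert should fire."""
        if ok:
            self.consecutive = 0
            return False
        self.consecutive += 1
        # Fire exactly once, at the moment the threshold is reached.
        return self.consecutive == self.threshold
```

A single failure followed by a success resets the counter, so an isolated blip never alerts, while an outage alerts once instead of on every subsequent check.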

Service Discovery

An instance discovery service runs on all nodes. Every 30 seconds it writes health and topology data to management Redis. The HAProxy config generator watches that data and regenerates the HAProxy config when the cluster topology changes — new nodes, failed nodes, role changes all update routing automatically.
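
The regeneration step can be sketched as a pure function from discovery data to an HAProxy backend section. A plain dictionary stands in for management Redis here; the record fields (role, ip, last_seen), the backend name, and the 90-second staleness cutoff are all assumptions, not the actual XeroOps schema.

```python
HEARTBEAT_INTERVAL = 30               # seconds, from the text above
STALE_AFTER = 3 * HEARTBEAT_INTERVAL  # assumed cutoff for treating a node as failed


def live_nodes(registry: dict[str, dict], role: str, now: float) -> list[str]:
    """Names of nodes with the given role whose heartbeat is recent enough."""
    return sorted(
        name for name, info in registry.items()
        if info["role"] == role and now - info["last_seen"] <= STALE_AFTER
    )


def render_backend(registry: dict[str, dict], now: float,
                   role: str = "app", port: int = 8000) -> str:
    """Render an HAProxy backend section listing the live nodes for a role."""
    lines = [f"backend {role}_nodes"]
    for name in live_nodes(registry, role, now):
        lines.append(f"    server {name} {registry[name]['ip']}:{port} check")
    return "\n".join(lines)
```

A generator built this way can compare the rendered text with the previous output and reload HAProxy only when the two differ, which keeps routine heartbeats from causing reloads.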

Ops Dashboard — System Command Center

Every deployment includes a built-in admin dashboard — no extra setup, nothing to install. Once connected via WireGuard, open http://10.0.1.10:8080 in your browser. It's a single page that gives you:

Live cluster health — every node's status, CPU/memory/disk, and replication lag, updated in real time over a WebSocket connection.
Backup verification — pass/fail and age for the most recent WAL-G and pg_dump restore tests, so you know your backups actually restore, not just that they ran.
Distributed log search — full-text search across the JSON logs from every node, indexed in management Redis.
Domain & permission registry — register the domains your services run under and define the permissions used by require_permission() in the elements SDK.
User & role management — browse users, see their current tier per domain, and grant or revoke roles and individual permissions.

ℹ️

The dashboard talks to auth_service and the management node directly — it's only reachable over the WireGuard VPN, not from the public internet.

Deployment Pipeline

XeroOps uses a wheel-based deployment system. No GitHub Actions, no container registry. Build your service into a Python wheel any way you like, upload it to your S3 uploads bucket, then publish one message to a Redis channel — every node running that service picks it up automatically.

# 1. Build your wheel (your own build process — e.g. python -m build)
# 2. Upload it to your uploads bucket
aws s3 cp dist/your_service-1.0.0-*.whl s3://uploads-{account}/your_service/

# 3. Publish a deployment message — every node picks it up
redis-cli -h management PUBLISH deployments '{
    "service": "your_service",
    "version": "1.0.0",
    "filename": "your_service-1.0.0-cp313-cp313-linux_x86_64.whl",
    "url": "s3://uploads-{account}/your_service/your_service-1.0.0-cp313-cp313-linux_x86_64.whl"
}'

# On every node running this service, the deployment subscriber:
#   →  receives the message
#   →  downloads the wheel from the given S3 url
#   →  runs pip install --force-reinstall --no-deps <wheel>
#   →  installs any systemd unit files bundled with the wheel
#   →  restarts the service

A deployment reaches all nodes in parallel in under 30 seconds. Each node also checks for the latest deployed versions on every boot, so a freshly launched instance automatically catches up on every service it is responsible for.
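
On the receiving side, a subscriber has to validate the message and turn it into concrete actions. The sketch below covers only that parsing step, under the assumption that all four fields shown above are required; parse_deployment and install_command are hypothetical names, and the real subscriber may behave differently.

```python
import json
from urllib.parse import urlsplit

REQUIRED_KEYS = {"service", "version", "filename", "url"}


def parse_deployment(payload: str) -> dict[str, str]:
    """Validate a deployment message and split its S3 URL into bucket and key."""
    message = json.loads(payload)
    missing = REQUIRED_KEYS - set(message)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    parts = urlsplit(message["url"])
    if parts.scheme != "s3" or not parts.path.endswith(message["filename"]):
        raise ValueError("url must be an s3:// URL ending in filename")
    message["bucket"] = parts.netloc
    message["key"] = parts.path.lstrip("/")
    return message


def install_command(wheel_path: str) -> list[str]:
    """The pip invocation described above, as an argument list."""
    return ["pip", "install", "--force-reinstall", "--no-deps", wheel_path]
```

Rejecting malformed messages before any download happens keeps a typo in a published message from triggering a partial install on every node.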

Authentication & Multi-Domain Routing

auth_service runs on openresty-lb and is the single front door for login across every domain your cluster serves. It supports two login methods — Google OAuth and passwordless magic-link emails — both protected by reCAPTCHA v3 before a request ever reaches the database.

A cluster can serve multiple customer-facing domains from the same backend. Each domain is registered in the database with its own roles and permissions, so the same user can hold different access levels on different domains. Incoming requests are matched to a domain via the Host header, and that domain is what role/permission checks are scoped to.
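
Domain-scoped checks can be illustrated with a small in-memory model. This is a sketch of the concept only, not the elements SDK or the PostgreSQL functions: the GRANTS table, the user and domain names, and the permission strings are invented, and the Host parsing assumes plain hostnames rather than IPv6 literals.

```python
def domain_from_host(host_header: str) -> str:
    """Normalize a Host header: lowercase it and drop any :port suffix."""
    return host_header.strip().lower().split(":", 1)[0]


# Hypothetical stand-in for the per-domain grants stored in the database.
GRANTS = {
    ("alice", "shop.example.com"): {"orders:read", "orders:write"},
    ("alice", "blog.example.com"): {"posts:read"},
}


def has_permission(user: str, host_header: str, permission: str) -> bool:
    """Check a permission for a user, scoped to the request's domain."""
    domain = domain_from_host(host_header)
    return permission in GRANTS.get((user, domain), set())
```

Here the same user can write orders on shop.example.com but holds only a read permission on blog.example.com, which is the kind of per-domain difference described above.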

On successful login, auth_service writes a session to session Redis (on openresty-lb, DB 1) and the user's browser receives a session cookie. Role and permission checks for subsequent requests — covered in the elements SDK reference — read from this session and from PostgreSQL functions, not from any external identity provider.

TLS — certbot + Let's Encrypt

TLS termination is handled by nginx on openresty-lb. certbot is pre-installed in the AMI and a systemd timer runs certificate renewal automatically. After deploying your cluster, point your domain to the openresty-lb EIP and run sudo certbot --nginx -d yourdomain.com once to obtain the initial certificate.

Lua WAF

The Web Application Firewall runs as Lua code inside OpenResty, so it adds no per-rule or per-request charges. AWS WAF, by comparison, bills a monthly fee per web ACL and per rule plus a fee per million requests.

The WAF handles rate limiting, IP blocking, and request filtering at the nginx layer — before requests ever reach your FastAPI services. Rules are defined in Lua and fully customizable. Protected routes require a valid session token; public routes (login page, static assets) are whitelisted.
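
The rate-limiting part can be illustrated with a sliding-window counter per client IP. The actual WAF is written in Lua and its algorithm is not described here, so the Python below is a generic sketch of the technique; SlidingWindowLimiter and its parameters are hypothetical.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client within any `window` seconds."""

    def __init__(self, limit: int, window: float) -> None:
        self.limit = limit
        self.window = window
        self._hits = defaultdict(deque)  # client IP -> timestamps of allowed requests

    def allow(self, client_ip: str, now: float) -> bool:
        hits = self._hits[client_ip]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

Passing the current time in as an argument keeps the limiter deterministic and easy to test; a caller would normally supply time.monotonic().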

ℹ️

Why not AWS WAF? At scale, AWS WAF costs add up fast. The Lua WAF runs in-process with nginx — sub-millisecond overhead, no per-request billing, fully customizable in code.

Static Files on S3

All four S3 buckets are named with your AWS account ID as a suffix — globally unique, no collision risk across customers:

Bucket	Purpose
uploads-{account}	Application file storage + config files downloaded from S3 at boot
pgdump-{account}	PostgreSQL logical backups (pg_dump)
walg-{account}	WAL-G continuous WAL archiving for point-in-time recovery
logs-{account}	JSON log files from all nodes, uploaded by cron
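
The naming scheme in the table can be generated from an account ID. A minimal sketch follows; bucket_names is a hypothetical helper, and the only outside fact it relies on is that AWS account IDs are 12 digits long.

```python
BUCKET_PREFIXES = ("uploads", "pgdump", "walg", "logs")


def bucket_names(account_id: str) -> dict[str, str]:
    """Map each bucket prefix from the table above to its full bucket name."""
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("AWS account IDs are 12 digits")
    return {prefix: f"{prefix}-{account_id}" for prefix in BUCKET_PREFIXES}
```

For a placeholder account ID of 123456789012, bucket_names returns names such as uploads-123456789012 and walg-123456789012.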