Production Runbook
Operational procedures for maintaining and recovering ZERG in production. This runbook covers upgrades, backups, incident response, and monitoring.
Rolling Upgrades
Automated Deployment
The deploy script at deployment/scripts/deploy.sh handles the full lifecycle:
./deployment/scripts/deploy.sh root@blue nonsense.ws api.nonsense.ws
The script performs these steps in order:
- Installs system dependencies and Erlang 28 via asdf
- Sets up PostgreSQL databases
- Builds Sol on the target host (avoids a glibc mismatch between the build and target machines)
- Deploys Luna, Mango, and web frontends
- Backs up the current release
- Writes production configs and generates secrets
- Runs health checks with automatic rollback on failure
Manual Rolling Upgrade
To upgrade individual beam files without a full redeploy:
scp _build/prod/lib/sol/ebin/sol_http.beam zerg@blue:/opt/zerg/lib/sol-{VERSION}/ebin/
ssh root@blue "systemctl restart zerg-sol"
For a full release upgrade:
cd server
rebar3 as prod release
scp -r _build/prod/rel/sol/* zerg@blue:/opt/zerg/
ssh root@blue "systemctl restart zerg-sol"
Health Check Verification
After any upgrade, verify health:
curl -sf -o /dev/null -w '%{http_code}' http://127.0.0.1:21434/api/v1/ready
The deploy script retries 3 times with 5-second intervals.
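The retry behaviour can be reproduced by hand after a manual upgrade. The following is a minimal sketch, not the deploy script's actual code: the function name wait_for_ready and its arguments are hypothetical, and only the URL, the retry count, and the interval come from this runbook.

```shell
# Poll the readiness endpoint until it answers successfully.
# Arguments (all optional): URL, number of attempts, seconds between attempts.
# Hypothetical helper; defaults mirror the values stated in this runbook.
wait_for_ready() {
    local url="${1:-http://127.0.0.1:21434/api/v1/ready}"
    local attempts="${2:-3}"
    local interval="${3:-5}"
    local i=1
    while [ "$i" -le "$attempts" ]; do
        # -f makes curl exit non-zero on HTTP errors (status >= 400).
        if curl -sf -o /dev/null "$url"; then
            echo "ready after $i attempt(s)"
            return 0
        fi
        echo "attempt $i/$attempts failed" >&2
        # Sleep only between attempts, not after the last one.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$interval"
        fi
        i=$((i + 1))
    done
    return 1
}

# Usage: wait_for_ready || echo "Sol did not become ready"
```

A non-zero return is the signal to start the rollback described in the next section.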
Automatic Rollback
If health checks fail after deployment, the script rolls back automatically:
LATEST_BACKUP=$(ls -1dt /opt/zerg/backups/*/ | head -1)
systemctl stop zerg-sol zerg-mango
rm -rf /opt/zerg/releases /opt/zerg/lib /opt/zerg/etc
cp -a $LATEST_BACKUP/* /opt/zerg/
systemctl start zerg-mango
sleep 3
systemctl start zerg-sol
The script keeps the last 3 backups and prunes older ones.
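Pruning can be sketched as follows. This is an illustration under stated assumptions rather than the script's real implementation: prune_backups is a hypothetical name, backups are assumed to be directories directly under /opt/zerg/backups, and directory names are assumed to contain no newlines.

```shell
# Delete all but the newest KEEP backup directories under BACKUP_DIR.
# Hypothetical helper; the default path and count follow this runbook.
prune_backups() {
    local backup_dir="${1:-/opt/zerg/backups}"
    local keep="${2:-3}"
    # List directories newest first, skip the first KEEP entries,
    # and remove whatever is left.
    ls -1dt "$backup_dir"/*/ 2>/dev/null \
        | tail -n +"$((keep + 1))" \
        | while IFS= read -r old; do
            rm -rf -- "$old"
        done
}

# Usage: prune_backups /opt/zerg/backups 3
```

Sorting by modification time matches how the rollback step picks LATEST_BACKUP, so the directory it would restore from is never among those removed.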
Mnesia Backup and Restore
Create a Backup
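ssh root@blue "su - zerg -c '/opt/zerg/bin/sol remote'"
Then, in the remote Erlang shell, run the backup. The Erlang shell does not expand shell substitutions such as $(date ...), so type the timestamp literally:
mnesia:backup("/opt/zerg/data/mnesia_backups/backup-20260520120000").
Restore from Backup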
mnesia:restore("/opt/zerg/data/mnesia_backups/backup-20260520120000", []).
Verify Mnesia Status
mnesia:system_info(tables).
mnesia:table_info(sol_events, size).
ZMQ Worker Management
Check Worker Status
curl -s http://127.0.0.1:21434/api/v1/zmq/workers | jq .
curl -s http://127.0.0.1:21434/api/v1/zmq/status | jq .
Restart a Luna Worker
Workers are managed by systemd on the production host:
ssh root@blue "systemctl restart zerg-luna-worker"
Workers reconnect automatically via ZMQ. No Sol restart needed.
Worker Connectivity Issues
If workers show as disconnected:
- Check the worker process is running:
  systemctl status zerg-luna-worker
- Check ZMQ gateway status:
  curl -s http://127.0.0.1:21434/api/v1/zmq/status
- Verify port 5555 is listening:
  ss -tlnp | grep 5555
- Check worker logs:
  journalctl -u zerg-luna-worker -n 50
- Restart the worker:
  systemctl restart zerg-luna-worker
Workers send heartbeat messages and re-register automatically on reconnect.
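The first three checks above can be chained so that the first failure is reported directly. This is a rough sketch: check_luna_worker is a hypothetical name, and it assumes the gateway status endpoint returns a non-error HTTP status when healthy, which this runbook does not state explicitly.

```shell
# Run the connectivity checks in order and report the first one that fails.
# Hypothetical helper built from the commands listed in this runbook.
check_luna_worker() {
    if ! systemctl is-active --quiet zerg-luna-worker; then
        echo "worker service is not running"
        return 1
    fi
    # Assumption: a healthy gateway answers with a non-error HTTP status.
    if ! curl -sf -o /dev/null http://127.0.0.1:21434/api/v1/zmq/status; then
        echo "ZMQ gateway status endpoint is not responding"
        return 1
    fi
    # Match ":5555" at the end of the local-address column of ss output.
    if ! ss -tln | grep -Eq ':5555[[:space:]]'; then
        echo "nothing is listening on port 5555"
        return 1
    fi
    echo "worker checks passed"
}

# Usage: check_luna_worker || journalctl -u zerg-luna-worker -n 50
```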
Health Check Procedures
Endpoint Health
curl -s http://127.0.0.1:21434/health | jq .
The health endpoint checks 20+ ETS tables, disk space, memory, and process counts.
Service Health
systemctl status zerg-sol zerg-mango nginx postgresql
Full System Check
zerg health
zerg discover
zerg zmq-workers
zerg models
Common Incident Response
Sol Continuously Restarting
Symptom: nginx 502 errors every few minutes.
Cause: systemd Type=forking with no PID file, or stale beam.smp processes.
systemctl stop zerg-sol
pkill -u zerg beam.smp || true
pkill epmd || true
systemctl start zerg-sol
Ensure the systemd unit uses Type=simple with ExecStart=/opt/zerg/bin/sol foreground.
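The unit type can be confirmed without opening the unit file. A small sketch, assuming a systemd version that supports the --value flag of systemctl show; unit_type_ok is a hypothetical helper name.

```shell
# Report whether a unit is configured with Type=simple.
# Hypothetical helper; the unit name defaults to zerg-sol.
unit_type_ok() {
    local unit="${1:-zerg-sol}"
    local unit_type
    # --value prints only the property value, without the "Type=" prefix.
    unit_type="$(systemctl show -p Type --value "$unit")"
    if [ "$unit_type" = "simple" ]; then
        echo "$unit: Type=simple"
    else
        echo "$unit: Type=$unit_type (expected simple)"
        return 1
    fi
}

# Usage: unit_type_ok zerg-sol
```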
Mango Auth Failures
systemctl status zerg-mango
journalctl -u zerg-mango -n 100
curl -s http://127.0.0.1:5800/health
Check that the Mango DB URL uses an absolute path: sqlite:///opt/zerg/mango/data/mango.db.
Inference Out of Memory
curl -s http://127.0.0.1:21434/api/v1/inference/status | jq .
Reduce the model's context length or switch to a smaller model. The systemd memory limit is 4 GB by default.
ETS Table Corruption
ets:info(sol_models).
ets:info(sol_global_catalog).
If tables are missing, restart Sol. ETS tables created with the {heir, Pid, HeirData} option are handed to the heir process when their owner crashes, so they survive the crash.
Monitoring
Prometheus Metrics
curl -s http://127.0.0.1:21434/metrics
Enable with SOL_PROMETHEUS_ENABLED=true. Key metric families:
| Metric | Description |
|---|---|
| sol_http_requests_total | HTTP request counter by method/path/status |
| sol_inference_duration_seconds | Inference latency histogram |
| sol_zmq_workers_connected | Currently connected ZMQ workers |
| sol_zmq_tasks_dispatched_total | Tasks dispatched to workers |
| sol_scheduler_jobs_active | Active scheduled jobs |
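For a quick look at one metric without Prometheus, the text output can be filtered on the command line. The sketch below sums all samples of one metric across label sets. metric_sum is a hypothetical helper, and it assumes the endpoint emits the standard Prometheus text format without trailing timestamps.

```shell
# Sum every sample of one metric read from stdin (Prometheus text format).
# Hypothetical helper; assumes sample lines end with the value, no timestamp.
metric_sum() {
    awk -v name="$1" '
        /^#/ { next }                      # skip HELP and TYPE comments
        {
            metric = $1
            sub(/[{].*/, "", metric)       # strip the label set, if any
            if (metric == name) { total += $NF; found = 1 }
        }
        END { if (found) print total; else exit 1 }
    '
}

# Usage:
#   curl -s http://127.0.0.1:21434/metrics | metric_sum sol_zmq_workers_connected
```

The function exits non-zero when the metric is absent, which makes it easy to tell a missing metric apart from a value of zero.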
SLO Endpoint
curl -s http://127.0.0.1:21434/api/v1/slo | jq .
Grafana Dashboard
Grafana is available when using the monitoring Docker Compose profile:
cd infra
docker compose --profile monitoring up -d
Configure the Prometheus datasource pointing to http://sol:21434/metrics.
Maintenance Mode
Enable maintenance mode to drain traffic gracefully:
curl -X POST http://127.0.0.1:21434/api/v1/infra/maintenance/enable \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"reason": "scheduled maintenance"}'
Health and auth endpoints remain accessible during maintenance.
Disable when complete:
curl -X POST http://127.0.0.1:21434/api/v1/infra/maintenance/disable \
-H "Authorization: Bearer $TOKEN"
Cache Invalidation
After configuration changes, invalidate cached auth paths:
curl -X POST http://127.0.0.1:21434/api/v1/infra/cache-invalidate \
-H "Authorization: Bearer $TOKEN"
This refreshes persistent_term cached values for auth exempt paths and rate limit settings.
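The maintenance and cache endpoints can be combined into one wrapper around a configuration change. This is a sketch under assumptions, not an established procedure: infra_post and with_maintenance are hypothetical names, SOL_URL is an illustrative variable, and invalidating caches after leaving maintenance mode (rather than before) is a guess, since this runbook does not say whether the cache endpoint is reachable during maintenance. The reason string must not contain double quotes.

```shell
SOL_URL="${SOL_URL:-http://127.0.0.1:21434}"

# POST to an infra endpoint with the bearer token; optional JSON body as $2.
infra_post() {
    if [ -n "${2:-}" ]; then
        curl -sf -X POST "$SOL_URL/api/v1/infra/$1" \
            -H "Authorization: Bearer $TOKEN" \
            -H "Content-Type: application/json" \
            -d "$2"
    else
        curl -sf -X POST "$SOL_URL/api/v1/infra/$1" \
            -H "Authorization: Bearer $TOKEN"
    fi
}

# Enable maintenance, run the given command, then always disable
# maintenance and invalidate caches, even if the command failed.
with_maintenance() {
    local reason="$1"
    shift
    local status=0
    infra_post maintenance/enable "{\"reason\": \"$reason\"}" || return 1
    "$@" || status=$?
    infra_post maintenance/disable || status=$?
    infra_post cache-invalidate || status=$?
    return "$status"
}

# Usage: with_maintenance "config update" systemctl restart zerg-mango
```

Disabling maintenance unconditionally keeps a failed change from leaving the service drained.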