Production Runbook

Operational procedures for maintaining and recovering ZERG in production. This runbook covers upgrades, backups, incident response, and monitoring.

Rolling Upgrades

Automated Deployment

The deploy script at deployment/scripts/deploy.sh handles the full lifecycle:

bash
./deployment/scripts/deploy.sh root@blue nonsense.ws api.nonsense.ws

The script performs these steps in order:

  1. Installs system dependencies and Erlang 28 via asdf
  2. Sets up PostgreSQL databases
  3. Builds Sol on the target host (avoids glibc mismatch)
  4. Deploys Luna, Mango, and web frontends
  5. Backs up the current release
  6. Writes production configs and generates secrets
  7. Runs health checks with automatic rollback on failure

Manual Rolling Upgrade

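To upgrade an individual beam file without a full redeploy, copy it into the release directory (replace {VERSION} with the installed Sol version) and restart the service: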

bash
scp _build/prod/lib/sol/ebin/sol_http.beam zerg@blue:/opt/zerg/lib/sol-{VERSION}/ebin/

ssh root@blue "systemctl restart zerg-sol"

For a full release upgrade:

bash
cd server
rebar3 as prod release
scp -r _build/prod/rel/sol/* zerg@blue:/opt/zerg/

ssh root@blue "systemctl restart zerg-sol"

Health Check Verification

After any upgrade, verify health:

bash
curl -sf -o /dev/null -w '%{http_code}' http://127.0.0.1:21434/api/v1/ready

The deploy script retries this check 3 times at 5-second intervals.
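
The retry behaviour can be sketched as a small polling loop. This is a minimal illustration, not the code from deploy.sh: the function name wait_ready and its arguments are hypothetical, and it assumes that "3 times" means three attempts in total and that HTTP 200 is the only ready status.

```bash
# wait_ready URL [ATTEMPTS] [INTERVAL_SECONDS]
# Hypothetical helper: polls URL until it answers 200 or attempts run out.
wait_ready() {
  local url=$1 attempts=${2:-3} interval=${3:-5} i=1 code
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors; -w prints only the status code.
    code=$(curl -sf -o /dev/null -w '%{http_code}' "$url" || true)
    if [ "$code" = "200" ]; then
      echo "ready after attempt $i"
      return 0
    fi
    echo "attempt $i/$attempts: got '${code:-no response}'" >&2
    # Sleep between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then sleep "$interval"; fi
    i=$((i + 1))
  done
  return 1
}
```

For example, `wait_ready http://127.0.0.1:21434/api/v1/ready || echo "rollback needed"` would poll the readiness endpoint with the default of three attempts, five seconds apart.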

Automatic Rollback

If health checks fail after deployment, the script rolls back automatically:

bash
LATEST_BACKUP=$(ls -1dt /opt/zerg/backups/*/ | head -1)
systemctl stop zerg-sol zerg-mango
rm -rf /opt/zerg/releases /opt/zerg/lib /opt/zerg/etc
cp -a "$LATEST_BACKUP"/* /opt/zerg/
systemctl start zerg-mango
sleep 3
systemctl start zerg-sol

The script keeps the last 3 backups and prunes older ones.
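
A retention policy like this could be implemented along the following lines. This is a sketch under stated assumptions, not the deploy script's own code: the function name prune_backups is hypothetical, and it assumes each backup is a subdirectory whose name contains no newlines, as with timestamped directory names.

```bash
# prune_backups DIR [KEEP]
# Hypothetical helper: delete all but the KEEP newest subdirectories of DIR.
prune_backups() {
  local dir=$1 keep=${2:-3}
  # ls -1dt lists directories newest first; tail skips the first KEEP entries.
  ls -1dt "$dir"/*/ 2>/dev/null | tail -n +"$((keep + 1))" |
    while IFS= read -r old; do
      echo "pruning $old"
      rm -rf -- "$old"
    done
}
```

Calling `prune_backups /opt/zerg/backups 3` would keep the three most recently modified backup directories and remove the rest.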

Mnesia Backup and Restore

Create a Backup

bash
ssh root@blue "su - zerg -c '/opt/zerg/bin/sol remote'"
erlang
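% Build a YYYYMMDDHHMMSS timestamp inside the Erlang shell.
{{Y, Mo, D}, {H, Mi, S}} = calendar:local_time().
Stamp = io_lib:format("~4..0B~2..0B~2..0B~2..0B~2..0B~2..0B", [Y, Mo, D, H, Mi, S]).
mnesia:backup(lists:flatten(["/opt/zerg/data/mnesia_backups/backup-", Stamp])).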

Restore from Backup

erlang
mnesia:restore("/opt/zerg/data/mnesia_backups/backup-20260520120000", []).

Verify Mnesia Status

erlang
mnesia:system_info(tables).
mnesia:table_info(sol_events, size).

ZMQ Worker Management

Check Worker Status

bash
curl -s http://127.0.0.1:21434/api/v1/zmq/workers | jq .
curl -s http://127.0.0.1:21434/api/v1/zmq/status | jq .

Restart a Luna Worker

Workers are managed by systemd on the production host:

bash
ssh root@blue "systemctl restart zerg-luna-worker"

Workers reconnect automatically via ZMQ. No Sol restart needed.

Worker Connectivity Issues

If workers show as disconnected:

  1. Check the worker process is running: systemctl status zerg-luna-worker
  2. Check ZMQ gateway status: curl -s http://127.0.0.1:21434/api/v1/zmq/status
  3. Verify port 5555 is listening: ss -tlnp | grep 5555
  4. Check worker logs: journalctl -u zerg-luna-worker -n 50
  5. Restart the worker: systemctl restart zerg-luna-worker

Workers send heartbeat messages and re-register automatically on reconnect.
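
The read-only checks from the list above (steps 1 to 4) can be bundled into one helper, leaving the restart as a deliberate manual step. This is a minimal sketch: the function name diagnose_worker and its output format are hypothetical, and it only tests whether the status endpoint responds because the shape of its JSON body is not documented here.

```bash
# diagnose_worker [UNIT] [PORT]
# Hypothetical helper: runs the non-destructive checks and prints logs on failure.
diagnose_worker() {
  local unit=${1:-zerg-luna-worker} port=${2:-5555} problems=0

  if systemctl is-active --quiet "$unit"; then
    echo "ok: $unit is active"
  else
    echo "FAIL: $unit is not active"; problems=$((problems + 1))
  fi

  if curl -sf http://127.0.0.1:21434/api/v1/zmq/status >/dev/null; then
    echo "ok: ZMQ gateway status endpoint responds"
  else
    echo "FAIL: ZMQ gateway status endpoint unreachable"; problems=$((problems + 1))
  fi

  # Match ":PORT" followed by whitespace so 5555 does not match 55555.
  if ss -tln | grep -Eq ":${port}[[:space:]]"; then
    echo "ok: port $port is listening"
  else
    echo "FAIL: nothing listening on port $port"; problems=$((problems + 1))
  fi

  if [ "$problems" -gt 0 ]; then
    echo "--- last 50 log lines for $unit ---"
    journalctl -u "$unit" -n 50 --no-pager
    return 1
  fi
}
```

If `diagnose_worker` returns non-zero, review the printed log lines before restarting the worker.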

Health Check Procedures

Endpoint Health

bash
curl -s http://127.0.0.1:21434/health | jq .

The health endpoint checks 20+ ETS tables, disk space, memory, and process counts.

Service Health

bash
systemctl status zerg-sol zerg-mango nginx postgresql

Full System Check

bash
zerg health
zerg discover
zerg zmq-workers
zerg models

Common Incident Response

Sol Continuously Restarting

Symptom: nginx 502 errors every few minutes.

Cause: systemd Type=forking with no PID file, or stale beam.smp processes.

bash
systemctl stop zerg-sol
pkill -u zerg beam.smp || true
pkill epmd || true
systemctl start zerg-sol

Ensure the systemd unit uses Type=simple with ExecStart=/opt/zerg/bin/sol foreground.
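
One way to confirm the service type without opening the unit file is to query systemd directly. The helper name check_unit_type below is hypothetical; `systemctl show -p Type --value` is a standard systemd invocation, though `--value` requires a reasonably recent systemd.

```bash
# check_unit_type [UNIT]
# Hypothetical helper: succeeds only if the unit's service type is "simple".
check_unit_type() {
  local unit=${1:-zerg-sol} unit_type
  # --value prints just the property value, without the "Type=" prefix.
  unit_type=$(systemctl show -p Type --value "$unit")
  if [ "$unit_type" = "simple" ]; then
    echo "ok: $unit uses Type=simple"
  else
    echo "WARN: $unit uses Type=${unit_type:-unknown}; expected simple" >&2
    return 1
  fi
}
```

Running `check_unit_type zerg-sol` after editing the unit (and after `systemctl daemon-reload`) shows whether the change took effect.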

Mango Auth Failures

bash
systemctl status zerg-mango
journalctl -u zerg-mango -n 100
curl -s http://127.0.0.1:5800/health

Check that the Mango DB path is absolute: sqlite:///opt/zerg/mango/data/mango.db.

Inference Out of Memory

bash
curl -s http://127.0.0.1:21434/api/v1/inference/status | jq .

If inference is running out of memory, reduce the model's context length or switch to a smaller model. The systemd memory limit is 4 GB by default.

ETS Table Corruption

erlang
ets:info(sol_models).
ets:info(sol_global_catalog).

If tables are missing, restart Sol. ETS tables with {heir, Pid} survive owner crashes via the heir mechanism.

Monitoring

Prometheus Metrics

bash
curl -s http://127.0.0.1:21434/metrics

Enable with SOL_PROMETHEUS_ENABLED=true. Key metric families:

| Metric | Description |
| --- | --- |
| sol_http_requests_total | HTTP request counter by method/path/status |
| sol_inference_duration_seconds | Inference latency histogram |
| sol_zmq_workers_connected | Currently connected ZMQ workers |
| sol_zmq_tasks_dispatched_total | Tasks dispatched to workers |
| sol_scheduler_jobs_active | Active scheduled jobs |
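
To pull a single metric family out of the scrape output, a small filter over the Prometheus text format is usually enough. This is a sketch: the function name metric_samples is hypothetical, and it assumes the endpoint emits the standard text exposition format, where each sample line starts with the metric name followed by optional labels in braces.

```bash
# metric_samples NAME
# Hypothetical helper: reads Prometheus text format on stdin and prints the
# sample lines for NAME, skipping "# HELP" and "# TYPE" comment lines.
metric_samples() {
  awk -v name="$1" '
    /^#/ { next }
    {
      # The metric name ends at the first "{" (labels) or space (no labels).
      n = $0
      sub(/[{ ].*/, "", n)
      if (n == name) print
    }'
}
```

For example, `curl -s http://127.0.0.1:21434/metrics | metric_samples sol_zmq_workers_connected` prints only the samples for the connected-workers metric. Matching on the exact name avoids picking up other metrics that merely share a prefix.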

SLO Endpoint

bash
curl -s http://127.0.0.1:21434/api/v1/slo | jq .

Grafana Dashboard

Grafana is available when using the monitoring Docker Compose profile:

bash
cd infra
docker compose --profile monitoring up -d

Configure the Prometheus datasource to point to http://sol:21434/metrics.

Maintenance Mode

Enable maintenance mode to drain traffic gracefully:

bash
curl -X POST http://127.0.0.1:21434/api/v1/infra/maintenance/enable \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"reason": "scheduled maintenance"}'

Health and auth endpoints remain accessible during maintenance.

Disable when complete:

bash
curl -X POST http://127.0.0.1:21434/api/v1/infra/maintenance/disable \
  -H "Authorization: Bearer $TOKEN"
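
The enable and disable calls can be paired in a wrapper so that maintenance mode is switched off again even when the maintenance task fails. This is a minimal sketch, assuming TOKEN is already exported; the function name with_maintenance is hypothetical, and the wrapper does not handle interruption by signals.

```bash
# with_maintenance CMD [ARGS...]
# Hypothetical helper: enable maintenance mode, run CMD, then disable
# maintenance mode again regardless of whether CMD succeeded.
with_maintenance() {
  local base=http://127.0.0.1:21434/api/v1/infra/maintenance status=0
  curl -sf -X POST "$base/enable" \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"reason": "scheduled maintenance"}' >/dev/null || return 1

  # Remember the task's exit status instead of aborting on failure.
  "$@" || status=$?

  curl -sf -X POST "$base/disable" \
    -H "Authorization: Bearer $TOKEN" >/dev/null || {
    echo "WARN: failed to disable maintenance mode" >&2
    return 1
  }
  return "$status"
}
```

For example, `with_maintenance systemctl restart zerg-sol` would wrap a restart in a maintenance window and return the restart command's exit status.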

Cache Invalidation

After configuration changes, invalidate cached auth paths:

bash
curl -X POST http://127.0.0.1:21434/api/v1/infra/cache-invalidate \
  -H "Authorization: Bearer $TOKEN"

This refreshes the values cached in persistent_term for auth-exempt paths and rate-limit settings.

Released under the MIT License.