Production Runbook

Operational procedures for maintaining and recovering ZERG in production. This runbook covers upgrades, backups, incident response, and monitoring.

Rolling Upgrades

Automated Deployment

The deploy script at deployment/scripts/deploy.sh handles the full lifecycle:

bash

./deployment/scripts/deploy.sh root@blue nonsense.ws api.nonsense.ws

The script performs these steps in order:

1. Installs system dependencies and Erlang 28 via asdf
2. Sets up PostgreSQL databases
3. Builds Sol on the target host (avoids glibc mismatch)
4. Deploys Luna, Mango, and web frontends
5. Backs up the current release
6. Writes production configs and generates secrets
7. Runs health checks with automatic rollback on failure

Manual Rolling Upgrade

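To upgrade individual .beam files without a full redeploy, copy each file into the release's ebin directory and restart Sol. Replace {VERSION} with the version of the installed sol application: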

bash

scp _build/prod/lib/sol/ebin/sol_http.beam zerg@blue:/opt/zerg/lib/sol-{VERSION}/ebin/

ssh root@blue "systemctl restart zerg-sol"

For a full release upgrade:

bash

cd server
rebar3 as prod release
scp -r _build/prod/rel/sol/* zerg@blue:/opt/zerg/

ssh root@blue "systemctl restart zerg-sol"

Health Check Verification

After any upgrade, verify health:

bash

curl -sf -o /dev/null -w '%{http_code}' http://127.0.0.1:21434/api/v1/ready

The deploy script retries 3 times with 5-second intervals.
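
The retry behaviour described above can be sketched as a small shell function. This is a minimal illustration, not the code in deploy.sh: the function names `retry` and `check_ready` are hypothetical, and only the endpoint, the attempt count, and the interval come from this runbook.

```shell
# Hypothetical helpers; deploy.sh may implement its retries differently.

# check_ready: succeed only if the readiness endpoint answers without an HTTP error.
check_ready() {
  curl -sf -o /dev/null http://127.0.0.1:21434/api/v1/ready
}

# retry ATTEMPTS INTERVAL CMD...: run CMD until it succeeds or ATTEMPTS is exhausted.
retry() {
  local attempts="$1" interval="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Wait between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
  return 1
}
```

With these definitions, `retry 3 5 check_ready` makes three attempts five seconds apart and returns non-zero if Sol never becomes ready.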

Automatic Rollback

If health checks fail after deployment, the script rolls back automatically:

bash

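# Pick the newest backup directory (ls -t sorts newest first).
LATEST_BACKUP=$(ls -1dt /opt/zerg/backups/*/ | head -1)
systemctl stop zerg-sol zerg-mango
# Remove the failed release, then restore the backup in its place.
rm -rf /opt/zerg/releases /opt/zerg/lib /opt/zerg/etc
cp -a "$LATEST_BACKUP"/* /opt/zerg/
# Start Mango first and give it a moment before starting Sol.
systemctl start zerg-mango
sleep 3
systemctl start zerg-sol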

The script keeps the last 3 backups and prunes older ones.
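
The pruning rule can be sketched as follows. This is an assumption about how such pruning might look, not the deploy script's actual code; `prune_backups` is a hypothetical name, and the sketch assumes backup directory names contain no newlines.

```shell
# prune_backups DIR KEEP: delete all but the KEEP newest subdirectories of DIR.
prune_backups() {
  local dir="$1" keep="$2"
  # List subdirectories newest first, skip the first KEEP, and remove the rest.
  ls -1dt "$dir"/*/ 2>/dev/null | tail -n +"$((keep + 1))" | while IFS= read -r old; do
    rm -rf -- "$old"
  done
}
```

For the layout used in this runbook the call would be `prune_backups /opt/zerg/backups 3`.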

Mnesia Backup and Restore

Create a Backup

bash

ssh root@blue "su - zerg -c '/opt/zerg/bin/sol remote'"

erlang

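{{Y, Mo, D}, {H, Mi, S}} = calendar:local_time().
mnesia:backup(lists:flatten(io_lib:format("/opt/zerg/data/mnesia_backups/backup-~4..0B~2..0B~2..0B~2..0B~2..0B~2..0B", [Y, Mo, D, H, Mi, S]))).

The timestamp is built in Erlang because the remote console does not expand shell substitutions such as $(date ...).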
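
An alternative for timestamped backup names is to generate the whole expression in a shell, where `date` is available, and paste the result into the remote console. A minimal sketch, assuming a hypothetical helper named `mnesia_backup_expr`:

```shell
# mnesia_backup_expr [DIR]: print an Erlang expression that backs up Mnesia
# to a file whose name carries a shell-generated timestamp.
mnesia_backup_expr() {
  local dir="${1:-/opt/zerg/data/mnesia_backups}"
  printf 'mnesia:backup("%s/backup-%s").\n' "$dir" "$(date +%Y%m%d%H%M%S)"
}
```

Running `mnesia_backup_expr` prints a single line that can be pasted into the console opened with `sol remote`.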

Restore from Backup

erlang

mnesia:restore("/opt/zerg/data/mnesia_backups/backup-20260520120000", []).

Verify Mnesia Status

erlang

mnesia:system_info(tables).
mnesia:table_info(sol_events, size).

ZMQ Worker Management

Check Worker Status

bash

curl -s http://127.0.0.1:21434/api/v1/zmq/workers | jq .
curl -s http://127.0.0.1:21434/api/v1/zmq/status | jq .

Restart a Luna Worker

Workers are managed by systemd on the production host:

bash

ssh root@blue "systemctl restart zerg-luna-worker"

Workers reconnect automatically over ZMQ, so Sol does not need to be restarted.

Worker Connectivity Issues

If workers show as disconnected:

1. Check the worker process is running: systemctl status zerg-luna-worker
2. Check ZMQ gateway status: curl -s http://127.0.0.1:21434/api/v1/zmq/status
3. Verify port 5555 is listening: ss -tlnp | grep 5555
4. Check worker logs: journalctl -u zerg-luna-worker -n 50
5. Restart the worker: systemctl restart zerg-luna-worker

Workers send heartbeat messages and re-register automatically on reconnect.
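
The first three checks above can be bundled into one pass that reports each result. This is a sketch under stated assumptions: `run_check` and `diagnose_worker` are hypothetical helper names, and the commands are the ones listed in this section, run on the production host.

```shell
# run_check LABEL CMD...: run CMD quietly and report whether it succeeded.
run_check() {
  local label="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    printf 'ok    %s\n' "$label"
  else
    printf 'FAIL  %s\n' "$label"
    return 1
  fi
}

# diagnose_worker: run every check and return non-zero if any of them failed.
diagnose_worker() {
  local failed=0
  run_check "worker service active" systemctl is-active --quiet zerg-luna-worker || failed=1
  run_check "ZMQ gateway status reachable" curl -sf http://127.0.0.1:21434/api/v1/zmq/status || failed=1
  run_check "port 5555 listening" sh -c 'ss -tln | grep -q ":5555 "' || failed=1
  return "$failed"
}
```

If `diagnose_worker` reports a failure, the worker logs and a worker restart are the next steps, as in the list above.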

Health Check Procedures

Endpoint Health

bash

curl -s http://127.0.0.1:21434/health | jq .

The health endpoint checks 20+ ETS tables, disk space, memory, and process counts.

Service Health

bash

systemctl status zerg-sol zerg-mango nginx postgresql

Full System Check

bash

zerg health
zerg discover
zerg zmq-workers
zerg models

Common Incident Response

Sol Continuously Restarting

Symptom: nginx 502 errors every few minutes.

Cause: systemd Type=forking with no PID file, or stale beam.smp processes.

bash

systemctl stop zerg-sol
pkill -u zerg beam.smp || true
pkill epmd || true
systemctl start zerg-sol

Ensure the systemd unit uses Type=simple with ExecStart=/opt/zerg/bin/sol foreground.
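
A quick way to confirm those two settings is to inspect the unit file. The sketch below assumes a hypothetical helper `check_unit` and that the unit file is readable at a path you supply; the location of the ZERG unit file is not stated in this runbook.

```shell
# check_unit FILE: succeed only if FILE declares Type=simple and an
# ExecStart line whose final argument is "foreground".
check_unit() {
  local file="$1"
  grep -Eq '^Type=simple[[:space:]]*$' "$file" &&
    grep -Eq '^ExecStart=.*[[:space:]]foreground[[:space:]]*$' "$file"
}
```

On a live host, `systemctl cat zerg-sol` shows the unit file systemd has loaded, and `systemctl show -p Type zerg-sol` shows the effective service type.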

Mango Auth Failures

bash

systemctl status zerg-mango
journalctl -u zerg-mango -n 100
curl -s http://127.0.0.1:5800/health

Check that the Mango database URL uses an absolute path: sqlite:///opt/zerg/mango/data/mango.db.

Inference Out of Memory

bash

curl -s http://127.0.0.1:21434/api/v1/inference/status | jq .

Reduce the model's context length or switch to a smaller model. The systemd memory limit is 4 GB by default.

ETS Table Corruption

erlang

ets:info(sol_models).
ets:info(sol_global_catalog).

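ets:info/1 returns undefined for a table that does not exist. If tables are missing, restart Sol to recreate them. Tables created with the {heir, Pid, HeirData} option are not lost when their owner crashes: ownership transfers to the heir process.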

Monitoring

Prometheus Metrics

bash

curl -s http://127.0.0.1:21434/metrics

Enable with SOL_PROMETHEUS_ENABLED=true. Key metric families:

Metric	Description
`sol_http_requests_total`	HTTP request counter by method/path/status
`sol_inference_duration_seconds`	Inference latency histogram
`sol_zmq_workers_connected`	Currently connected ZMQ workers
`sol_zmq_tasks_dispatched_total`	Tasks dispatched to workers
`sol_scheduler_jobs_active`	Active scheduled jobs
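
To look at only the metric families listed above, the /metrics output can be filtered by name prefix. A minimal sketch, with `key_metrics` as a hypothetical helper name:

```shell
# key_metrics: read Prometheus text output on stdin and keep only sample lines
# belonging to the metric families listed in the table above.
# Histogram series such as *_bucket, *_sum, and *_count match by prefix.
key_metrics() {
  grep -E '^(sol_http_requests_total|sol_inference_duration_seconds|sol_zmq_workers_connected|sol_zmq_tasks_dispatched_total|sol_scheduler_jobs_active)'
}
```

Typical use is `curl -s http://127.0.0.1:21434/metrics | key_metrics`. Comment lines (`# HELP`, `# TYPE`) are dropped because they do not start with a metric name.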

SLO Endpoint

bash

curl -s http://127.0.0.1:21434/api/v1/slo | jq .

Grafana Dashboard

Grafana is available when using the monitoring Docker Compose profile:

bash

cd infra
docker compose --profile monitoring up -d

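Prometheus scrapes Sol at http://sol:21434/metrics. In Grafana, configure the Prometheus datasource to point at the Prometheus server rather than at Sol's /metrics endpoint.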

Maintenance Mode

Enable maintenance mode to drain traffic gracefully:

bash

curl -X POST http://127.0.0.1:21434/api/v1/infra/maintenance/enable \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"reason": "scheduled maintenance"}'

Health and auth endpoints remain accessible during maintenance.

Disable when complete:

bash

curl -X POST http://127.0.0.1:21434/api/v1/infra/maintenance/disable \
  -H "Authorization: Bearer $TOKEN"
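
Because maintenance mode has to be switched off again even when the work in between fails, it can help to wrap the enable and disable calls around a command. The following is a sketch, not part of ZERG: `maintenance`, `with_maintenance`, and the `SOL_URL` variable are hypothetical names, and the reason string must not contain double quotes or backslashes because it is interpolated into JSON as-is.

```shell
SOL_URL="${SOL_URL:-http://127.0.0.1:21434}"

# maintenance enable|disable [REASON]: call the maintenance endpoints shown above.
maintenance() {
  local action="$1" reason="${2:-}"
  if [ "$action" = "enable" ]; then
    curl -sf -X POST "$SOL_URL/api/v1/infra/maintenance/enable" \
      -H "Authorization: Bearer $TOKEN" \
      -H "Content-Type: application/json" \
      -d "{\"reason\": \"$reason\"}"
  else
    curl -sf -X POST "$SOL_URL/api/v1/infra/maintenance/disable" \
      -H "Authorization: Bearer $TOKEN"
  fi
}

# with_maintenance REASON CMD...: enable maintenance mode, run CMD, then
# disable maintenance mode whether or not CMD succeeded. Returns CMD's status.
with_maintenance() {
  local reason="$1" rc=0
  shift
  maintenance enable "$reason" || return 1
  "$@" || rc=$?
  maintenance disable || echo "warning: failed to disable maintenance mode" >&2
  return "$rc"
}
```

For example, `with_maintenance "scheduled maintenance" systemctl restart zerg-sol` would drain traffic, restart Sol, and re-enable traffic afterwards.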

Cache Invalidation

After configuration changes, invalidate cached auth paths:

bash

curl -X POST http://127.0.0.1:21434/api/v1/infra/cache-invalidate \
  -H "Authorization: Bearer $TOKEN"

This refreshes the values cached in persistent_term for auth-exempt paths and rate-limit settings.
