Troubleshooting

Common issues and solutions for ZERG deployments. Each section covers symptoms, root cause, and resolution steps.

Worker Connectivity

Worker Won't Connect via ZMQ

Symptom: Worker process is running but does not appear in zerg zmq-workers.

Resolution:

  1. Verify the ZMQ gateway is enabled and listening:
bash
ss -tlnp | grep 5555
curl -s http://127.0.0.1:21434/api/v1/zmq/status | jq .
  2. Check that the worker's ZMQ connection string matches the server's:
bash
zerg zmq-status
  3. Check for stale processes blocking the port:
bash
lsof -i :5555
  4. Check firewall rules if connecting across hosts:
bash
iptables -L -n | grep 5555
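
A plain `grep 5555` also matches unrelated ports such as 15555. The following is a minimal sketch of a stricter listening check, assuming the gateway port is 5555 as in the commands above. `port_is_listening` is a hypothetical helper, not part of the zerg CLI.

```bash
# Hypothetical helper (not part of the zerg CLI).
# Reads `ss -tln` output on stdin and succeeds only if some socket is
# listening on exactly the given port.
port_is_listening() {
    awk -v p="$1" '
        $1 == "LISTEN" {
            # Column 4 is the local address, e.g. 0.0.0.0:5555 or [::]:5555;
            # the port is whatever follows the last colon.
            n = split($4, parts, ":")
            if (parts[n] == p) found = 1
        }
        END { exit !found }'
}

# Example (run on the server):
#   ss -tln | port_is_listening 5555 && echo "gateway port is bound"
```

If the check fails while the Sol service is running, the gateway is most likely disabled or bound to a different port.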

Worker Disconnects Intermittently

Symptom: Workers appear and disappear from the worker list.

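Resolution: Check the heartbeat configuration. Workers must send ready messages periodically (by default every 5 seconds), and a worker whose message does not arrive within the heartbeat timeout is dropped from the list. To see the timeout currently in effect, run the following in the Sol console (15000 is the value returned when the key is unset):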

erlang
application:get_env(sol, worker_heartbeat_timeout_ms, 15000).

Increase the timeout or reduce network latency between worker and server.
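
As a rough sanity check, compare the timeout with the send interval. The sketch below uses a hypothetical helper, `heartbeat_intervals_per_timeout`, and assumes both values are in milliseconds.

```bash
# Hypothetical helper: how many heartbeat intervals fit into one timeout window.
#   $1 = worker send interval in ms, $2 = server heartbeat timeout in ms
heartbeat_intervals_per_timeout() {
    # Integer division; a result of 1 or less leaves no slack for a late message.
    echo $(( $2 / $1 ))
}

# Example with the values quoted above (5 s interval, 15000 ms fallback):
#   heartbeat_intervals_per_timeout 5000 15000
```

With a 5000 ms interval and a 15000 ms timeout the ratio is 3, so roughly two consecutive messages can be lost before the worker is dropped. A ratio close to 1 means any network jitter can make workers flap.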

Authentication Failures

Token Expired

Symptom: 401 Unauthorized on previously working requests.

Resolution: Refresh the token:

bash
zerg login --url https://api.nonsense.ws --token <new-token>

Or via the API:

bash
curl -X POST https://api.nonsense.ws/api/v1/auth/refresh \
  -H "Authorization: Bearer $TOKEN"
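
To confirm that expiry is really the cause before refreshing, you can inspect the token locally. This sketch assumes the token is a standard three-part JWT carrying an `exp` claim, which this page does not state explicitly. `jwt_exp` is a hypothetical helper that only decodes the payload and does not verify the signature.

```bash
# Hypothetical helper: print the `exp` claim (Unix seconds) of a JWT.
# Assumption: the token has the form header.payload.signature.
jwt_exp() {
    # Take the middle (payload) segment and map base64url to standard base64.
    payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
    # Restore the padding that base64url encoding strips.
    while [ $(( ${#payload} % 4 )) -ne 0 ]; do
        payload="${payload}="
    done
    # Decode and pull out the numeric exp value.
    printf '%s' "$payload" | base64 -d |
        sed -n 's/.*"exp"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Example:
#   [ "$(jwt_exp "$TOKEN")" -le "$(date +%s)" ] && echo "token has expired"
```

If the helper prints nothing, the token is either not a JWT or has no `exp` claim, and the 401 has a different cause.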

Mango Service Down

Symptom: All authenticated endpoints return 503 or timeout.

Resolution:

bash
systemctl status zerg-mango
journalctl -u zerg-mango -n 50
curl -s http://127.0.0.1:5800/health

Common causes:

  • Mango DB path is relative rather than absolute. Ensure MANGO_DB_URL=sqlite:///opt/zerg/mango/data/mango.db
  • Python dependencies missing: pip3 install tornado bcrypt pyjwt aiosqlite passlib
  • Port 5800 already in use

403 Forbidden on Admin Endpoints

Symptom: Admin user cannot access infrastructure endpoints.

Resolution: Verify the RBAC chain is seeded:

bash
cd /opt/zerg/mango
python3 scripts/bootstrap_admin.py --auto
systemctl restart zerg-mango

The admin user must be a member of the system-admin team and hold the admin role, and that role must contain permissions: ["*"].

Inference Errors

Model Not Found

Symptom: "model not found" error when starting inference.

Resolution:

bash
zerg models
ls -la /opt/zerg/data/models/

Ensure model files are in the directory configured by model_dirs in sys.config:

erlang
{model_dirs, ["/opt/zerg/data/models"]}

Out of Memory

Symptom: Inference process crashes, OOM killer invoked.

Resolution:

  1. Check current memory usage (pgrep -o selects the oldest matching process, so the path resolves to a single PID):
bash
free -h
grep VmRSS /proc/$(pgrep -o beam)/status
  2. Reduce context length in the model configuration
  3. Switch to a smaller model
  4. Increase the systemd memory limit:
ini
MemoryMax=8G
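
To judge how close the process is to the limit, convert VmRSS to MiB and compare it with MemoryMax. The helpers below are a hypothetical sketch. The 80 percent threshold in the example is an arbitrary assumption, and 8192 MiB corresponds to the 8G limit shown above.

```bash
# Hypothetical helper: read a /proc/<pid>/status file on stdin and print the
# resident set size in whole MiB (the kernel reports VmRSS in kB).
rss_mib() {
    awk '/^VmRSS:/ { printf "%d\n", $2 / 1024 }'
}

# Hypothetical helper: succeed if usage has reached a percentage of the limit.
#   $1 = RSS in MiB, $2 = limit in MiB, $3 = threshold in percent
over_budget() {
    [ $(( $1 * 100 )) -ge $(( $2 * $3 )) ]
}

# Example:
#   rss=$(rss_mib < /proc/$(pgrep -o beam)/status)
#   over_budget "$rss" 8192 80 && echo "RSS is above 80% of MemoryMax"
```

Note that this measures only the Erlang VM. If the model runs in a separate llama.cpp process, check that process in the same way.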

Inference Hangs

Symptom: Requests to inference endpoints never return.

Resolution:

bash
curl -s http://127.0.0.1:21434/api/v1/inference/status | jq .

Check if the llama.cpp process is running:

bash
pgrep -af llama

Kill and restart the instance:

bash
zerg inference stop <instance-id>
zerg inference start --model <model-name>
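
When a request may never return, it helps to put a deadline on the diagnostic call itself. The following sketch wraps any command with GNU coreutils `timeout`, which exits with status 124 when the deadline is reached. `probe` is a hypothetical helper and the 5 second deadline in the example is an assumption.

```bash
# Hypothetical helper: run a command with a deadline and classify the result.
#   $1 = deadline in seconds, remaining arguments = command to run
probe() {
    secs="$1"
    shift
    timeout "$secs" "$@" >/dev/null 2>&1
    case $? in
        0)   echo ok ;;       # command finished successfully in time
        124) echo hung ;;     # deadline reached, command was terminated
        *)   echo failed ;;   # command finished in time with an error
    esac
}

# Example:
#   probe 5 curl -sf http://127.0.0.1:21434/api/v1/inference/status
```

A result of `hung` suggests the server itself is blocked, while `failed` means it answered with an error or refused the connection.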

Slow Responses

High Latency on API Endpoints

Diagnosis:

bash
curl -s http://127.0.0.1:21434/metrics | grep sol_http_request_duration
curl -s http://127.0.0.1:21434/api/v1/slo | jq .

Common causes:

  • ETS table scan on large datasets (check table sizes)
  • Slow Mango token validation (check Mango response times)
  • Blocked gen_server:call timeouts (check for processes stuck in call)
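
The raw metrics output is hard to read at a glance. This sketch computes a mean duration from the scraped text, assuming the metric is a Prometheus histogram that exposes `<name>_sum` and `<name>_count` series without trailing timestamps. `mean_duration` is a hypothetical helper, and the exact series names on your installation may carry a unit suffix.

```bash
# Hypothetical helper: read Prometheus text format on stdin and print the mean
# of a histogram, summed across all label sets.
#   $1 = metric base name, e.g. sol_http_request_duration
mean_duration() {
    awk -v m="$1" '
        index($1, m "_sum") == 1   { sum += $NF }     # total observed time
        index($1, m "_count") == 1 { count += $NF }   # number of observations
        END {
            if (count > 0) printf "%.3f\n", sum / count
            else print "n/a"
        }'
}

# Example:
#   curl -s http://127.0.0.1:21434/metrics | mean_duration sol_http_request_duration
```

The mean is cumulative since process start, so take two readings some minutes apart to see whether latency is currently rising.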

Slow Inference

Check the inference latency histogram:

bash
curl -s http://127.0.0.1:21434/metrics | grep sol_inference_duration

Consider switching to a smaller model or reducing batch size.

ETS Table Issues

Table Not Found

Symptom: badarg errors in logs referencing ETS tables.

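Resolution: ETS tables configured with an heir survive a crash of their owner, because ownership passes to the heir process. If the heir process has also crashed, the table is gone and any access raises badarg. In that case, restart Sol: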

bash
systemctl restart zerg-sol

24 ETS tables are managed under sol_owners_sup.

Large Table Sizes

Check table sizes:

erlang
mnesia:table_info(sol_events, size).
ets:info(sol_models, size).

For event store growth, consider archiving old events.

Log Locations

| Component | Location |
| --- | --- |
| Sol server | journalctl -u zerg-sol |
| Mango auth | journalctl -u zerg-mango |
| Luna worker | journalctl -u zerg-luna-worker |
| nginx | /var/log/nginx/error.log |
| PostgreSQL | /var/log/postgresql/ |
| Sol application logs | /opt/zerg/log/ |

Increase Log Level

erlang
logger:set_module_level(sol_zmq_gateway, debug).
logger:set_module_level(sol_auth_middleware, debug).

Reset to default:

erlang
logger:unset_module_level(sol_zmq_gateway).

Debugging Tips

Useful CLI Commands

bash
zerg health
zerg discover
zerg models
zerg zmq-workers
zerg zmq-status
zerg agents
zerg task-status <id>
zerg config get default.token

Remote Erlang Console

bash
ssh root@blue "su - zerg -c '/opt/zerg/bin/sol remote'"

Useful diagnostic commands:

erlang
application:which_applications().
supervisor:count_children(sol_sup).
process_info(whereis(sol_zmq_gateway), message_queue_len).
ets:info(sol_models).

Check Process Mailbox Backlog

erlang
lists:foreach(fun(Pid) ->
    case process_info(Pid, message_queue_len) of
        {message_queue_len, N} when N > 100 ->
            io:format("~p ~p backlog: ~p~n", [Pid, process_info(Pid, registered_name), N]);
        _ -> ok
    end
end, processes()).

Check gen_server:call Timeouts

Search logs for timeout errors:

bash
journalctl -u zerg-sol --since "1 hour ago" | grep -i timeout

All gen_server:call operations use a ?CALL_TIMEOUT constant (default 5000 ms). If you see timeouts, the target process may be blocked or may have a large mailbox backlog.
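
Grouping the matches by minute shows whether timeouts arrive in bursts or at a steady rate. This sketch assumes the default journalctl short output format, where the third field is the HH:MM:SS timestamp. `timeouts_per_minute` is a hypothetical helper.

```bash
# Hypothetical helper: read journal lines on stdin and print "HH:MM count"
# for every minute that contains at least one timeout message.
timeouts_per_minute() {
    awk 'tolower($0) ~ /timeout/ {
             split($3, t, ":")            # $3 is HH:MM:SS in the short format
             count[t[1] ":" t[2]]++
         }
         END { for (m in count) print m, count[m] }' | sort
}

# Example:
#   journalctl -u zerg-sol --since "1 hour ago" | timeouts_per_minute
```

A burst confined to one or two minutes usually points to a single blocked process, which the mailbox backlog check above can help locate.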

Released under the MIT License.