Troubleshooting
Common issues and solutions for ZERG deployments. Each section covers symptoms, root cause, and resolution steps.
Worker Connectivity
Worker Won't Connect via ZMQ
Symptom: Worker process is running but does not appear in zerg zmq-workers.
Resolution:
- Verify the ZMQ gateway is enabled and listening:
ss -tlnp | grep 5555
curl -s http://127.0.0.1:21434/api/v1/zmq/status | jq .
- Check the worker's ZMQ connection string matches the server:
zerg zmq-status
- Check for stale processes blocking the port:
lsof -i :5555
- Check firewall rules if connecting across hosts:
iptables -L -n | grep 5555
Worker Disconnects Intermittently
Symptom: Workers appear and disappear from the worker list.
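Resolution: Check the heartbeat configuration. Workers must send ready messages periodically (default every 5 seconds). If workers drop out between heartbeats, the heartbeat timeout may be too short. Inspect the configured value from the Erlang console (the call falls back to 15000 ms when the key is unset):
application:get_env(sol, worker_heartbeat_timeout_ms, 15000).
Increase the timeout or reduce network latency between worker and server.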
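The relationship between the heartbeat interval and the timeout can be made concrete with a little arithmetic. The sketch below is only an illustration: `heartbeat_ratio` is a hypothetical helper, not a zerg command, and it assumes both values are given in milliseconds.

```shell
#!/usr/bin/env bash
# Sketch: how many full heartbeat intervals fit inside the timeout.
# heartbeat_ratio is a hypothetical helper, not part of the zerg CLI.
# Arguments: <interval_ms> <timeout_ms>
heartbeat_ratio() {
  local interval_ms=$1 timeout_ms=$2
  if [ "$interval_ms" -le 0 ]; then
    echo "interval must be positive" >&2
    return 1
  fi
  # Integer division: whole intervals covered by the timeout window.
  echo $(( timeout_ms / interval_ms ))
}
```

With the values mentioned above, `heartbeat_ratio 5000 15000` prints 3. A ratio of 2 or less leaves little margin: a single lost or delayed heartbeat is likely enough for the worker to be dropped from the list.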
Authentication Failures
Token Expired
Symptom: 401 Unauthorized on previously working requests.
Resolution: Refresh the token:
zerg login --url https://api.nonsense.ws --token <new-token>
Or via the API:
curl -X POST https://api.nonsense.ws/api/v1/auth/refresh \
  -H "Authorization: Bearer $TOKEN"
Mango Service Down
Symptom: All authenticated endpoints return 503 or timeout.
Resolution:
systemctl status zerg-mango
journalctl -u zerg-mango -n 50
curl -s http://127.0.0.1:5800/health
Common causes:
- Mango DB path is relative instead of absolute. Ensure:
MANGO_DB_URL=sqlite:///opt/zerg/mango/data/mango.db
- Python dependencies missing:
pip3 install tornado bcrypt pyjwt aiosqlite passlib
- Port 5800 already in use
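To confirm the last cause, the listening sockets can be checked for the port. The following is a minimal sketch, assuming the usual `ss -tln` column layout (state in column 1, local address and port in column 4); `port_in_use` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: read `ss -tln` output on stdin and succeed (exit 0) if any
# socket is listening on the given port. port_in_use is a hypothetical helper.
port_in_use() {
  local port=$1
  awk -v port="$port" '
    # Column 4 is "Local Address:Port"; match only the trailing port number
    # so that 5800 does not match 15800.
    $1 == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
    END { exit found ? 0 : 1 }
  '
}
```

Usage: `ss -tln | port_in_use 5800 && echo "port 5800 is taken"`. If the port is taken, `lsof -i :5800` shows which process holds it.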
403 Forbidden on Admin Endpoints
Symptom: Admin user cannot access infrastructure endpoints.
Resolution: Verify the RBAC chain is seeded:
cd /opt/zerg/mango
python3 scripts/bootstrap_admin.py --auto
systemctl restart zerg-mango
The admin user must have system-admin team membership with an admin role containing permissions: ["*"].
Inference Errors
Model Not Found
Symptom: "model not found" error when starting inference.
Resolution:
zerg models
ls -la /opt/zerg/data/models/
Ensure model files are in the directory configured by model_dirs in sys.config:
{model_dirs, ["/opt/zerg/data/models"]}
Out of Memory
Symptom: Inference process crashes, OOM killer invoked.
Resolution:
- Check current memory usage:
free -h
cat /proc/$(pgrep beam)/status | grep VmRSS
- Reduce context length in the model configuration
- Switch to a smaller model
- Increase the systemd memory limit:
MemoryMax=8G
Inference Hangs
Symptom: Requests to inference endpoints never return.
Resolution:
curl -s http://127.0.0.1:21434/api/v1/inference/status | jq .
Check if the llama.cpp process is running:
ps aux | grep llama
Kill and restart the instance:
zerg inference stop <instance-id>
zerg inference start --model <model-name>
Slow Responses
High Latency on API Endpoints
Diagnosis:
curl -s http://127.0.0.1:21434/metrics | grep sol_http_request_duration
curl -s http://127.0.0.1:21434/api/v1/slo | jq .
Common causes:
- ETS table scan on large datasets (check table sizes)
- Slow Mango token validation (check Mango response times)
- gen_server:call timeouts caused by blocked processes (check for processes stuck in call)
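When narrowing down which cause applies, a single mean latency figure is often easier to compare over time than raw histogram lines. The sketch below assumes the endpoint exposes standard Prometheus histogram series, so that `sol_http_request_duration_sum` and `sol_http_request_duration_count` exist; those suffixed names, the helper name `histogram_mean`, and the absence of trailing timestamps on sample lines are assumptions, not confirmed details of Sol.

```shell
#!/usr/bin/env bash
# Sketch: read Prometheus text-format metrics on stdin and print the mean
# of a histogram as sum/count. Series with different labels are added
# together. histogram_mean is a hypothetical helper.
histogram_mean() {
  local metric=$1
  awk -v m="$metric" '
    /^#/ { next }                      # skip HELP/TYPE comment lines
    {
      name = $1
      sub(/\{.*/, "", name)            # drop the label set, if any
      if (name == m "_sum")   sum   += $NF
      if (name == m "_count") count += $NF
    }
    END {
      if (count == 0) { print "no samples"; exit 1 }
      printf "%.4f\n", sum / count
    }
  '
}
```

Usage: `curl -s http://127.0.0.1:21434/metrics | histogram_mean sol_http_request_duration`. The result is in whatever unit the histogram records, which the metric name alone does not reveal.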
Slow Inference
Check the inference latency histogram:
curl -s http://127.0.0.1:21434/metrics | grep sol_inference_duration
Consider switching to a smaller model or reducing batch size.
ETS Table Issues
Table Not Found
Symptom: badarg errors in logs referencing ETS tables.
Resolution: ETS tables with heir protection survive owner crashes. If the heir owner also crashed, restart Sol:
systemctl restart zerg-sol
The 24 ETS tables are managed under sol_owners_sup.
Large Table Sizes
Check table sizes:
mnesia:table_info(sol_events, size).
ets:info(sol_models, size).
For event store growth, consider archiving old events.
Log Locations
| Component | Location |
|---|---|
| Sol server | journalctl -u zerg-sol |
| Mango auth | journalctl -u zerg-mango |
| Luna worker | journalctl -u zerg-luna-worker |
| nginx | /var/log/nginx/error.log |
| PostgreSQL | /var/log/postgresql/ |
| Sol application logs | /opt/zerg/log/ |
Increase Log Level
logger:set_module_level(sol_zmq_gateway, debug).
logger:set_module_level(sol_auth_middleware, debug).
Reset to default:
logger:unset_module_level(sol_zmq_gateway).
Debugging Tips
Useful CLI Commands
zerg health
zerg discover
zerg models
zerg zmq-workers
zerg zmq-status
zerg agents
zerg task-status <id>
zerg config get default.token
Remote Erlang Console
ssh root@blue "su - zerg -c '/opt/zerg/bin/sol remote'"
Useful diagnostic commands:
application:which_applications().
supervisor:count_children(sol_sup).
process_info(whereis(sol_zmq_gateway), message_queue_len).
ets:info(sol_models).
Check Process Mailbox Backlog
lists:foreach(fun(Pid) ->
    case process_info(Pid, message_queue_len) of
        {message_queue_len, N} when N > 100 ->
            io:format("~p ~p backlog: ~p~n", [Pid, process_info(Pid, registered_name), N]);
        _ -> ok
    end
end, processes()).
Check gen_server:call Timeouts
Search logs for timeout errors:
journalctl -u zerg-sol --since "1 hour ago" | grep -i timeout
All gen_server:call operations use a ?CALL_TIMEOUT constant (default 5000 ms). If you see timeouts, the target process may be blocked.
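Grouping timeout lines by hour can show whether they are isolated or clustered, which helps tell a briefly busy process from one that is persistently blocked. This is a rough sketch, assuming journalctl's default short output format (month, day, and time as the first three fields); `timeouts_per_hour` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: read journalctl short-format output on stdin and count lines
# mentioning "timeout" (case-insensitive), grouped by "Mon DD HH".
# timeouts_per_hour is a hypothetical helper.
timeouts_per_hour() {
  awk '
    tolower($0) ~ /timeout/ {
      key = $1 " " $2 " " substr($3, 1, 2)   # month, day, hour
      if (!(key in count)) order[++n] = key   # remember first-seen order
      count[key]++
    }
    END {
      for (i = 1; i <= n; i++) print order[i] ":00", count[order[i]]
    }
  '
}
```

Usage: `journalctl -u zerg-sol --since "1 day ago" | timeouts_per_hour`. Each output line is an hour bucket followed by the number of matching log lines in that hour.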