Health & SLO API
Sol provides health checks, SLO compliance monitoring, and Prometheus metrics for production observability.
Health Check
GET /api/v1/health
Returns the full system health status across 12 subsystems. No authentication required (exempt path).
curl https://api.nonsense.ws/api/v1/health
Healthy response (200):
{
"status": "ok",
"checks": [
{"name": "mnesia", "status": "ok", "details": {"tables": 18}},
{"name": "zmq", "status": "ok", "details": {"workers": 3}},
{"name": "mango", "status": "ok", "details": {"enabled": true}},
{"name": "postgres", "status": "ok", "details": {"configured": true}},
{"name": "ets", "status": "ok", "details": {"tables": 14, "optional": 1}},
{"name": "listeners", "status": "ok", "details": {"http_port": 11434}},
{"name": "disk", "status": "ok", "details": {"checked": 3}},
{"name": "process_util", "status": "ok", "details": {"count": 234, "limit": 262144, "ratio": 0.001}},
{"name": "memory_util", "status": "ok", "details": {"total_bytes": 134217728, "binary_bytes": 8388608}},
{"name": "message_queues", "status": "ok", "details": {"backlogged": 0, "threshold": 100}},
{"name": "models", "status": "ok", "details": {"ready": 2}},
{"name": "workers", "status": "ok", "details": {"stale": 0}}
]
}
Degraded response (503):
{
"status": "degraded",
"checks": [
{"name": "mnesia", "status": "ok", "details": {"tables": 18}},
{"name": "zmq", "status": "error", "details": {"reason": "gateway not responding"}},
{"name": "disk", "status": "warning", "details": {"critical": [], "warning": ["priv/mnesia"], "checked": 3}}
]
}
Subsystems Checked
| Check | Description | Error Conditions |
|---|---|---|
| mnesia | Mnesia database status | Not started, not running |
| zmq | ZMQ gateway and worker connections | Gateway not responding |
| mango | Mango auth service connectivity | Not connected |
| postgres | PostgreSQL connection pool | Health check failed |
| ets | 14 core + 6 lazy + 1 optional ETS tables | Core tables missing |
| listeners | Cowboy HTTP listener bound to port | No listener |
| disk | Disk space for Mnesia, backups, logs | Below critical (100MB) or warning (1GB) |
| process_util | Erlang process count vs limit | Above 80% (warning), 95% (error) |
| memory_util | Total BEAM memory usage | Above 4GB (warning) |
| message_queues | Process mailbox backlog | Any process above 100 messages |
| models | Ready inference models | Check passes regardless |
| workers | ZMQ worker staleness (30s threshold) | Workers not seen recently |
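As a rough illustration of consuming this endpoint, the Python sketch below takes a parsed /api/v1/health body and lists every check whose status is not "ok". The helper name `failing_checks` is hypothetical, and the payload is a trimmed copy of the degraded response documented above. The sketch assumes only the `status`, `checks`, `name`, and `details` fields that appear in those responses.

```python
import json


def failing_checks(health: dict) -> list[tuple[str, str]]:
    """Return (name, status) pairs for every check that is not "ok"."""
    return [
        (check["name"], check["status"])
        for check in health.get("checks", [])
        if check.get("status") != "ok"
    ]


# Trimmed copy of the degraded (503) response documented above.
degraded = json.loads("""
{
  "status": "degraded",
  "checks": [
    {"name": "mnesia", "status": "ok", "details": {"tables": 18}},
    {"name": "zmq", "status": "error", "details": {"reason": "gateway not responding"}},
    {"name": "disk", "status": "warning", "details": {"critical": [], "warning": ["priv/mnesia"], "checked": 3}}
  ]
}
""")

for name, status in failing_checks(degraded):
    print(f"{name}: {status}")
```

With the sample payload this reports the zmq and disk checks and skips mnesia.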
Liveness Probe
GET /health
Lightweight check that returns 200 whenever the HTTP server is running. Suitable for Kubernetes liveness probes.
{
"status": "ok"
}
Readiness Probe
GET /api/v1/ready
Runs the same checks as /api/v1/health but returns 200 only when all subsystems are healthy.
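A deployment script might poll the readiness endpoint until it returns 200 before routing traffic. The sketch below is one possible way to do that in Python, not part of Sol itself. `http_status` and `wait_until_ready` are hypothetical helper names, the retry parameters are arbitrary, and the status fetcher is injected so the loop can be exercised without a running server.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code for a GET request, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # urllib raises HTTPError for 4xx/5xx responses
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, timeout, ...


def wait_until_ready(
    fetch_status: Callable[[], int],
    attempts: int = 10,
    delay: float = 1.0,
) -> bool:
    """Poll until fetch_status() returns 200 or the attempts run out."""
    for attempt in range(attempts):
        if fetch_status() == 200:
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # no sleep after the final attempt
    return False
```

A caller could then write `wait_until_ready(lambda: http_status("https://api.nonsense.ws/api/v1/ready"))` and treat a `False` result as a failed rollout.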
SLO Metrics
GET /api/v1/slo
Returns current SLO compliance status across 7 service level objectives.
curl -H "Authorization: Bearer $TOKEN" \
  https://api.nonsense.ws/api/v1/slo
Response:
{
"slos": [
{"name": "api_availability", "target": 99.9, "actual": 99.95, "status": "healthy", "unit": "percent"},
{"name": "p99_latency_ms", "target": 1000.0, "actual": 245.3, "status": "healthy", "unit": "ms"},
{"name": "worker_crash_rate", "target": 1.0, "actual": 0.2, "status": "healthy", "unit": "percent"},
{"name": "event_loss", "target": 0.0, "actual": 0.0, "status": "healthy", "unit": "events"},
{"name": "zmq_message_loss", "target": 0.0, "actual": 0.0, "status": "healthy", "unit": "messages"},
{"name": "circuit_breaker_opens", "target": 3.0, "actual": 0.0, "status": "healthy", "unit": "per_hour"},
{"name": "memory_pressure", "target": 80.0, "actual": 134.2, "status": "healthy", "unit": "mb"}
]
}
SLO Status Values
| Status | Meaning |
|---|---|
| healthy | Within target with margin |
| degraded | Within target but margin below 10% |
| breached | Target exceeded |
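To act on these status values, a client can group the objectives in an /api/v1/slo response by their reported status. The following Python sketch shows one way to do so. `slos_by_status` is a hypothetical helper, and the sample payload uses made-up `actual` values chosen only to exercise all three statuses; they are not real measurements.

```python
from collections import defaultdict


def slos_by_status(payload: dict) -> dict[str, list[str]]:
    """Group SLO names by the status string the server reported."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for slo in payload.get("slos", []):
        grouped[slo["status"]].append(slo["name"])
    return dict(grouped)


# Illustrative payload in the documented shape; values are invented.
sample = {
    "slos": [
        {"name": "api_availability", "target": 99.9, "actual": 99.95,
         "status": "healthy", "unit": "percent"},
        {"name": "p99_latency_ms", "target": 1000.0, "actual": 950.0,
         "status": "degraded", "unit": "ms"},
        {"name": "worker_crash_rate", "target": 1.0, "actual": 1.4,
         "status": "breached", "unit": "percent"},
    ]
}

summary = slos_by_status(sample)
for status in ("breached", "degraded"):
    for name in summary.get(status, []):
        print(f"{status}: {name}")
```

The sketch trusts the server's `status` field rather than recomputing it from `target` and `actual`, since the exact margin calculation is not specified here.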
SLO Definitions
| SLO | Target | Unit | Source |
|---|---|---|---|
| API Availability | 99.9% | percent | HTTP 2xx / total requests |
| P99 Latency | 1000ms | ms | Latency histogram percentile |
| Worker Crash Rate | 1.0% | percent | Crashed / dispatched tasks |
| Event Loss | 0 | events | Event store loss counter |
| ZMQ Message Loss | 0 | messages | ZMQ message loss counter |
| Circuit Breaker Opens | 3/hour | per_hour | Circuit open counter |
| Memory Pressure | 80MB | mb | BEAM total memory |
Prometheus Metrics
GET /metrics
Exposes metrics in Prometheus exposition format (text/plain). No authentication required.
curl https://api.nonsense.ws/metrics
Output format:
# HELP sol_http_requests_total Total counter for sol_http_requests_total
# TYPE sol_http_requests_total counter
sol_http_requests_total 1234
# HELP sol_errors_total Error counts by module and category
# TYPE sol_errors_total counter
sol_errors_total{module="sol_pipeline",category="timeout"} 3
# HELP sol_latency_histogram_request_duration Microsecond latency histogram
# TYPE sol_latency_histogram_request_duration summary
sol_latency_histogram_request_duration{quantile="0.5"} 12000
sol_latency_histogram_request_duration{quantile="0.9"} 45000
sol_latency_histogram_request_duration{quantile="0.99"} 245000
Metric Families
| Family | Type | Description |
|---|---|---|
| sol_*_total | counter | Request, task, worker counters |
| sol_errors_total | counter | Errors by module and category |
| sol_latency_* | summary | Request latency histogram |
| sol_circuit_breaker_state | gauge | Circuit breaker open/closed |
| sol_workers_* | gauge | Worker count and status |
| sol_models_* | gauge | Model availability |
| sol_uptime_seconds | gauge | Server uptime |
| sol_build_info | gauge | Version and build metadata |
| sol_process_* | gauge | Erlang process metrics |
| sol_memory_* | gauge | BEAM memory metrics |
| sol_ets_count | gauge | ETS table row counts |
| sol_task_queue_* | gauge | Task queue depth |
| sol_cluster_nodes | gauge | Connected cluster nodes |
| sol_availability_percent | gauge | Calculated availability |
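For ad hoc inspection outside Prometheus, the text output of /metrics can be parsed with a few lines of Python. The sketch below is a deliberately minimal parser, not a full implementation of the exposition format: it skips comment lines, ignores optional timestamps, and does not handle escaped quotes or braces inside label values. `parse_metrics` is a hypothetical helper name, and the sample text is copied from the output shown above.

```python
import re

# metric_name{optional="labels"} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text: str) -> list[tuple[str, dict[str, str], float]]:
    """Parse exposition text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore lines this simple parser cannot read
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


exposition = """\
# HELP sol_http_requests_total Total counter for sol_http_requests_total
# TYPE sol_http_requests_total counter
sol_http_requests_total 1234
# TYPE sol_errors_total counter
sol_errors_total{module="sol_pipeline",category="timeout"} 3
sol_latency_histogram_request_duration{quantile="0.99"} 245000
"""

for name, labels, value in parse_metrics(exposition):
    print(name, labels, value)
```

For production use, an existing Prometheus client library parser is likely a better choice than a hand-rolled regular expression.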
Monitoring Integration
Grafana
Configure Prometheus as a datasource and import the ZERG dashboard. Key panels:
- API request rate and error rate
- P50/P90/P99 latency
- Worker pool utilization
- Model inference throughput
- Circuit breaker state
Alerts
Sol exposes alert-friendly counters. Recommended alert rules:
| Alert | Condition | Severity |
|---|---|---|
| HighErrorRate | rate(sol_errors_total[5m]) > 0.05 | warning |
| HighLatency | sol_latency_p99 > 2000 | warning |
| WorkerCrash | rate(sol_worker_crashes[5m]) > 0.1 | critical |
| DiskSpace | sol_disk_available_bytes < 1e9 | critical |
| MemoryPressure | sol_memory_total_bytes > 4e9 | warning |
OpenTelemetry
Enable OpenTelemetry tracing in sys.config:
{otel_enabled, true},
{otel_endpoint, "http://tempo:4318"}
Sol exports OTLP spans for HTTP requests, inference calls, and workflow steps.