Health & SLO API

Sol exposes health checks, liveness and readiness probes, SLO compliance status, and Prometheus metrics for production observability.

Health Check

GET /api/v1/health

Returns the full system health status across 12 subsystems. No authentication required (exempt path).

bash
curl https://api.nonsense.ws/api/v1/health

Healthy response (200):

json
{
  "status": "ok",
  "checks": [
    {"name": "mnesia", "status": "ok", "details": {"tables": 18}},
    {"name": "zmq", "status": "ok", "details": {"workers": 3}},
    {"name": "mango", "status": "ok", "details": {"enabled": true}},
    {"name": "postgres", "status": "ok", "details": {"configured": true}},
    {"name": "ets", "status": "ok", "details": {"tables": 14, "optional": 1}},
    {"name": "listeners", "status": "ok", "details": {"http_port": 11434}},
    {"name": "disk", "status": "ok", "details": {"checked": 3}},
    {"name": "process_util", "status": "ok", "details": {"count": 234, "limit": 262144, "ratio": 0.001}},
    {"name": "memory_util", "status": "ok", "details": {"total_bytes": 134217728, "binary_bytes": 8388608}},
    {"name": "message_queues", "status": "ok", "details": {"backlogged": 0, "threshold": 100}},
    {"name": "models", "status": "ok", "details": {"ready": 2}},
    {"name": "workers", "status": "ok", "details": {"stale": 0}}
  ]
}

Degraded response (503):

json
{
  "status": "degraded",
  "checks": [
    {"name": "mnesia", "status": "ok", "details": {"tables": 18}},
    {"name": "zmq", "status": "error", "details": {"reason": "gateway not responding"}},
    {"name": "disk", "status": "warning", "details": {"critical": [], "warning": ["priv/mnesia"], "checked": 3}}
  ]
}
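A monitoring script usually only cares about the checks that are not "ok". The following is a minimal sketch of that filtering, using the degraded example above as inline data so it runs offline. The helper name `failing_checks` is hypothetical and not part of Sol.

```python
import json


def failing_checks(payload: dict) -> list[tuple[str, str, dict]]:
    """Return (name, status, details) for every check whose status is not "ok"."""
    return [
        (check["name"], check.get("status", "unknown"), check.get("details", {}))
        for check in payload.get("checks", [])
        if check.get("status") != "ok"
    ]


# The degraded response shown above, inlined as sample data.
sample = json.loads("""
{
  "status": "degraded",
  "checks": [
    {"name": "mnesia", "status": "ok", "details": {"tables": 18}},
    {"name": "zmq", "status": "error", "details": {"reason": "gateway not responding"}},
    {"name": "disk", "status": "warning", "details": {"critical": [], "warning": ["priv/mnesia"], "checked": 3}}
  ]
}
""")

for name, status, details in failing_checks(sample):
    print(f"{name}: {status} {details}")
```

With the sample above, only the `zmq` and `disk` checks are reported.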

Subsystems Checked

| Check | Description | Error Conditions |
| --- | --- | --- |
| mnesia | Mnesia database status | Not started, not running |
| zmq | ZMQ gateway and worker connections | Gateway not responding |
| mango | Mango auth service connectivity | Not connected |
| postgres | PostgreSQL connection pool | Health check failed |
| ets | 14 core + 6 lazy + 1 optional ETS tables | Core tables missing |
| listeners | Cowboy HTTP listener bound to port | No listener |
| disk | Disk space for Mnesia, backups, logs | Below critical (100MB) or warning (1GB) |
| process_util | Erlang process count vs limit | Above 80% (warning), 95% (error) |
| memory_util | Total BEAM memory usage | Above 4GB (warning) |
| message_queues | Process mailbox backlog | Any process above 100 messages |
| models | Ready inference models | Check passes regardless |
| workers | ZMQ worker staleness (30s threshold) | Workers not seen recently |
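To make the threshold semantics concrete, here is a sketch of how the `process_util` thresholds from the table could be applied to the `count` and `limit` fields of the health response. It assumes "above" means strictly greater than the threshold. The function name is hypothetical, and Sol's own implementation may differ.

```python
def classify_process_util(count: int, limit: int,
                          warn: float = 0.80, error: float = 0.95) -> str:
    """Map a process count and limit to "ok", "warning", or "error".

    Assumption: thresholds are exclusive, so exactly 80% is still "ok".
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    ratio = count / limit
    if ratio > error:
        return "error"
    if ratio > warn:
        return "warning"
    return "ok"


# Values taken from the healthy response example.
print(classify_process_util(234, 262144))
```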

Liveness Probe

GET /health

Lightweight check that always returns 200 if the HTTP server is running. Suitable for Kubernetes liveness probes.

json
{
  "status": "ok"
}

Readiness Probe

GET /api/v1/ready

Runs the same checks as /api/v1/health but returns 200 only when all subsystems are healthy.
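A deploy script can poll this endpoint before routing traffic to a node. The sketch below is one way to do that with the Python standard library. The function names, retry count, and delay are assumptions chosen for illustration, and any status other than 200 is treated as "not ready".

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code of a GET, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # non-2xx responses still carry a status code
    except (urllib.error.URLError, OSError):
        return 0


def wait_until_ready(url: str, attempts: int = 10, delay: float = 1.0,
                     fetch: Callable[[str], int] = http_status,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `url` until it answers 200 or the attempts run out."""
    for attempt in range(attempts):
        if fetch(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)  # no need to sleep after the final attempt
    return False


# Example call (performs real HTTP requests, so it is left commented out):
# wait_until_ready("https://api.nonsense.ws/api/v1/ready")
```

The `fetch` and `sleep` parameters exist only so the loop can be exercised without a network or real delays.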

SLO Metrics

GET /api/v1/slo

Returns current SLO compliance status across 7 service level objectives.

bash
curl -H "Authorization: Bearer $TOKEN" \
  https://api.nonsense.ws/api/v1/slo

Response:

json
{
  "slos": [
    {"name": "api_availability", "target": 99.9, "actual": 99.95, "status": "healthy", "unit": "percent"},
    {"name": "p99_latency_ms", "target": 1000.0, "actual": 245.3, "status": "healthy", "unit": "ms"},
    {"name": "worker_crash_rate", "target": 1.0, "actual": 0.2, "status": "healthy", "unit": "percent"},
    {"name": "event_loss", "target": 0.0, "actual": 0.0, "status": "healthy", "unit": "events"},
    {"name": "zmq_message_loss", "target": 0.0, "actual": 0.0, "status": "healthy", "unit": "messages"},
    {"name": "circuit_breaker_opens", "target": 3.0, "actual": 0.0, "status": "healthy", "unit": "per_hour"},
    {"name": "memory_pressure", "target": 80.0, "actual": 134.2, "status": "breached", "unit": "mb"}
  ]
}
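The same request can be made from Python. This is a minimal sketch using only the standard library. The helper names are hypothetical, and the token is whatever bearer token your deployment issues.

```python
import json
import urllib.request


def build_slo_request(base_url: str, token: str) -> urllib.request.Request:
    """Build an authenticated GET request for the SLO endpoint."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/v1/slo",
        headers={"Authorization": f"Bearer {token}"},
    )


def fetch_slos(base_url: str, token: str, timeout: float = 5.0) -> dict:
    """Send the request and decode the JSON body (performs a real HTTP call)."""
    with urllib.request.urlopen(build_slo_request(base_url, token), timeout=timeout) as resp:
        return json.load(resp)


def unhealthy_slos(payload: dict) -> list[dict]:
    """Return every SLO entry whose status is not "healthy"."""
    return [slo for slo in payload.get("slos", []) if slo.get("status") != "healthy"]


# Example call (requires network access and a valid token):
# for slo in unhealthy_slos(fetch_slos("https://api.nonsense.ws", token)):
#     print(slo["name"], slo["status"], slo["actual"], slo["unit"])
```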

SLO Status Values

| Status | Meaning |
| --- | --- |
| healthy | Within target with margin |
| degraded | Within target but margin below 10% |
| breached | Target exceeded |
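The status rules can be sketched in code for SLOs whose target is an upper bound (latency, crash rate, loss counters). This is one possible reading, not Sol's confirmed logic: it assumes the margin is `(target - actual) / target`, and it treats a zero target as healthy whenever the actual value does not exceed it. Availability, where higher is better, would need the opposite comparison and is not covered here.

```python
def classify_upper_bound_slo(target: float, actual: float,
                             min_margin: float = 0.10) -> str:
    """Classify an SLO where `actual` must stay at or below `target`.

    Assumption: margin is measured relative to the target.
    """
    if actual > target:
        return "breached"
    if target == 0:
        # At or below a zero target; no relative margin can be computed.
        return "healthy"
    margin = (target - actual) / target
    return "degraded" if margin < min_margin else "healthy"


# P99 latency from the example response: 245.3 ms against a 1000 ms target.
print(classify_upper_bound_slo(1000.0, 245.3))
```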

SLO Definitions

| SLO | Target | Unit | Source |
| --- | --- | --- | --- |
| API Availability | 99.9% | percent | HTTP 2xx / total requests |
| P99 Latency | 1000ms | ms | Latency histogram percentile |
| Worker Crash Rate | 1.0% | percent | Crashed / dispatched tasks |
| Event Loss | 0 | events | Event store loss counter |
| ZMQ Message Loss | 0 | messages | ZMQ message loss counter |
| Circuit Breaker Opens | 3/hour | per_hour | Circuit open counter |
| Memory Pressure | 80MB | mb | BEAM total memory |

Prometheus Metrics

GET /metrics

Exposes metrics in Prometheus exposition format (text/plain). No authentication required.

bash
curl https://api.nonsense.ws/metrics

Output format:

# HELP sol_http_requests_total Total counter for sol_http_requests_total
# TYPE sol_http_requests_total counter
sol_http_requests_total 1234

# HELP sol_errors_total Error counts by module and category
# TYPE sol_errors_total counter
sol_errors_total{module="sol_pipeline",category="timeout"} 3

# HELP sol_latency_histogram_request_duration Microsecond latency histogram
# TYPE sol_latency_histogram_request_duration summary
sol_latency_histogram_request_duration{quantile="0.5"} 12000
sol_latency_histogram_request_duration{quantile="0.9"} 45000
sol_latency_histogram_request_duration{quantile="0.99"} 245000
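For ad hoc checks outside Prometheus, the text format is simple enough to parse by hand. The sketch below handles the line shapes shown above (optional labels, optional trailing timestamp) and skips comment lines. It does not unescape label values, so a real client library is the better choice for anything beyond quick inspection. All names here are hypothetical.

```python
import re

# metric_name{label="value",...} value [timestamp]
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> list[tuple[str, dict, float]]:
    """Parse exposition text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or a HELP/TYPE comment
        match = _SAMPLE.match(line)
        if match is None:
            continue  # ignore anything this sketch does not understand
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


SAMPLE_TEXT = """\
# HELP sol_http_requests_total Total counter for sol_http_requests_total
# TYPE sol_http_requests_total counter
sol_http_requests_total 1234

# TYPE sol_errors_total counter
sol_errors_total{module="sol_pipeline",category="timeout"} 3

# TYPE sol_latency_histogram_request_duration summary
sol_latency_histogram_request_duration{quantile="0.5"} 12000
sol_latency_histogram_request_duration{quantile="0.9"} 45000
sol_latency_histogram_request_duration{quantile="0.99"} 245000
"""

for name, labels, value in parse_metrics(SAMPLE_TEXT):
    print(name, labels, value)
```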

Metric Families

| Family | Type | Description |
| --- | --- | --- |
| sol_*_total | counter | Request, task, worker counters |
| sol_errors_total | counter | Errors by module and category |
| sol_latency_* | summary | Request latency histogram |
| sol_circuit_breaker_state | gauge | Circuit breaker open/closed |
| sol_workers_* | gauge | Worker count and status |
| sol_models_* | gauge | Model availability |
| sol_uptime_seconds | gauge | Server uptime |
| sol_build_info | gauge | Version and build metadata |
| sol_process_* | gauge | Erlang process metrics |
| sol_memory_* | gauge | BEAM memory metrics |
| sol_ets_count | gauge | ETS table row counts |
| sol_task_queue_* | gauge | Task queue depth |
| sol_cluster_nodes | gauge | Connected cluster nodes |
| sol_availability_percent | gauge | Calculated availability |

Monitoring Integration

Grafana

Configure Prometheus as a datasource and import the ZERG dashboard. Key panels:

  • API request rate and error rate
  • P50/P90/P99 latency
  • Worker pool utilization
  • Model inference throughput
  • Circuit breaker state

Alerts

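Sol exposes alert-friendly counters and gauges. Recommended alert rules are listed below; confirm each metric name and unit against the /metrics output before deploying, since the latency summary shown above is reported in microseconds: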

| Alert | Condition | Severity |
| --- | --- | --- |
| HighErrorRate | rate(sol_errors_total[5m]) > 0.05 | warning |
| HighLatency | sol_latency_p99 > 2000 | warning |
| WorkerCrash | rate(sol_worker_crashes[5m]) > 0.1 | critical |
| DiskSpace | sol_disk_available_bytes < 1e9 | critical |
| MemoryPressure | sol_memory_total_bytes > 4e9 | warning |
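These rules can be turned into a Prometheus rule file. The sketch below renders the table rows into the standard `groups` / `rules` layout as plain text, without a YAML library. The group name `sol` and the helper names are assumptions, and the expressions are copied verbatim from the table without checking them against a live /metrics endpoint.

```python
# (alert name, PromQL expression, severity), copied from the table above.
ALERTS = [
    ("HighErrorRate", "rate(sol_errors_total[5m]) > 0.05", "warning"),
    ("HighLatency", "sol_latency_p99 > 2000", "warning"),
    ("WorkerCrash", "rate(sol_worker_crashes[5m]) > 0.1", "critical"),
    ("DiskSpace", "sol_disk_available_bytes < 1e9", "critical"),
    ("MemoryPressure", "sol_memory_total_bytes > 4e9", "warning"),
]


def yaml_quote(value: str) -> str:
    """Wrap a string in YAML single quotes, doubling any embedded quote."""
    return "'" + value.replace("'", "''") + "'"


def render_rules(alerts, group: str = "sol") -> str:
    """Render alert tuples as a Prometheus rule file (hypothetical group name)."""
    lines = ["groups:", f"  - name: {group}", "    rules:"]
    for name, expr, severity in alerts:
        lines += [
            f"      - alert: {name}",
            f"        expr: {yaml_quote(expr)}",
            "        labels:",
            f"          severity: {severity}",
        ]
    return "\n".join(lines) + "\n"


print(render_rules(ALERTS), end="")
```

Expressions are single-quoted so that PromQL containing colons or double-quoted label matchers stays valid YAML.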

OpenTelemetry

Enable OpenTelemetry tracing in sys.config:

erlang
{otel_enabled, true},
{otel_endpoint, "http://tempo:4318"}

Sol exports OTLP spans for HTTP requests, inference calls, and workflow steps.

Released under the MIT License.