Health & SLO API

Sol exposes health checks, SLO compliance monitoring, and Prometheus metrics for production observability.

Health Check

GET /api/v1/health

Returns the full system health status across 12 subsystems. No authentication required (exempt path).

bash

curl https://api.nonsense.ws/api/v1/health

Healthy response (200):

json

{
  "status": "ok",
  "checks": [
    {"name": "mnesia", "status": "ok", "details": {"tables": 18}},
    {"name": "zmq", "status": "ok", "details": {"workers": 3}},
    {"name": "mango", "status": "ok", "details": {"enabled": true}},
    {"name": "postgres", "status": "ok", "details": {"configured": true}},
    {"name": "ets", "status": "ok", "details": {"tables": 14, "optional": 1}},
    {"name": "listeners", "status": "ok", "details": {"http_port": 11434}},
    {"name": "disk", "status": "ok", "details": {"checked": 3}},
    {"name": "process_util", "status": "ok", "details": {"count": 234, "limit": 262144, "ratio": 0.001}},
    {"name": "memory_util", "status": "ok", "details": {"total_bytes": 134217728, "binary_bytes": 8388608}},
    {"name": "message_queues", "status": "ok", "details": {"backlogged": 0, "threshold": 100}},
    {"name": "models", "status": "ok", "details": {"ready": 2}},
    {"name": "workers", "status": "ok", "details": {"stale": 0}}
  ]
}

Degraded response (503), abbreviated here to three checks:

json

{
  "status": "degraded",
  "checks": [
    {"name": "mnesia", "status": "ok", "details": {"tables": 18}},
    {"name": "zmq", "status": "error", "details": {"reason": "gateway not responding"}},
    {"name": "disk", "status": "warning", "details": {"critical": [], "warning": ["priv/mnesia"], "checked": 3}}
  ]
}
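A client will usually want to know which checks caused a degraded status. The following is a minimal sketch, assuming the response has the shape shown above; the helper name `failing_checks` is hypothetical and not part of Sol.

```python
import json


def failing_checks(health: dict) -> list[tuple[str, str]]:
    """Return (name, status) for every check whose status is not "ok"."""
    return [
        (check["name"], check["status"])
        for check in health.get("checks", [])
        if check.get("status") != "ok"
    ]


# Degraded payload copied from the example response above.
degraded = json.loads("""
{
  "status": "degraded",
  "checks": [
    {"name": "mnesia", "status": "ok", "details": {"tables": 18}},
    {"name": "zmq", "status": "error", "details": {"reason": "gateway not responding"}},
    {"name": "disk", "status": "warning", "details": {"critical": [], "warning": ["priv/mnesia"], "checked": 3}}
  ]
}
""")

for name, status in failing_checks(degraded):
    print(f"{name}: {status}")
```

Treating every status other than "ok" as failing is a design choice of this sketch; a stricter client might page only on "error" and log "warning".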

Subsystems Checked

Check	Description	Error Conditions
`mnesia`	Mnesia database status	Not started, not running
`zmq`	ZMQ gateway and worker connections	Gateway not responding
`mango`	Mango auth service connectivity	Not connected
`postgres`	PostgreSQL connection pool	Health check failed
`ets`	14 core + 6 lazy + 1 optional ETS tables	Core tables missing
`listeners`	Cowboy HTTP listener bound to port	No listener
`disk`	Disk space for Mnesia, backups, logs	Below critical (100MB) or warning (1GB)
`process_util`	Erlang process count vs limit	Above 80% (warning), 95% (error)
`memory_util`	Total BEAM memory usage	Above 4GB (warning)
`message_queues`	Process mailbox backlog	Any process above 100 messages
`models`	Ready inference models	None; the check always passes
`workers`	ZMQ worker staleness (30s threshold)	Workers not seen recently
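The numeric thresholds in the table can be read as simple comparisons. The sketch below illustrates the `process_util` and `disk` rows only. It assumes strict comparisons at the boundaries and decimal units (1 MB = 10^6 bytes), neither of which the table specifies; the function names and the returned level names are hypothetical.

```python
MB = 1_000_000       # assumption: decimal megabytes
GB = 1_000_000_000   # assumption: decimal gigabytes


def classify_process_util(count: int, limit: int) -> str:
    """Classify Erlang process usage: above 80% warns, above 95% errors."""
    ratio = count / limit
    if ratio > 0.95:
        return "error"
    if ratio > 0.80:
        return "warning"
    return "ok"


def classify_disk(free_bytes: int) -> str:
    """Classify free disk space: below 100 MB is critical, below 1 GB warns."""
    if free_bytes < 100 * MB:
        return "critical"
    if free_bytes < GB:
        return "warning"
    return "ok"


# Values from the healthy response above: 234 processes, limit 262144.
print(classify_process_util(234, 262144))
print(classify_disk(500 * MB))
```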

Liveness Probe

GET /health

Lightweight check that always returns 200 if the HTTP server is running. Suitable for Kubernetes liveness probes.

json

{
  "status": "ok"
}

Readiness Probe

GET /api/v1/ready

Runs the same checks as GET /api/v1/health and returns 200 only when all subsystems are healthy. Suitable for Kubernetes readiness probes.
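Outside Kubernetes, a deploy script can poll the readiness endpoint before sending traffic. This is a rough sketch using only the Python standard library; `http_status` and `wait_until_ready` are hypothetical helpers, and the retry count and delay are arbitrary defaults rather than values recommended by Sol.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code of a GET request, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # non-2xx responses are raised as HTTPError
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, timeout


def wait_until_ready(probe: Callable[[], int], attempts: int = 10,
                     delay: float = 1.0) -> bool:
    """Call probe() until it returns 200 or the attempts run out."""
    for attempt in range(attempts):
        if probe() == 200:
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

A call such as `wait_until_ready(lambda: http_status("https://api.nonsense.ws/api/v1/ready"))` would then block until the service reports ready or the attempts are exhausted. Passing the probe as a callable keeps the retry loop testable without a network.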

SLO Metrics

GET /api/v1/slo

Returns the current compliance status of 7 service level objectives (SLOs). Requests carry a bearer token, as in the example below.

bash

curl -H "Authorization: Bearer $TOKEN" \
  https://api.nonsense.ws/api/v1/slo

Response:

json

{
  "slos": [
    {"name": "api_availability", "target": 99.9, "actual": 99.95, "status": "healthy", "unit": "percent"},
    {"name": "p99_latency_ms", "target": 1000.0, "actual": 245.3, "status": "healthy", "unit": "ms"},
    {"name": "worker_crash_rate", "target": 1.0, "actual": 0.2, "status": "healthy", "unit": "percent"},
    {"name": "event_loss", "target": 0.0, "actual": 0.0, "status": "healthy", "unit": "events"},
    {"name": "zmq_message_loss", "target": 0.0, "actual": 0.0, "status": "healthy", "unit": "messages"},
    {"name": "circuit_breaker_opens", "target": 3.0, "actual": 0.0, "status": "healthy", "unit": "per_hour"},
    {"name": "memory_pressure", "target": 80.0, "actual": 134.2, "status": "breached", "unit": "mb"}
  ]
}

SLO Status Values

Status	Meaning
`healthy`	Within target with at least 10% margin
`degraded`	Within target but margin below 10%
`breached`	Target violated

SLO Definitions

SLO	Target	Unit	Source
API Availability	99.9%	percent	HTTP 2xx / total requests
P99 Latency	1000ms	ms	Latency histogram percentile
Worker Crash Rate	1.0%	percent	Crashed / dispatched tasks
Event Loss	0	events	Event store loss counter
ZMQ Message Loss	0	messages	ZMQ message loss counter
Circuit Breaker Opens	3/hour	per_hour	Circuit open counter
Memory Pressure	80MB	mb	BEAM total memory
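The Source column describes two of the SLOs as ratios of counters. The sketch below shows that arithmetic and a plain target comparison. It assumes availability is a floor (higher is better) and every other SLO is a ceiling (lower is better); the handling of empty counters is also an assumption, and all function names are hypothetical.

```python
# Assumption: only availability is "higher is better".
FLOOR_SLOS = {"api_availability"}


def availability_percent(ok_responses: int, total_requests: int) -> float:
    """HTTP 2xx responses as a percentage of all requests."""
    if total_requests == 0:
        return 100.0  # assumption: no traffic counts as fully available
    return 100.0 * ok_responses / total_requests


def crash_rate_percent(crashed: int, dispatched: int) -> float:
    """Crashed tasks as a percentage of dispatched tasks."""
    if dispatched == 0:
        return 0.0  # assumption: no dispatched tasks means no crashes
    return 100.0 * crashed / dispatched


def is_breached(name: str, actual: float, target: float) -> bool:
    """True when the measured value violates the target."""
    if name in FLOOR_SLOS:
        return actual < target
    return actual > target


actual = availability_percent(9995, 10000)
print(actual, is_breached("api_availability", actual, 99.9))
```

The 10% margin that separates `healthy` from `degraded` is left out on purpose, because the document does not define how the margin is measured.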

Prometheus Metrics

GET /metrics

Exposes metrics in Prometheus exposition format (text/plain). No authentication required.

bash

curl https://api.nonsense.ws/metrics

Output format:

# HELP sol_http_requests_total Total counter for sol_http_requests_total
# TYPE sol_http_requests_total counter
sol_http_requests_total 1234

# HELP sol_errors_total Error counts by module and category
# TYPE sol_errors_total counter
sol_errors_total{module="sol_pipeline",category="timeout"} 3

# HELP sol_latency_histogram_request_duration Microsecond latency histogram
# TYPE sol_latency_histogram_request_duration summary
sol_latency_histogram_request_duration{quantile="0.5"} 12000
sol_latency_histogram_request_duration{quantile="0.9"} 45000
sol_latency_histogram_request_duration{quantile="0.99"} 245000
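For quick scripting without a Prometheus server, the exposition text can be parsed line by line. The following is a simplified sketch, not a full parser: it ignores timestamps and does not handle escaped quotes or `}` inside label values. The function name `parse_samples` is hypothetical.

```python
import re

# metric name, optional {labels}, then the sample value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text: str) -> list[tuple[str, dict, float]]:
    """Return (name, labels, value) for each sample line, skipping comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or HELP/TYPE comment
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# Lines copied from the sample output above.
text = """\
# TYPE sol_http_requests_total counter
sol_http_requests_total 1234
sol_errors_total{module="sol_pipeline",category="timeout"} 3
sol_latency_histogram_request_duration{quantile="0.99"} 245000
"""

for name, labels, value in parse_samples(text):
    if labels.get("quantile") == "0.99":
        print(f"p99 = {value / 1000:.1f} ms")  # values are in microseconds
```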

Metric Families

Family	Type	Description
`sol_*_total`	counter	Request, task, worker counters
`sol_errors_total`	counter	Errors by module and category
`sol_latency_*`	summary	Request latency quantiles (microseconds)
`sol_circuit_breaker_state`	gauge	Circuit breaker open/closed
`sol_workers_*`	gauge	Worker count and status
`sol_models_*`	gauge	Model availability
`sol_uptime_seconds`	gauge	Server uptime
`sol_build_info`	gauge	Version and build metadata
`sol_process_*`	gauge	Erlang process metrics
`sol_memory_*`	gauge	BEAM memory metrics
`sol_ets_count`	gauge	ETS table row counts
`sol_task_queue_*`	gauge	Task queue depth
`sol_cluster_nodes`	gauge	Connected cluster nodes
`sol_availability_percent`	gauge	Calculated availability

Monitoring Integration

Grafana

Configure Prometheus as a datasource and import the ZERG dashboard. Key panels:

API request rate and error rate
P50/P90/P99 latency
Worker pool utilization
Model inference throughput
Circuit breaker state

Alerts

Sol exposes alert-friendly counters. Recommended alert rules:

Alert	Condition	Severity
HighErrorRate	`rate(sol_errors_total[5m]) > 0.05`	warning
HighLatency	`sol_latency_p99 > 2000`	warning
WorkerCrash	`rate(sol_worker_crashes[5m]) > 0.1`	critical
DiskSpace	`sol_disk_available_bytes < 1e9`	critical
MemoryPressure	`sol_memory_total_bytes > 4e9`	warning
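Several conditions above use `rate(...[5m])`, which is the average per-second increase of a counter over the window. The sketch below approximates that idea to show what a threshold such as `> 0.05` means. It is a simplification: Prometheus also extrapolates to the window boundaries, which is omitted here. The function name `simple_rate` and the sample data are hypothetical.

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Average per-second counter increase.

    samples: (timestamp_seconds, counter_value) pairs ordered by time.
    A drop in value is treated as a counter reset to zero.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Hypothetical scrapes of sol_errors_total across a 5-minute window.
window = [(0.0, 100.0), (150.0, 110.0), (300.0, 130.0)]
rate = simple_rate(window)
print(rate, rate > 0.05)  # compare against the HighErrorRate threshold
```

Read this way, a threshold of 0.05 on a 5-minute rate corresponds to roughly 15 errors in the window.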

OpenTelemetry

Enable OpenTelemetry tracing in sys.config:

erlang

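%% Add these entries to the Sol application's environment list in sys.config.
{otel_enabled, true},
{otel_endpoint, "http://tempo:4318"}  %% 4318 is the default OTLP/HTTP port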

Sol exports OTLP spans for HTTP requests, inference calls, and workflow steps.
