Inference API
Sol manages LLM inference instances through a provider adapter system. It supports local llama.cpp inference and 10 remote provider adapters with a unified API surface.
Start Inference
POST /api/v1/inference
Start a new inference instance with the specified model and configuration.
curl -X POST https://api.nonsense.ws/api/v1/inference \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-5.1",
    "gpu_layers": 99,
    "ctx_size": 4096,
    "threads": 4
  }'
Response:
{
  "ok": true,
  "id": "inf-1744730000000"
}
Stop Inference
POST /api/v1/stop
Stop a running inference instance.
curl -X POST https://api.nonsense.ws/api/v1/stop \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"id": "inf-1744730000000"}'
Stop All Instances
POST /api/v1/stop-all
Stop all running inference instances.
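The start, stop, and stop-all calls above can also be issued from a script. The following is a minimal sketch using only the Python standard library; the helper names (`build_request`, `start_request`, `stop_request`, `stop_all_request`, `send`) are hypothetical, and sending an empty JSON object to `/stop-all` is an assumption, since no request body is documented for that endpoint.

```python
import json
import urllib.request

BASE_URL = "https://api.nonsense.ws/api/v1"


def build_request(path, token, payload=None):
    """Build (but do not send) an authenticated JSON POST request."""
    body = json.dumps(payload or {}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}{path}",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def start_request(token, model, gpu_layers=99, ctx_size=4096, threads=4):
    """Request that starts an instance; the field names mirror the curl example."""
    return build_request("/inference", token, {
        "model": model,
        "gpu_layers": gpu_layers,
        "ctx_size": ctx_size,
        "threads": threads,
    })


def stop_request(token, instance_id):
    """Request that stops one instance by the id returned from the start call."""
    return build_request("/stop", token, {"id": instance_id})


def stop_all_request(token):
    """Request that stops every instance (empty body is an assumption)."""
    return build_request("/stop-all", token)


def send(request):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

A typical sequence would be `send(start_request(token, "glm-5.1"))`, reading `id` from the response, and later `send(stop_request(token, instance_id))`.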
List Models
GET /api/v1/models
List all discovered models and their status.
curl -H "Authorization: Bearer $TOKEN" \
  https://api.nonsense.ws/api/v1/models
Response:
{
  "models": [
    {
      "name": "glm-5.1",
      "status": "loaded",
      "provider": "zai",
      "ctx_size": 4096
    }
  ]
}
Model Discovery
POST /api/v1/discover
Scan for available models and register them. Requires admin permissions.
curl -X POST https://api.nonsense.ws/api/v1/discover \
  -H "Authorization: Bearer $TOKEN"
Response:
{
  "discovered": 5
}
Provider Adapters
Sol uses a behaviour-based provider system with 12 adapters. Each adapter implements a common interface for listing models, creating completions, and streaming tokens.
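The shape of such a common interface can be illustrated outside of Sol itself. The sketch below is written in Python purely for illustration; the adapters listed in this document are named like Erlang modules, and the method names `list_models`, `create_completion`, and `stream_tokens`, the `EchoAdapter` class, and the `REGISTRY` dictionary are all hypothetical rather than Sol's actual callbacks.

```python
from abc import ABC, abstractmethod
from typing import Iterator


class ProviderAdapter(ABC):
    """Hypothetical common interface: list models, complete, stream."""

    @abstractmethod
    def list_models(self) -> list:
        """Return the model names this provider can serve."""

    @abstractmethod
    def create_completion(self, model: str, messages: list) -> str:
        """Return the full completion text for a conversation."""

    @abstractmethod
    def stream_tokens(self, model: str, messages: list) -> Iterator[str]:
        """Yield the completion piece by piece."""


class EchoAdapter(ProviderAdapter):
    """Toy adapter that echoes the last message; stands in for a real provider."""

    def list_models(self):
        return ["echo-1"]

    def create_completion(self, model, messages):
        return messages[-1]["content"]

    def stream_tokens(self, model, messages):
        # Split on whitespace to imitate token-by-token delivery.
        for word in self.create_completion(model, messages).split():
            yield word


# Callers look up an adapter by provider name and never touch provider details.
REGISTRY = {"echo": EchoAdapter()}


def complete(provider, model, messages):
    return REGISTRY[provider].create_completion(model, messages)
```

Because every adapter exposes the same three operations, calling code can dispatch on the provider name alone.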
Supported Providers
| Provider | Adapter | Capabilities |
|---|---|---|
| Anthropic | sol_provider_anthropic | Messages API, streaming, vision |
| OpenAI | sol_provider_openai | Chat Completions, streaming, function calling |
| z.ai | sol_provider_zai | GLM-5.1, streaming, vision |
| Ollama | sol_provider_ollama | Local models, streaming |
| Gemini | sol_provider_gemini | Google AI, streaming |
| DeepSeek | sol_provider_deepseek | Chat Completions, streaming |
| Alibaba | sol_provider_alibaba | Qwen models, streaming |
| Bedrock | sol_provider_bedrock | AWS Bedrock, streaming |
| Declarative | sol_provider_declarative | Config-driven adapter for custom endpoints |
Provider Configuration
Providers are configured in sys.config:
{providers, [
    {anthropic, #{api_key => "sk-ant-..."}},
    {openai, #{api_key => "sk-..."}},
    {zai, #{api_key => "..."}},
    {ollama, #{base_url => "http://localhost:11434"}}
]}
Cross-Provider Message Transformation
Sol provides bidirectional message transformation between Anthropic and OpenAI formats:
- Anthropic Messages API (/api/v1/messages)
- OpenAI Chat Completions API (/api/v1/chat/completions)
This enables mid-conversation provider switching without format conversion on the client side.
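To make the idea of bidirectional transformation concrete, here is a minimal, text-only sketch of converting request bodies between the two public formats. It is not Sol's implementation: the function names and the `default_max_tokens` value are hypothetical, and images, tool calls, and block-list `system` prompts are deliberately left out.

```python
def anthropic_to_openai(request):
    """Convert an Anthropic Messages request body to Chat Completions shape."""
    messages = []
    # Anthropic carries the system prompt as a top-level field;
    # OpenAI expects it as the first message.
    if request.get("system"):
        messages.append({"role": "system", "content": request["system"]})
    for msg in request["messages"]:
        content = msg["content"]
        # Anthropic content may be a list of typed blocks; keep only the text.
        if isinstance(content, list):
            content = "".join(
                block["text"] for block in content if block.get("type") == "text"
            )
        messages.append({"role": msg["role"], "content": content})
    out = {"model": request["model"], "messages": messages}
    if "max_tokens" in request:
        out["max_tokens"] = request["max_tokens"]
    return out


def openai_to_anthropic(request, default_max_tokens=1024):
    """Convert a Chat Completions request body to Anthropic Messages shape."""
    system_parts = [
        m["content"] for m in request["messages"] if m["role"] == "system"
    ]
    out = {
        "model": request["model"],
        # max_tokens is mandatory in the Messages API, so supply a fallback.
        "max_tokens": request.get("max_tokens", default_max_tokens),
        "messages": [
            {"role": m["role"], "content": m["content"]}
            for m in request["messages"]
            if m["role"] != "system"
        ],
    }
    if system_parts:
        out["system"] = "\n".join(system_parts)
    return out
```

The main structural difference handled here is the system prompt, which moves between a top-level field and a leading message.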
Streaming (SSE)
All inference endpoints support Server-Sent Events streaming:
curl -N https://api.nonsense.ws/api/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-5.1",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": true
  }'
SSE event format:
data: {"id":"chatcmpl-123","choices":[{"delta":{"content":"Hello"},"index":0}]}

data: [DONE]
Streaming features:
- Heartbeat messages to prevent connection timeouts
- Token-by-token delivery for real-time display
- Graceful connection termination on completion or error
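A client consuming this stream needs to pick out the `data:` lines, stop at `[DONE]`, and tolerate heartbeats. The sketch below parses an iterable of already-decoded lines. The heartbeat wire format is not specified here, so it assumes heartbeats arrive as SSE comment lines (starting with `:`) or as blank lines, and the function name `iter_content` is hypothetical.

```python
import json


def iter_content(lines):
    """Yield content deltas from an iterable of SSE text lines."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Blank lines separate events; ':' lines are SSE comments
        # (assumed here to be how heartbeats are delivered).
        if not line or line.startswith(":"):
            continue
        if not line.startswith("data:"):
            continue  # ignore other SSE fields such as 'event:' or 'id:'
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel shown in the event format above
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text
```

In practice the lines would come from a streaming HTTP response read line by line; joining the yielded pieces reconstructs the full completion text.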
Inference Lifecycle
Inference instances use gen_statem for lifecycle management:
stopped -> starting -> ready -> running -> ready
               \-> failed
| State | Description |
|---|---|
| stopped | Instance not running |
| starting | Model loading in progress |
| ready | Instance available for requests |
| running | Processing a request |
| failed | Startup or runtime error |
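The lifecycle above can be modeled as a small transition table. This Python sketch only illustrates the state graph, not the gen_statem implementation; it assumes `failed` is reachable from both `starting` and `running` (based on the description "Startup or runtime error"), and it omits any transition back to `stopped`, which the diagram does not show. The `Instance` and `InvalidTransition` names are hypothetical.

```python
# Allowed transitions, taken from the diagram and state table above.
TRANSITIONS = {
    "stopped": {"starting"},
    "starting": {"ready", "failed"},   # failed on startup error (assumption)
    "ready": {"running"},
    "running": {"ready", "failed"},    # failed on runtime error (assumption)
    "failed": set(),
}


class InvalidTransition(Exception):
    """Raised when a transition is not present in TRANSITIONS."""


class Instance:
    """Tracks the lifecycle state of one inference instance."""

    def __init__(self):
        self.state = "stopped"

    def transition(self, new_state):
        if new_state not in TRANSITIONS[self.state]:
            raise InvalidTransition(f"{self.state} -> {new_state}")
        self.state = new_state
```

For example, an instance moves through `starting`, `ready`, and `running`, then back to `ready` once the request completes; asking a `stopped` instance to go straight to `running` raises `InvalidTransition`.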
Inference Configuration
| Setting | Default | Description |
|---|---|---|
| inference_bin | ./llama-server | llama.cpp server binary |
| inference_gpu_layers | 99 | GPU layers to offload |
| inference_ctx_size | 4096 | Context window size |
| inference_threads | 4 | CPU threads |
Local Inference (llama.cpp)
Sol manages local llama.cpp inference instances via erlexec:
- Automatic process lifecycle (start, monitor, restart)
- GPU offloading configuration
- Multiple concurrent instances on different ports
- Process monitoring using Erlang monitors (not polling)
Local inference is the default. Remote providers serve as fallback when local models are unavailable.
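The local-first, remote-fallback rule can be sketched as a simple selection function. How Sol actually decides that a local model is "unavailable" is not described here, so this example assumes the caller already knows which models are loaded locally and which each remote provider offers; the function name `pick_provider` and the `"local"` label are hypothetical.

```python
def pick_provider(model, local_models, remote_models):
    """Choose where to run `model`: local first, then remote fallback.

    local_models:  set of model names available locally.
    remote_models: dict mapping provider name -> set of model names,
                   checked in insertion order.
    Returns a provider label, or None if no backend serves the model.
    """
    if model in local_models:
        return "local"
    for provider, names in remote_models.items():
        if model in names:
            return provider
    return None
```

With `local_models = {"llama-3"}` and `remote_models = {"zai": {"glm-5.1"}}`, a request for `llama-3` stays local while a request for `glm-5.1` falls back to `zai`.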