Inference API

Sol manages LLM inference instances through a provider adapter system. It supports local llama.cpp inference and 10 remote provider adapters with a unified API surface.

Start Inference

POST /api/v1/inference

Start a new inference instance with the specified model and configuration.

```bash
curl -X POST https://api.nonsense.ws/api/v1/inference \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-5.1",
    "gpu_layers": 99,
    "ctx_size": 4096,
    "threads": 4
  }'
```

Response:

```json
{
  "ok": true,
  "id": "inf-1744730000000"
}
```
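
For clients that do not shell out to curl, the same call can be sketched in Python using only the standard library. The helper names `build_start_request` and `instance_id` are illustrative, not part of Sol. The defaults mirror the values in the curl example, and treating a missing or false `ok` as a failure is an assumption, since the error response is not documented here.

```python
import json
import urllib.request

BASE_URL = "https://api.nonsense.ws/api/v1"  # host taken from the curl examples


def build_start_request(token, model, gpu_layers=99, ctx_size=4096, threads=4):
    """Build (but do not send) the POST /inference request."""
    payload = {
        "model": model,
        "gpu_layers": gpu_layers,
        "ctx_size": ctx_size,
        "threads": threads,
    }
    return urllib.request.Request(
        f"{BASE_URL}/inference",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def instance_id(response_body):
    """Return the instance id from a start response.

    Assumption: a failed start is signalled by "ok" being false or absent.
    """
    reply = json.loads(response_body)
    if not reply.get("ok"):
        raise RuntimeError(f"start failed: {reply}")
    return reply["id"]
```

To send the request, pass the result of `build_start_request(...)` to `urllib.request.urlopen` and hand the response body to `instance_id`. Keep the returned id, because the stop endpoint needs it.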

Stop Inference

POST /api/v1/stop

Stop a running inference instance.

```bash
curl -X POST https://api.nonsense.ws/api/v1/stop \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"id": "inf-1744730000000"}'
```

Stop All Instances

POST /api/v1/stop-all

Stop all running inference instances.
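
The two stop endpoints differ only in path and body, which a small helper can make explicit. This is a sketch: `stop_request` is a hypothetical name, and sending no body to `/api/v1/stop-all` is an assumption, since no request example is documented for that endpoint.

```python
import json


def stop_request(instance_id=None):
    """Return (method, path, body) for stopping one instance or all of them.

    With an id, target /api/v1/stop. Without one, target /api/v1/stop-all,
    which is assumed here to take no request body.
    """
    if instance_id is None:
        return ("POST", "/api/v1/stop-all", None)
    return ("POST", "/api/v1/stop", json.dumps({"id": instance_id}))
```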

List Models

GET /api/v1/models

List all discovered models and their status.

```bash
curl -H "Authorization: Bearer $TOKEN" \
  https://api.nonsense.ws/api/v1/models
```

Response:

```json
{
  "models": [
    {
      "name": "glm-5.1",
      "status": "loaded",
      "provider": "zai",
      "ctx_size": 4096
    }
  ]
}
```
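
A client will usually want to filter this list before choosing a model. The sketch below picks out the models reported as `loaded`, optionally for a single provider. `loaded_models` is an illustrative helper, and `loaded` is the only status value shown in this document, so other values are not assumed.

```python
import json


def loaded_models(response_body, provider=None):
    """Return names of models whose status is "loaded", optionally for one provider."""
    models = json.loads(response_body).get("models", [])
    return [
        m["name"]
        for m in models
        if m.get("status") == "loaded"
        and (provider is None or m.get("provider") == provider)
    ]


sample = (
    '{"models": [{"name": "glm-5.1", "status": "loaded", '
    '"provider": "zai", "ctx_size": 4096}]}'
)
print(loaded_models(sample))            # ['glm-5.1']
print(loaded_models(sample, "openai"))  # []
```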

Model Discovery

POST /api/v1/discover

Scan for available models and register them. Requires admin permissions.

```bash
curl -X POST https://api.nonsense.ws/api/v1/discover \
  -H "Authorization: Bearer $TOKEN"
```

Response:

```json
{
  "discovered": 5
}
```

Provider Adapters

Sol uses a behaviour-based provider system with 12 adapters, nine of which are listed below. Each adapter implements a common interface for listing models, creating completions, and streaming tokens.

Supported Providers

| Provider | Adapter | Capabilities |
| --- | --- | --- |
| Anthropic | `sol_provider_anthropic` | Messages API, streaming, vision |
| OpenAI | `sol_provider_openai` | Chat Completions, streaming, function calling |
| z.ai | `sol_provider_zai` | GLM-5.1, streaming, vision |
| Ollama | `sol_provider_ollama` | Local models, streaming |
| Gemini | `sol_provider_gemini` | Google AI, streaming |
| DeepSeek | `sol_provider_deepseek` | Chat Completions, streaming |
| Alibaba | `sol_provider_alibaba` | Qwen models, streaming |
| Bedrock | `sol_provider_bedrock` | AWS Bedrock, streaming |
| Declarative | `sol_provider_declarative` | Config-driven adapter for custom endpoints |

Provider Configuration

Providers are configured in sys.config:

```erlang
{providers, [
  {anthropic, #{api_key => "sk-ant-..."}},
  {openai, #{api_key => "sk-..."}},
  {zai, #{api_key => "..."}},
  {ollama, #{base_url => "http://localhost:11434"}}
]}
```

Cross-Provider Message Transformation

Sol provides bidirectional message transformation between Anthropic and OpenAI formats:

- Anthropic Messages API (`/api/v1/messages`)
- OpenAI Chat Completions API (`/api/v1/chat/completions`)

This enables mid-conversation provider switching without format conversion on the client side.
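
To illustrate what such a transformation involves, here is a minimal Python sketch of the text-only case. It is not Sol's implementation. It covers the best-known structural difference: Anthropic carries the system prompt in a top-level `system` field and requires `max_tokens`, while OpenAI puts the system prompt in the `messages` list. The function names and the `default_max_tokens` value are assumptions, and images, tool calls, and other content types are deliberately ignored.

```python
def anthropic_to_openai(request):
    """Convert an Anthropic Messages-style request to Chat Completions style."""
    messages = []
    if request.get("system"):
        messages.append({"role": "system", "content": request["system"]})
    for msg in request["messages"]:
        content = msg["content"]
        if isinstance(content, list):
            # Anthropic allows a list of content blocks; keep only text blocks.
            content = "".join(b["text"] for b in content if b.get("type") == "text")
        messages.append({"role": msg["role"], "content": content})
    converted = {"model": request["model"], "messages": messages}
    if "max_tokens" in request:
        converted["max_tokens"] = request["max_tokens"]
    return converted


def openai_to_anthropic(request, default_max_tokens=1024):
    """Convert a Chat Completions-style request to Anthropic Messages style."""
    system_parts = [m["content"] for m in request["messages"] if m["role"] == "system"]
    converted = {
        "model": request["model"],
        "messages": [
            {"role": m["role"], "content": m["content"]}
            for m in request["messages"]
            if m["role"] != "system"
        ],
        # Anthropic requires max_tokens; the fallback value is an assumption.
        "max_tokens": request.get("max_tokens", default_max_tokens),
    }
    if system_parts:
        converted["system"] = "\n".join(system_parts)
    return converted
```

Because the conversion happens on the server, a client can keep sending whichever of the two shapes it already speaks.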

Streaming (SSE)

All inference endpoints support Server-Sent Events streaming:

```bash
curl -N https://api.nonsense.ws/api/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-5.1",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": true
  }'
```

SSE event format:

data: {"id":"chatcmpl-123","choices":[{"delta":{"content":"Hello"},"index":0}]}

data: [DONE]

Streaming features:

- Heartbeat messages to prevent connection timeouts
- Token-by-token delivery for real-time display
- Graceful connection termination on completion or error
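
A client consuming this stream has to skip non-data lines, stop at `[DONE]`, and pull the text out of each chunk. The Python sketch below does that for the event format shown above. The exact form of Sol's heartbeat messages is not specified here, so the sketch assumes they arrive as SSE comment lines (lines starting with `:`), which the SSE format lets clients ignore. `iter_sse_deltas` is an illustrative name.

```python
import json


def iter_sse_deltas(lines):
    """Yield content strings from an iterable of decoded SSE lines."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line or line.startswith(":"):
            # Blank event separators and comment lines (assumed heartbeat form).
            continue
        if not line.startswith("data:"):
            # Other SSE fields (event:, id:, retry:) are ignored in this sketch.
            continue
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


stream = [
    'data: {"id":"chatcmpl-123","choices":[{"delta":{"content":"Hello"},"index":0}]}',
    "",
    ": heartbeat",
    "data: [DONE]",
]
print("".join(iter_sse_deltas(stream)))  # Hello
```

The function accepts any iterable of text lines, so it can be fed from a decoded HTTP response or, as here, from a list in a test.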

Inference Lifecycle

Inference instances use gen_statem for lifecycle management:

stopped -> starting -> ready -> running -> ready
                  \-> failed

| State | Description |
| --- | --- |
| `stopped` | Instance not running |
| `starting` | Model loading in progress |
| `ready` | Instance available for requests |
| `running` | Processing a request |
| `failed` | Startup or runtime error |
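
The diagram can be read as a transition table. The Python sketch below encodes only the arrows drawn above. It is a model for reasoning about the lifecycle, not Sol's `gen_statem` code. Transitions that are implied but not drawn, such as returning to `stopped` or moving from `running` to `failed` on a runtime error, are left out instead of being guessed.

```python
# Transitions exactly as drawn in the lifecycle diagram.
TRANSITIONS = {
    "stopped": {"starting"},
    "starting": {"ready", "failed"},
    "ready": {"running"},
    "running": {"ready"},
    "failed": set(),
}


def can_transition(current, target):
    """Return True if the diagram draws an arrow from current to target."""
    return target in TRANSITIONS.get(current, set())


def validate_path(states):
    """Return True if every consecutive pair of states is a drawn transition."""
    return all(can_transition(a, b) for a, b in zip(states, states[1:]))


print(validate_path(["stopped", "starting", "ready", "running", "ready"]))  # True
print(can_transition("stopped", "running"))                                 # False
```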

Inference Configuration

| Setting | Default | Description |
| --- | --- | --- |
| `inference_bin` | `./llama-server` | llama.cpp server binary |
| `inference_gpu_layers` | `99` | GPU layers to offload |
| `inference_ctx_size` | `4096` | Context window size |
| `inference_threads` | `4` | CPU threads |
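
These settings plausibly correspond to llama.cpp's `llama-server` command-line options `--n-gpu-layers`, `--ctx-size`, and `--threads`. The Python sketch below shows that mapping. The flags are real `llama-server` options, but the claim that Sol builds its command line this way is an assumption, and `llama_server_argv`, the model path, and the port are hypothetical.

```python
# Defaults copied from the settings table above.
DEFAULTS = {
    "inference_bin": "./llama-server",
    "inference_gpu_layers": 99,
    "inference_ctx_size": 4096,
    "inference_threads": 4,
}


def llama_server_argv(model_path, port, settings=None):
    """Build an argv list for llama-server from Sol-style settings.

    Assumption: each setting maps one-to-one onto a llama-server flag.
    """
    cfg = {**DEFAULTS, **(settings or {})}
    return [
        cfg["inference_bin"],
        "--model", model_path,
        "--port", str(port),
        "--n-gpu-layers", str(cfg["inference_gpu_layers"]),
        "--ctx-size", str(cfg["inference_ctx_size"]),
        "--threads", str(cfg["inference_threads"]),
    ]


# Hypothetical model path and port, with one setting overridden.
print(llama_server_argv("/models/example.gguf", 8081, {"inference_threads": 8}))
```

Giving each instance its own `--port` is what allows several instances to run side by side.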

Local Inference (llama.cpp)

Sol manages local llama.cpp inference instances via erlexec:

- Automatic process lifecycle (start, monitor, restart)
- GPU offloading configuration
- Multiple concurrent instances on different ports
- Process monitoring using Erlang monitors (not polling)

Local inference is the default. Remote providers serve as fallback when local models are unavailable.
