Inference API

Sol manages LLM inference instances through a provider adapter system. It supports local llama.cpp inference and 10 remote provider adapters with a unified API surface.

Start Inference

POST /api/v1/inference

Start a new inference instance with the specified model and configuration.

bash
curl -X POST https://api.nonsense.ws/api/v1/inference \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-5.1",
    "gpu_layers": 99,
    "ctx_size": 4096,
    "threads": 4
  }'

Response:

json
{
  "ok": true,
  "id": "inf-1744730000000"
}
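
The same call can be made from a script. The sketch below is a minimal Python client for this endpoint using only the standard library. The function names `build_start_request` and `start_inference` are hypothetical, and treating a response without `"ok": true` as a failure is an assumption, since the error response format is not documented here.

```python
import json
import urllib.request

BASE_URL = "https://api.nonsense.ws"  # host taken from the curl example


def build_start_request(token, model, gpu_layers=99, ctx_size=4096, threads=4):
    """Build (but do not send) the POST /api/v1/inference request."""
    body = {
        "model": model,
        "gpu_layers": gpu_layers,
        "ctx_size": ctx_size,
        "threads": threads,
    }
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/inference",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def start_inference(token, model, **options):
    """Send the request and return the instance id from the response."""
    request = build_start_request(token, model, **options)
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    # Assumption: a failed start is reported as a body without "ok": true.
    if not payload.get("ok"):
        raise RuntimeError(f"start failed: {payload}")
    return payload["id"]
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any network call is made.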

Stop Inference

POST /api/v1/stop

Stop a running inference instance.

bash
curl -X POST https://api.nonsense.ws/api/v1/stop \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"id": "inf-1744730000000"}'

Stop All Instances

POST /api/v1/stop-all

Stop all running inference instances.
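
Both stop endpoints can be wrapped in one helper. This is a sketch under assumptions: `build_stop_request` is a hypothetical name, and sending `/api/v1/stop-all` without a request body is a guess, since this page shows no example for that endpoint.

```python
import json
import urllib.request

BASE_URL = "https://api.nonsense.ws"  # host taken from the curl examples


def build_stop_request(token, instance_id=None):
    """Build a stop request: one instance if an id is given, otherwise stop-all."""
    headers = {"Authorization": f"Bearer {token}"}
    if instance_id is None:
        # Assumption: /api/v1/stop-all takes no request body.
        url = f"{BASE_URL}/api/v1/stop-all"
        data = None
    else:
        url = f"{BASE_URL}/api/v1/stop"
        data = json.dumps({"id": instance_id}).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(url, data=data, headers=headers, method="POST")
```

Passing the result to `urllib.request.urlopen` sends the request.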

List Models

GET /api/v1/models

List all discovered models and their status.

bash
curl -H "Authorization: Bearer $TOKEN" \
  https://api.nonsense.ws/api/v1/models

Response:

json
{
  "models": [
    {
      "name": "glm-5.1",
      "status": "loaded",
      "provider": "zai",
      "ctx_size": 4096
    }
  ]
}
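
A client will often want only the models that are ready to serve. The sketch below filters the response shown above. `loaded_models` is a hypothetical helper, and `"loaded"` is the only status value this page documents, so any other status is simply skipped.

```python
import json


def loaded_models(payload, provider=None):
    """Return names of models with status "loaded", optionally for one provider."""
    names = []
    for model in payload.get("models", []):
        if model.get("status") != "loaded":
            continue
        if provider is not None and model.get("provider") != provider:
            continue
        names.append(model["name"])
    return names


# The sample response from this section.
sample = json.loads("""
{
  "models": [
    {"name": "glm-5.1", "status": "loaded", "provider": "zai", "ctx_size": 4096}
  ]
}
""")

print(loaded_models(sample))  # ['glm-5.1']
```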

Model Discovery

POST /api/v1/discover

Scan for available models and register them. Requires admin permissions.

bash
curl -X POST https://api.nonsense.ws/api/v1/discover \
  -H "Authorization: Bearer $TOKEN"

Response:

json
{
  "discovered": 5
}

Provider Adapters

Sol uses a behaviour-based provider system with 12 adapters. Each adapter implements a common interface for listing models, creating completions, and streaming tokens.

Supported Providers

| Provider | Adapter | Capabilities |
| --- | --- | --- |
| Anthropic | sol_provider_anthropic | Messages API, streaming, vision |
| OpenAI | sol_provider_openai | Chat Completions, streaming, function calling |
| z.ai | sol_provider_zai | GLM-5.1, streaming, vision |
| Ollama | sol_provider_ollama | Local models, streaming |
| Gemini | sol_provider_gemini | Google AI, streaming |
| DeepSeek | sol_provider_deepseek | Chat Completions, streaming |
| Alibaba | sol_provider_alibaba | Qwen models, streaming |
| Bedrock | sol_provider_bedrock | AWS Bedrock, streaming |
| Declarative | sol_provider_declarative | Config-driven adapter for custom endpoints |

Provider Configuration

Providers are configured in sys.config:

erlang
{providers, [
  {anthropic, #{api_key => "sk-ant-..."}},
  {openai, #{api_key => "sk-..."}},
  {zai, #{api_key => "..."}},
  {ollama, #{base_url => "http://localhost:11434"}}
]}

Cross-Provider Message Transformation

Sol provides bidirectional message transformation between Anthropic and OpenAI formats:

  • Anthropic Messages API (/api/v1/messages)
  • OpenAI Chat Completions API (/api/v1/chat/completions)

This enables mid-conversation provider switching without format conversion on the client side.
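
To make the idea concrete, here is a rough Python sketch of what a text-only mapping between the two request formats can look like. It is not Sol's implementation, which runs server-side. The function names are hypothetical, and tool calls, images, and other non-text content blocks are deliberately ignored.

```python
def _flatten(content):
    """Anthropic content is a string or a list of blocks; keep the text only."""
    if isinstance(content, str):
        return content
    return "".join(b["text"] for b in content if b.get("type") == "text")


def anthropic_to_openai(request):
    """Map an Anthropic Messages request to Chat Completions (text only)."""
    messages = []
    system = request.get("system")
    if system:
        # Anthropic keeps the system prompt at the top level;
        # Chat Completions expects it as the first message.
        messages.append({"role": "system", "content": _flatten(system)})
    for message in request["messages"]:
        messages.append(
            {"role": message["role"], "content": _flatten(message["content"])}
        )
    result = {"model": request["model"], "messages": messages}
    if "max_tokens" in request:
        result["max_tokens"] = request["max_tokens"]
    return result


def openai_to_anthropic(request, default_max_tokens=1024):
    """Map a Chat Completions request to Anthropic Messages (text only)."""
    system_parts = []
    messages = []
    for message in request["messages"]:
        if message["role"] == "system":
            system_parts.append(message["content"])
        else:
            messages.append(
                {"role": message["role"], "content": message["content"]}
            )
    result = {
        "model": request["model"],
        # The Anthropic format requires max_tokens, so fall back to a default.
        "max_tokens": request.get("max_tokens", default_max_tokens),
        "messages": messages,
    }
    if system_parts:
        result["system"] = "\n".join(system_parts)
    return result
```

The main structural difference the sketch handles is the system prompt: a top-level field in one format, a leading message in the other.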

Streaming (SSE)

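Inference endpoints support Server-Sent Events (SSE) streaming. The example below requests a stream from the Chat Completions endpoint by setting "stream": true: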

bash
curl -N https://api.nonsense.ws/api/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-5.1",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": true
  }'

SSE event format:

data: {"id":"chatcmpl-123","choices":[{"delta":{"content":"Hello"},"index":0}]}

data: [DONE]

Streaming features:

  • Heartbeat messages to prevent connection timeouts
  • Token-by-token delivery for real-time display
  • Graceful connection termination on completion or error
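
On the client side, consuming this stream comes down to reading `data:` lines until the `[DONE]` sentinel. The sketch below is a minimal parser over an iterable of text lines. It assumes each event carries a single `data:` line and that heartbeats arrive as SSE comment lines (lines starting with `:`); the heartbeat format is not specified on this page. `iter_sse_deltas` is a hypothetical name.

```python
import json


def iter_sse_deltas(lines):
    """Yield content strings from an iterable of SSE text lines."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line or line.startswith(":"):
            # Blank lines separate events. Lines starting with ":" are SSE
            # comments, assumed here to be how heartbeats are sent.
            continue
        if not line.startswith("data:"):
            continue  # ignore other SSE fields such as "event:" or "id:"
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                yield content


stream = [
    'data: {"id":"chatcmpl-123","choices":[{"delta":{"content":"Hello"},"index":0}]}\n',
    "\n",
    ": heartbeat\n",
    "data: [DONE]\n",
]
print("".join(iter_sse_deltas(stream)))  # Hello
```

Because the function takes any iterable of lines, it can be fed decoded lines from an HTTP response object as well as a list in a test.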

Inference Lifecycle

Inference instances use gen_statem for lifecycle management:

stopped -> starting -> ready -> running -> ready
                  \-> failed

| State | Description |
| --- | --- |
| stopped | Instance not running |
| starting | Model loading in progress |
| ready | Instance available for requests |
| running | Processing a request |
| failed | Startup or runtime error |
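
The lifecycle can be read as a transition table. The Python sketch below models only the arrows drawn in the diagram; it is an illustration, not the gen_statem code. The event names (`start`, `loaded`, `error`, `request`, `done`) are hypothetical, and transitions the diagram does not draw, such as stopping an instance or failing at runtime, are left out.

```python
# (current state, event) -> next state, one entry per arrow in the diagram.
TRANSITIONS = {
    ("stopped", "start"): "starting",
    ("starting", "loaded"): "ready",
    ("starting", "error"): "failed",
    ("ready", "request"): "running",
    ("running", "done"): "ready",
}


class Instance:
    """Tracks the lifecycle state of one inference instance."""

    def __init__(self):
        self.state = "stopped"

    def fire(self, event):
        """Apply an event, rejecting transitions the diagram does not allow."""
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"event {event!r} is not valid in state {self.state!r}")
        self.state = TRANSITIONS[key]
        return self.state


instance = Instance()
for event in ("start", "loaded", "request", "done"):
    instance.fire(event)
print(instance.state)  # ready
```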

Inference Configuration

| Setting | Default | Description |
| --- | --- | --- |
| inference_bin | ./llama-server | llama.cpp server binary |
| inference_gpu_layers | 99 | GPU layers to offload |
| inference_ctx_size | 4096 | Context window size |
| inference_threads | 4 | CPU threads |
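
These settings correspond naturally to `llama-server` command-line options. The sketch below shows one plausible mapping in Python. The flags (`--model`, `--port`, `--n-gpu-layers`, `--ctx-size`, `--threads`) are real llama.cpp server options, but how Sol actually assembles the command is an assumption, and `llama_server_argv`, the model path, and the port are illustrative.

```python
# Defaults copied from the settings table above.
DEFAULTS = {
    "inference_bin": "./llama-server",
    "inference_gpu_layers": 99,
    "inference_ctx_size": 4096,
    "inference_threads": 4,
}


def llama_server_argv(model_path, port, settings=None):
    """Build an argument vector for llama-server from Sol-style settings."""
    cfg = {**DEFAULTS, **(settings or {})}
    return [
        cfg["inference_bin"],
        "--model", model_path,
        "--port", str(port),
        "--n-gpu-layers", str(cfg["inference_gpu_layers"]),
        "--ctx-size", str(cfg["inference_ctx_size"]),
        "--threads", str(cfg["inference_threads"]),
    ]


# Hypothetical model path and port, overriding only the thread count.
argv = llama_server_argv("models/example.gguf", 8081, {"inference_threads": 8})
print(" ".join(argv))
```

Returning a list rather than a shell string avoids quoting problems when the vector is handed to a process launcher.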

Local Inference (llama.cpp)

Sol manages local llama.cpp inference instances via erlexec:

  • Automatic process lifecycle (start, monitor, restart)
  • GPU offloading configuration
  • Multiple concurrent instances on different ports
  • Process monitoring using Erlang monitors (not polling)

Local inference is the default. Remote providers serve as fallback when local models are unavailable.

Released under the MIT License.