Lemonade Server Spec
The Lemonade Server is a standards-compliant server process that provides an HTTP API to enable integration with other applications.
Lemonade Server currently supports these backends:
| Backend | Model Format | Description |
|---|---|---|
| Llama.cpp | .GGUF | Uses llama.cpp's `llama-server` backend. More details here. |
| ONNX Runtime GenAI (OGA) | .ONNX | Uses Lemonade's own `ryzenai-server` backend. |
| FastFlowLM | .q4nx | Uses FLM's `flm serve` backend. More details here. |
| whisper.cpp | .bin | Uses whisper.cpp's `whisper-server` backend for audio transcription. Models: Whisper-Tiny, Whisper-Base, Whisper-Small. |
| stable-diffusion.cpp | .safetensors | Uses sd.cpp's `sd-cli` backend for image generation. Models: SD-Turbo, SDXL-Turbo, etc. |
| Kokoros | .onnx | Uses Kokoro's `koko` backend for speech generation. Models: kokoro-v1 |
Endpoints Overview
The key endpoints of the OpenAI API are available.
We are also actively investigating and developing additional endpoints that will improve the experience of local applications.
OpenAI-Compatible Endpoints
- POST `/api/v1/chat/completions` - Chat Completions (messages -> completion)
- POST `/api/v1/completions` - Text Completions (prompt -> completion)
- POST `/api/v1/embeddings` - Embeddings (text -> vector representations)
- POST `/api/v1/responses` - Responses (prompt|messages -> event)
- POST `/api/v1/audio/transcriptions` - Audio Transcription (audio file -> text)
- POST `/api/v1/audio/speech` - Text to speech (text -> audio)
- WS `/realtime` - Realtime Audio Transcription (streaming audio -> text, OpenAI SDK compatible)
- WS `/logs/stream` - Log Streaming (subscribe -> snapshot + live log entries)
- POST `/api/v1/images/generations` - Image Generation (prompt -> image)
- POST `/api/v1/images/edits` - Image Editing (image + prompt -> edited image)
- POST `/api/v1/images/variations` - Image Variations (image -> varied image)
- POST `/api/v1/images/upscale` - Image Upscaling (image + ESRGAN model -> upscaled image)
- GET `/api/v1/models` - List models available locally
- GET `/api/v1/models/{model_id}` - Retrieve a specific model by ID
llama.cpp Endpoints
These endpoints defined by llama.cpp extend the OpenAI-compatible API with additional functionality.
- POST `/api/v1/reranking` - Reranking (query + documents -> relevance-scored documents)
Lemonade-Specific Endpoints
We have designed a set of Lemonade-specific endpoints that extend the existing cloud-focused APIs (e.g., OpenAI) to better support client applications. These extensions give native applications a greater degree of UI/UX responsiveness by letting them:
- Download models at setup time.
- Pre-load models at UI-loading-time, as opposed to completion-request time.
- Unload models to save memory space.
- Understand system resources and state to make dynamic choices.
The additional endpoints are:
- POST `/api/v1/install` - Install or update a backend
- POST `/api/v1/uninstall` - Remove a backend
- POST `/api/v1/pull` - Install a model
- GET `/api/v1/pull/variants` - Enumerate GGUF variants for a Hugging Face checkpoint
- POST `/api/v1/delete` - Delete a model
- POST `/api/v1/load` - Load a model
- POST `/api/v1/unload` - Unload a model
- GET `/api/v1/health` - Check server status, such as models loaded
- GET `/api/v1/stats` - Performance statistics from the last request
- GET `/api/v1/system-info` - System information and device enumeration
- GET `/live` - Check server liveness for load balancers and orchestrators
Ollama-Compatible API
Lemonade supports the Ollama API, allowing applications built for Ollama to work with Lemonade without modification.
To enable auto-detection by Ollama-integrated apps, configure the server to use the Ollama default port. See Server Configuration for how to change the port.
| Endpoint | Status | Notes |
|---|---|---|
| `POST /api/chat` | Supported | Streaming and non-streaming |
| `POST /api/generate` | Supported | Text completion + image generation |
| `GET /api/tags` | Supported | Lists downloaded models |
| `POST /api/show` | Supported | Model details |
| `DELETE /api/delete` | Supported | |
| `POST /api/pull` | Supported | Download with progress |
| `POST /api/embed` | Supported | New embeddings format |
| `POST /api/embeddings` | Supported | Legacy embeddings |
| `GET /api/ps` | Supported | Running models |
| `GET /api/version` | Supported | |
| `POST /api/create` | Not supported | Returns 501 |
| `POST /api/copy` | Not supported | Returns 501 |
| `POST /api/push` | Not supported | Returns 501 |
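As an illustration, the following is a minimal Python sketch of calling the Ollama-compatible chat endpoint; it assumes the port used in this document's examples (13305) and that the Qwen3-0.6B-GGUF model is available locally.

import requests

# Sketch: assumes Lemonade Server is listening on localhost:13305.
resp = requests.post(
    "http://localhost:13305/api/chat",
    json={
        "model": "Qwen3-0.6B-GGUF",
        "messages": [{"role": "user", "content": "What is the population of Paris?"}],
        "stream": False,
    },
)
# Non-streaming Ollama-style responses carry the reply under "message".
print(resp.json()["message"]["content"])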
Anthropic-Compatible API (Initial)
Lemonade supports an initial Anthropic Messages compatibility endpoint for applications that call Claude-style APIs.
| Endpoint | Status | Notes |
|---|---|---|
| `POST /v1/messages` | Supported | Supports both streaming and non-streaming. Query params like `?beta=true` are accepted. |
Current scope focuses on message generation parity for common fields (model, messages, system, max_tokens, temperature, stream, and basic tools). Unsupported or unimplemented Anthropic-specific fields are ignored and surfaced via warning logs/headers.
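The following is a minimal sketch of calling the Anthropic-compatible endpoint from Python with the requests library; it assumes the example port (13305), a locally available model, and only the common fields listed above.

import requests

# Sketch: assumes Lemonade Server on localhost:13305 and a locally available model.
resp = requests.post(
    "http://localhost:13305/v1/messages",
    json={
        "model": "Qwen3-0.6B-GGUF",
        "max_tokens": 256,
        "messages": [{"role": "user", "content": "What is the population of Paris?"}],
    },
)
# Print the full Anthropic-style response body.
print(resp.json())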
Multi-Model Support
Lemonade Server supports loading multiple models simultaneously, allowing you to keep frequently-used models in memory for faster switching. The server uses a Least Recently Used (LRU) cache policy to automatically manage model eviction when limits are reached.
Configuration
Configure via lemonade config set max_loaded_models=N. See Server Configuration.
Default: 1 (one model of each type). Use -1 for unlimited.
Model Types
Models are categorized into these types:
- LLM - Chat and completion models (default type)
- Embedding - Models for generating text embeddings (identified by the `embeddings` label)
- Reranking - Models for document reranking (identified by the `reranking` label)
- Audio - Models for audio transcription using Whisper (identified by the `audio` label)
- Image - Models for image generation (identified by the `image` label)
Each type has its own independent LRU cache, all sharing the same slot limit set by max_loaded_models.
Device Constraints
- NPU Exclusivity: `flm`, `ryzenai-llm`, and `whispercpp` are mutually exclusive on the NPU. Loading a model from one of these backends will automatically evict all NPU models from the other backends.
  - `flm` supports loading 1 ASR model, 1 LLM, and 1 embedding model on the NPU at the same time.
  - `ryzenai-llm` supports loading exactly 1 LLM, which uses the entire NPU.
  - `whispercpp` supports loading exactly 1 ASR model at a time, which uses the entire NPU.
- CPU/GPU: No inherent limits beyond available RAM. Multiple models can coexist on CPU or GPU.
Eviction Policy
When a model slot is full:

1. The least recently used model of that type is evicted.
2. The new model is loaded.
3. If loading fails (except file-not-found errors), all models are evicted and the load is retried.
Models currently processing inference requests cannot be evicted until they finish.
Per-Model Settings
Each model can be loaded with custom settings (context size, llamacpp backend, llamacpp args) via the /api/v1/load endpoint. These per-model settings override the default values set via CLI arguments or environment variables. See the /api/v1/load endpoint documentation for details.
Setting Priority Order:
1. Values passed explicitly in /api/v1/load request (highest priority)
2. Values from environment variables or server startup arguments (see Server Configuration)
3. Hardcoded defaults in lemond (lowest priority)
Start the HTTP Server
NOTE: This server is intended for use on local systems only. Do not expose the server port to the open internet.
Lemonade Server starts automatically with the OS after installation. See the Getting Started instructions. For server configuration options, see Server Configuration.
OpenAI-Compatible Endpoints
POST /api/v1/chat/completions 
Chat Completions API. You provide a list of messages and receive a completion. This API will also load the model if it is not already loaded.
Parameters
| Parameter | Required | Description |
|---|---|---|
| `messages` | Yes | Array of messages in the conversation. Each message should have a role ("user" or "assistant") and content (the message text). |
| `model` | Yes | The model to use for the completion. |
| `stream` | No | If true, tokens will be sent as they are generated. If false, the response will be sent as a single message once complete. Defaults to false. |
| `stop` | No | Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence. Can be a string or an array of strings. |
| `logprobs` | No | Include log probabilities of the output tokens. If true, returns the log probability of each output token. Defaults to false. |
| `temperature` | No | What sampling temperature to use. |
| `repeat_penalty` | No | Number between 1.0 and 2.0. 1.0 means no penalty. Higher values discourage repetition. |
| `top_k` | No | Integer that controls the number of top tokens to consider during sampling. |
| `top_p` | No | Float between 0.0 and 1.0 that controls the cumulative probability of top tokens to consider during nucleus sampling. |
| `tools` | No | A list of tools the model may call. |
| `max_tokens` | No | An upper bound for the number of tokens that can be generated for a completion. Mutually exclusive with `max_completion_tokens`. This value is deprecated by OpenAI in favor of `max_completion_tokens`. |
| `max_completion_tokens` | No | An upper bound for the number of tokens that can be generated for a completion. Mutually exclusive with `max_tokens`. |
Example request
Invoke-WebRequest `
-Uri "http://localhost:13305/api/v1/chat/completions" `
-Method POST `
-Headers @{ "Content-Type" = "application/json" } `
-Body '{
"model": "Qwen3-0.6B-GGUF",
"messages": [
{
"role": "user",
"content": "What is the population of Paris?"
}
],
"stream": false
}'
curl -X POST http://localhost:13305/api/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3-0.6B-GGUF",
"messages": [
{"role": "user", "content": "What is the population of Paris?"}
],
"stream": false
}'
Image understanding input format (OpenAI-compatible)
To send images to chat/completions, pass a messages[*].content array that mixes text and image_url items. The image can be provided as a base64 data URL (for example, from FileReader.readAsDataURL(...) in web apps).
curl -X POST http://localhost:13305/api/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen2.5-VL-7B-Instruct",
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "What is in this image?"},
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD..."}}
]
}
],
"stream": false
}'
Response format
{
"id": "0",
"object": "chat.completion",
"created": 1742927481,
"model": "Qwen3-0.6B-GGUF",
"choices": [{
"index": 0,
"message": {
"role": "assistant",
"content": "Paris has a population of approximately 2.2 million people in the city proper."
},
"finish_reason": "stop"
}]
}
For streaming responses, the API returns a stream of server-sent events (however, OpenAI recommends using their streaming libraries for parsing streaming responses):
{
"id": "0",
"object": "chat.completion.chunk",
"created": 1742927481,
"model": "Qwen3-0.6B-GGUF",
"choices": [{
"index": 0,
"delta": {
"role": "assistant",
"content": "Paris"
}
}]
}
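The following sketch shows one way to consume the streaming response with the OpenAI Python SDK, which handles the server-sent event parsing; it assumes the example port (13305) and the Qwen3-0.6B-GGUF model.

from openai import OpenAI

# Sketch: assumes Lemonade Server on localhost:13305.
client = OpenAI(base_url="http://localhost:13305/api/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="Qwen3-0.6B-GGUF",
    messages=[{"role": "user", "content": "What is the population of Paris?"}],
    stream=True,
)
for chunk in stream:
    # Each chunk's delta carries newly generated text, when present.
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()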
POST /api/v1/completions 
Text Completions API. You provide a prompt and receive a completion. This API will also load the model if it is not already loaded.
Parameters
| Parameter | Required | Description |
|---|---|---|
| `prompt` | Yes | The prompt to use for the completion. |
| `model` | Yes | The model to use for the completion. |
| `stream` | No | If true, tokens will be sent as they are generated. If false, the response will be sent as a single message once complete. Defaults to false. |
| `stop` | No | Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence. Can be a string or an array of strings. |
| `echo` | No | Echo back the prompt in addition to the completion. Only available in non-streaming mode. |
| `logprobs` | No | Include log probabilities of the output tokens. If true, returns the log probability of each output token. Defaults to false. Only available when stream=False. |
| `temperature` | No | What sampling temperature to use. |
| `repeat_penalty` | No | Number between 1.0 and 2.0. 1.0 means no penalty. Higher values discourage repetition. |
| `top_k` | No | Integer that controls the number of top tokens to consider during sampling. |
| `top_p` | No | Float between 0.0 and 1.0 that controls the cumulative probability of top tokens to consider during nucleus sampling. |
| `max_tokens` | No | An upper bound for the number of tokens that can be generated for a completion, including input tokens. |
Example request
Invoke-WebRequest -Uri "http://localhost:13305/api/v1/completions" `
-Method POST `
-Headers @{ "Content-Type" = "application/json" } `
-Body '{
"model": "Qwen3-0.6B-GGUF",
"prompt": "What is the population of Paris?",
"stream": false
}'
curl -X POST http://localhost:13305/api/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3-0.6B-GGUF",
"prompt": "What is the population of Paris?",
"stream": false
}'
Response format
The following format is used for both streaming and non-streaming responses:
{
"id": "0",
"object": "text_completion",
"created": 1742927481,
"model": "Qwen3-0.6B-GGUF",
"choices": [{
"index": 0,
"text": "Paris has a population of approximately 2.2 million people in the city proper.",
"finish_reason": "stop"
}]
}
POST /api/v1/embeddings 
Embeddings API. You provide input text and receive vector representations (embeddings) that can be used for semantic search, clustering, and similarity comparisons. This API will also load the model if it is not already loaded.
Note: This endpoint is only available for models using the `llamacpp` or `flm` recipes. ONNX models (OGA recipes) do not support embeddings.
Parameters
| Parameter | Required | Description |
|---|---|---|
| `input` | Yes | The input text or array of texts to embed. Can be a string or an array of strings. |
| `model` | Yes | The model to use for generating embeddings. |
| `encoding_format` | No | The format to return embeddings in. Supported values: "float" (default), "base64". |
Example request
Invoke-WebRequest `
-Uri "http://localhost:13305/api/v1/embeddings" `
-Method POST `
-Headers @{ "Content-Type" = "application/json" } `
-Body '{
"model": "nomic-embed-text-v1-GGUF",
"input": ["Hello, world!", "How are you?"],
"encoding_format": "float"
}'
curl -X POST http://localhost:13305/api/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "nomic-embed-text-v1-GGUF",
"input": ["Hello, world!", "How are you?"],
"encoding_format": "float"
}'
Response format
{
"object": "list",
"data": [
{
"object": "embedding",
"index": 0,
"embedding": [0.0234, -0.0567, 0.0891, ...]
},
{
"object": "embedding",
"index": 1,
"embedding": [0.0456, -0.0678, 0.1234, ...]
}
],
"model": "nomic-embed-text-v1-GGUF",
"usage": {
"prompt_tokens": 12,
"total_tokens": 12
}
}
Field Descriptions:
- `object` - Type of response object, always `"list"`
- `data` - Array of embedding objects
  - `object` - Type of embedding object, always `"embedding"`
  - `index` - Index position of the input text in the request
  - `embedding` - Vector representation as an array of floats
- `model` - Model identifier used to generate the embeddings
- `usage` - Token usage statistics
  - `prompt_tokens` - Number of tokens in the input
  - `total_tokens` - Total tokens processed
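The following is a minimal sketch of requesting embeddings through the OpenAI Python SDK and computing a cosine similarity between the two returned vectors; it assumes the example port (13305) and the nomic-embed-text-v1-GGUF model.

import math
from openai import OpenAI

# Sketch: assumes Lemonade Server on localhost:13305.
client = OpenAI(base_url="http://localhost:13305/api/v1", api_key="not-needed")

resp = client.embeddings.create(
    model="nomic-embed-text-v1-GGUF",
    input=["Hello, world!", "How are you?"],
)
a, b = (item.embedding for item in resp.data)
# Cosine similarity between the two embedding vectors.
cosine = sum(x * y for x, y in zip(a, b)) / (
    math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
)
print(f"cosine similarity: {cosine:.3f}")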
POST /api/v1/reranking 
Reranking API. You provide a query and a list of documents, and receive the documents reordered by their relevance to the query with relevance scores. This is useful for improving search results quality. This API will also load the model if it is not already loaded.
Note: This endpoint follows API conventions similar to OpenAI's format but is not part of the official OpenAI API. It is inspired by llama.cpp and other inference server implementations.
Note: This endpoint is only available for models using the `llamacpp` recipe. It is not available for FLM or ONNX models.
Parameters
| Parameter | Required | Description |
|---|---|---|
| `query` | Yes | The search query text. |
| `documents` | Yes | Array of document strings to be reranked. |
| `model` | Yes | The model to use for reranking. |
Example request
Invoke-WebRequest `
-Uri "http://localhost:13305/api/v1/reranking" `
-Method POST `
-Headers @{ "Content-Type" = "application/json" } `
-Body '{
"model": "bge-reranker-v2-m3-GGUF",
"query": "What is the capital of France?",
"documents": [
"Paris is the capital of France.",
"Berlin is the capital of Germany.",
"Madrid is the capital of Spain."
]
}'
curl -X POST http://localhost:13305/api/v1/reranking \
-H "Content-Type: application/json" \
-d '{
"model": "bge-reranker-v2-m3-GGUF",
"query": "What is the capital of France?",
"documents": [
"Paris is the capital of France.",
"Berlin is the capital of Germany.",
"Madrid is the capital of Spain."
]
}'
Response format
{
"model": "bge-reranker-v2-m3-GGUF",
"object": "list",
"results": [
{
"index": 0,
"relevance_score": 8.60673713684082
},
{
"index": 1,
"relevance_score": -5.3886260986328125
},
{
"index": 2,
"relevance_score": -3.555561065673828
}
],
"usage": {
"prompt_tokens": 51,
"total_tokens": 51
}
}
Field Descriptions:
- `model` - Model identifier used for reranking
- `object` - Type of response object, always `"list"`
- `results` - Array of all documents with relevance scores
  - `index` - Original index of the document in the input array
  - `relevance_score` - Relevance score assigned by the model (higher = more relevant)
- `usage` - Token usage statistics
  - `prompt_tokens` - Number of tokens in the input
  - `total_tokens` - Total tokens processed
Note: The results are returned in their original input order, not sorted by relevance score. To get documents ranked by relevance, sort the results by `relevance_score` in descending order on the client side, as in the sketch below.
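The following sketch performs that client-side sort with the requests library; it assumes the example port (13305) and the bge-reranker-v2-m3-GGUF model.

import requests

# Sketch: assumes Lemonade Server on localhost:13305.
documents = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "Madrid is the capital of Spain.",
]
resp = requests.post(
    "http://localhost:13305/api/v1/reranking",
    json={
        "model": "bge-reranker-v2-m3-GGUF",
        "query": "What is the capital of France?",
        "documents": documents,
    },
)
# Results come back in input order; rank them by descending relevance_score.
ranked = sorted(resp.json()["results"], key=lambda r: r["relevance_score"], reverse=True)
for r in ranked:
    print(f"{r['relevance_score']:+.2f}  {documents[r['index']]}")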
POST /api/v1/responses 
Responses API. You provide an input and receive a response. This API will also load the model if it is not already loaded.
Parameters
| Parameter | Required | Description |
|---|---|---|
| `input` | Yes | A list of dictionaries or a string input for the model to respond to. |
| `model` | Yes | The model to use for the response. |
| `max_output_tokens` | No | The maximum number of output tokens to generate. |
| `temperature` | No | What sampling temperature to use. |
| `repeat_penalty` | No | Number between 1.0 and 2.0. 1.0 means no penalty. Higher values discourage repetition. |
| `top_k` | No | Integer that controls the number of top tokens to consider during sampling. |
| `top_p` | No | Float between 0.0 and 1.0 that controls the cumulative probability of top tokens to consider during nucleus sampling. |
| `stream` | No | If true, tokens will be sent as they are generated. If false, the response will be sent as a single message once complete. Defaults to false. |
Streaming Events
The Responses API uses semantic events for streaming. Each event is typed with a predefined schema, so you can listen for events you care about. Our initial implementation only supports:

- `response.created`
- `response.output_text.delta`
- `response.completed`
For a full list of event types, see the API reference for streaming.
Example request
Invoke-WebRequest -Uri "http://localhost:13305/api/v1/responses" `
-Method POST `
-Headers @{ "Content-Type" = "application/json" } `
-Body '{
"model": "Llama-3.2-1B-Instruct-Hybrid",
"input": "What is the population of Paris?",
"stream": false
}'
curl -X POST http://localhost:13305/api/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "Llama-3.2-1B-Instruct-Hybrid",
"input": "What is the population of Paris?",
"stream": false
}'
Response format
{
"id": "0",
"created_at": 1746225832.0,
"model": "Llama-3.2-1B-Instruct-Hybrid",
"object": "response",
"output": [{
"id": "0",
"content": [{
"annotations": [],
"text": "Paris has a population of approximately 2.2 million people in the city proper."
}]
}]
}
For streaming responses, the API returns a series of events. Refer to the OpenAI streaming guide for details.
POST /api/v1/audio/transcriptions 
Audio Transcription API. You provide an audio file and receive a text transcription. This API will also load the model if it is not already loaded.
Note: This endpoint uses whisper.cpp as the backend. Whisper models are automatically downloaded when first used.
Limitations: Only `wav` audio format and `json` response format are currently supported.
Parameters
| Parameter | Required | Description |
|---|---|---|
| `file` | Yes | The audio file to transcribe. Supported formats: wav. |
| `model` | Yes | The Whisper model to use for transcription (e.g., Whisper-Tiny, Whisper-Base, Whisper-Small). |
| `language` | No | The language of the audio (ISO 639-1 code, e.g., en, es, fr). If not specified, Whisper will auto-detect the language. |
| `response_format` | No | The format of the response. Currently only json is supported. |
Example request
curl -X POST http://localhost:13305/api/v1/audio/transcriptions ^
-F "file=@C:\path\to\audio.wav" ^
-F "model=Whisper-Tiny"
curl -X POST http://localhost:13305/api/v1/audio/transcriptions \
-F "file=@/path/to/audio.wav" \
-F "model=Whisper-Tiny"
Response format
{
"text": "Hello, this is a sample transcription of the audio file."
}
Field Descriptions:
- `text` - The transcribed text from the audio file
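The following is a minimal sketch of transcribing a WAV file through the OpenAI Python SDK; it assumes the example port (13305), the Whisper-Tiny model, and a local audio.wav file (a placeholder path).

from openai import OpenAI

# Sketch: assumes Lemonade Server on localhost:13305.
client = OpenAI(base_url="http://localhost:13305/api/v1", api_key="not-needed")

with open("audio.wav", "rb") as audio_file:  # placeholder path to a WAV file
    transcript = client.audio.transcriptions.create(
        model="Whisper-Tiny",
        file=audio_file,
    )
print(transcript.text)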
WS /realtime 
Realtime Audio Transcription API via WebSocket (OpenAI SDK compatible). Stream audio from a microphone and receive transcriptions in real-time with Voice Activity Detection (VAD).
Limitations: Only 16kHz mono PCM16 audio format is supported. Uses the same Whisper models as the HTTP transcription endpoint.
Connection
The WebSocket server runs on a dynamically assigned port. Discover the port via the /api/v1/health endpoint (websocket_port field), then connect with the model name:
ws://localhost:<websocket_port>/realtime?model=Whisper-Tiny
Upon connection, the server sends a session.created message with a session ID.
Client → Server Messages
| Message Type | Description |
|---|---|
| `session.update` | Configure the session (set model, VAD settings) |
| `input_audio_buffer.append` | Send audio data (base64-encoded PCM16) |
| `input_audio_buffer.commit` | Force transcription of buffered audio |
| `input_audio_buffer.clear` | Clear audio buffer without transcribing |
Server → Client Messages
| Message Type | Description |
|---|---|
| `session.created` | Session established, contains session ID |
| `session.updated` | Session configuration updated |
| `input_audio_buffer.speech_started` | VAD detected speech start |
| `input_audio_buffer.speech_stopped` | VAD detected speech end, transcription triggered |
| `input_audio_buffer.committed` | Audio buffer committed for transcription |
| `input_audio_buffer.cleared` | Audio buffer cleared |
| `conversation.item.input_audio_transcription.delta` | Interim/partial transcription (replaceable) |
| `conversation.item.input_audio_transcription.completed` | Final transcription result |
| `error` | Error message |
Example: Configure Session
{
"type": "session.update",
"session": {
"model": "Whisper-Tiny"
}
}
Example: Send Audio
{
"type": "input_audio_buffer.append",
"audio": "<base64-encoded PCM16 audio>"
}
Audio should be:

- 16kHz sample rate
- Mono (single channel)
- 16-bit signed integer (PCM16)
- Base64 encoded
- Sent in chunks (~85ms recommended)
Example: Transcription Result
{
"type": "conversation.item.input_audio_transcription.completed",
"transcript": "Hello, this is a test transcription."
}
VAD Configuration
VAD settings can be configured via session.update:
{
"type": "session.update",
"session": {
"model": "Whisper-Tiny",
"turn_detection": {
"threshold": 0.01,
"silence_duration_ms": 800,
"prefix_padding_ms": 250
}
}
}
| Parameter | Default | Description |
|---|---|---|
| `threshold` | 0.01 | RMS energy threshold for speech detection |
| `silence_duration_ms` | 800 | Silence duration to trigger speech end |
| `prefix_padding_ms` | 250 | Minimum speech duration before triggering |
Code Examples
See the examples/ directory for a complete, runnable example:
- `realtime_transcription.py` - Python CLI for microphone streaming
# Stream from microphone
python examples/realtime_transcription.py --model Whisper-Tiny
Integration Notes
- Audio Format: Server expects 16kHz mono PCM16. Higher sample rates must be downsampled client-side.
- Chunk Size: Send audio in ~85-256ms chunks for optimal latency/efficiency.
- VAD Behavior: Server automatically detects speech boundaries and triggers transcription on speech end.
- Manual Commit: Use `input_audio_buffer.commit` to force transcription (e.g., when the user clicks "stop").
- Clear Buffer: Use `input_audio_buffer.clear` to discard audio without transcribing.
- Chunking: We are still tuning the chunking to balance latency vs. accuracy.
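The following is a minimal Python sketch of the realtime protocol described above: it discovers the WebSocket port via /api/v1/health, appends one buffer of audio, forces a commit, and waits for the completed transcript. It assumes the example port (13305), the Whisper-Tiny model, the third-party websockets package, and a pre-recorded 16kHz mono PCM16 buffer (the audio.pcm path is a placeholder).

import asyncio
import base64
import json

import requests
import websockets

async def transcribe(pcm16_bytes):
    # Sketch: discover the WebSocket port from the health endpoint (assumes localhost:13305).
    ws_port = requests.get("http://localhost:13305/api/v1/health").json()["websocket_port"]
    uri = f"ws://localhost:{ws_port}/realtime?model=Whisper-Tiny"
    async with websockets.connect(uri) as ws:
        print(json.loads(await ws.recv()))  # session.created
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm16_bytes).decode("ascii"),
        }))
        # Force transcription of the buffered audio instead of waiting for VAD.
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
        while True:
            msg = json.loads(await ws.recv())
            if msg["type"] == "conversation.item.input_audio_transcription.completed":
                print(msg["transcript"])
                break

# "audio.pcm" is a placeholder file of raw 16kHz mono PCM16 samples.
asyncio.run(transcribe(open("audio.pcm", "rb").read()))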
Log Streaming API (WebSocket) 
Stream server logs over WebSocket. Clients connect, send a subscribe message, and receive a snapshot of recent log history followed by live log entries as they occur.
Connection
The WebSocket server shares the same port as the Realtime Audio Transcription API. Discover the port via the /api/v1/health endpoint (websocket_port field), then connect:
ws://localhost:<websocket_port>/logs/stream
After connecting, send a logs.subscribe message to start receiving logs.
Client → Server Messages
| Message Type | Description |
|---|---|
| `logs.subscribe` | Subscribe to log stream. Optional `after_seq` field to resume from a specific sequence number. |
Server → Client Messages
| Message Type | Description |
|---|---|
| `logs.snapshot` | Initial batch of retained log entries (up to 5000). Sent once after subscribing. |
| `logs.entry` | A single live log entry. Sent as new log lines are emitted. |
| `error` | Error message (e.g., invalid subscribe request). |
Example: Subscribe to Logs
Subscribe from the beginning (full backlog):
{
"type": "logs.subscribe",
"after_seq": null
}
Resume after a known sequence number (e.g., on reconnect):
{
"type": "logs.subscribe",
"after_seq": 1042
}
Example: Snapshot Response
{
"type": "logs.snapshot",
"entries": [
{
"seq": 1,
"timestamp": "2025-03-30 14:22:01.123",
"severity": "Info",
"tag": "Server",
"line": "2025-03-30 14:22:01.123 [Info] (Server) Starting Lemonade Server..."
}
]
}
Example: Live Entry
{
"type": "logs.entry",
"entry": {
"seq": 1043,
"timestamp": "2025-03-30 14:22:05.456",
"severity": "Info",
"tag": "Router",
"line": "2025-03-30 14:22:05.456 [Info] (Router) Model loaded successfully"
}
}
Log Entry Fields
| Field | Type | Description |
|---|---|---|
| `seq` | integer | Monotonically increasing sequence number. Use for dedup and resume. |
| `timestamp` | string | Formatted timestamp from the log system. |
| `severity` | string | Log level: Trace, Debug, Info, Warning, Error, Fatal. |
| `tag` | string | Log source tag (e.g., Server, Router, component name). |
| `line` | string | The full formatted log line. |
Integration Notes
- Reconnection: Track the last `seq` received and pass it as `after_seq` on reconnect to avoid duplicate entries.
- Backlog: The server retains up to 5000 recent log entries. The snapshot may be smaller if fewer entries exist.
- Platform availability: WebSocket log streaming is available on all platforms (Windows, Linux, and macOS).
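The following is a minimal Python sketch of a log-stream subscriber: it discovers the WebSocket port via /api/v1/health, subscribes, and prints snapshot and live entries while tracking the last seq for reconnects. It assumes the example port (13305) and the third-party websockets package.

import asyncio
import json

import requests
import websockets

async def stream_logs(after_seq=None):
    # Sketch: discover the WebSocket port from the health endpoint (assumes localhost:13305).
    ws_port = requests.get("http://localhost:13305/api/v1/health").json()["websocket_port"]
    async with websockets.connect(f"ws://localhost:{ws_port}/logs/stream") as ws:
        await ws.send(json.dumps({"type": "logs.subscribe", "after_seq": after_seq}))
        last_seq = after_seq
        while True:
            msg = json.loads(await ws.recv())
            if msg["type"] == "logs.snapshot":
                entries = msg["entries"]
            elif msg["type"] == "logs.entry":
                entries = [msg["entry"]]
            else:
                continue  # ignore other message types (e.g., error)
            for entry in entries:
                last_seq = entry["seq"]  # keep for after_seq on reconnect
                print(entry["line"])

asyncio.run(stream_logs())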
POST /api/v1/images/generations 
Image Generation API. You provide a text prompt and receive a generated image. This API uses stable-diffusion.cpp as the backend.
Note: Image generation uses Stable Diffusion models. Available models include `SD-Turbo` (fast, ~4 steps), `SDXL-Turbo`, `SD-1.5`, and `SDXL-Base-1.0`.

Performance: CPU inference takes ~4-5 minutes per image. GPU (Vulkan) is faster but may have compatibility issues with some hardware.
Parameters
| Parameter | Required | Description |
|---|---|---|
| `prompt` | Yes | The text description of the image to generate. |
| `model` | Yes | The Stable Diffusion model to use (e.g., SD-Turbo, SDXL-Turbo). |
| `size` | No | The size of the generated image. Format: WIDTHxHEIGHT (e.g., 512x512, 256x256). Default: 512x512. |
| `n` | No | Number of images to generate. Currently only 1 is supported. |
| `response_format` | No | Format of the response. Only b64_json (base64-encoded image) is supported. |
| `steps` | No | Number of inference steps. SD-Turbo works well with 4 steps. Default varies by model. |
| `cfg_scale` | No | Classifier-free guidance scale. SD-Turbo uses low values (~1.0). Default varies by model. |
| `seed` | No | Random seed for reproducibility. If not specified, a random seed is used. |
Example request
curl -X POST http://localhost:13305/api/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"model": "SD-Turbo",
"prompt": "A serene mountain landscape at sunset",
"size": "512x512",
"steps": 4,
"response_format": "b64_json"
}'
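The following sketch shows the same request from Python, decoding the b64_json payload and saving it to disk; it assumes the example port (13305) and the SD-Turbo model.

import base64
import requests

# Sketch: assumes Lemonade Server on localhost:13305.
resp = requests.post(
    "http://localhost:13305/api/v1/images/generations",
    json={
        "model": "SD-Turbo",
        "prompt": "A serene mountain landscape at sunset",
        "size": "512x512",
        "steps": 4,
        "response_format": "b64_json",
    },
)
# Decode the base64-encoded PNG and write it to a file.
image_b64 = resp.json()["data"][0]["b64_json"]
with open("generated.png", "wb") as f:
    f.write(base64.b64decode(image_b64))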
POST /api/v1/images/edits 
Image Editing API. You provide a source image and a text prompt describing the desired change, and receive an edited image. This API uses stable-diffusion.cpp as the backend.
Note: This endpoint accepts `multipart/form-data` requests (not JSON). Use editing-capable models such as `Flux-2-Klein-4B` or `SD-Turbo`.

Performance: CPU inference takes several minutes per image. GPU (ROCm) is significantly faster.
Parameters
| Parameter | Required | Description |
|---|---|---|
| `model` | Yes | The Stable Diffusion model to use (e.g., Flux-2-Klein-4B, SD-Turbo). |
| `image` | Yes | The source image file to edit (PNG). Sent as a file in multipart/form-data. |
| `prompt` | Yes | A text description of the desired edit. |
| `mask` | No | An optional mask image (PNG). White areas indicate regions to edit; black areas are preserved. |
| `size` | No | The size of the output image. Format: WIDTHxHEIGHT (e.g., 512x512). Default: 512x512. |
| `n` | No | Number of images to generate. Allowed range: 1-10. Default: 1. Values outside this range are rejected with 400 Bad Request. |
| `response_format` | No | Format of the response. Only b64_json (base64-encoded image) is supported. |
| `steps` | No | Number of inference steps. Default varies by model. |
| `cfg_scale` | No | Classifier-free guidance scale. Default varies by model. |
| `seed` | No | Random seed for reproducibility. |
| `user` | No | OpenAI API compatibility field. Accepted but not forwarded to the backend. |
| `background` | No | OpenAI API compatibility field. Accepted but not forwarded to the backend. |
| `quality` | No | OpenAI API compatibility field. Accepted but not forwarded to the backend. |
| `input_fidelity` | No | OpenAI API compatibility field. Accepted but not forwarded to the backend. |
| `output_compression` | No | OpenAI API compatibility field. Accepted; silently ignored by the backend. |
Example request
curl -X POST http://localhost:13305/api/v1/images/edits \
-F "model=Flux-2-Klein-4B" \
-F "prompt=Add a red barn and mountains in the background, photorealistic" \
-F "size=512x512" \
-F "n=1" \
-F "response_format=b64_json" \
-F "image=@/path/to/source_image.png"
import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:13305/api/v1", api_key="not-needed")

with open("source_image.png", "rb") as image_file:
    response = client.images.edit(
        model="Flux-2-Klein-4B",
        image=image_file,
        prompt="Add a red barn and mountains in the background, photorealistic",
        size="512x512",
    )

image_data = base64.b64decode(response.data[0].b64_json)
open("edited_image.png", "wb").write(image_data)
POST /api/v1/images/variations 
Image Variations API. You provide a source image and receive a variation of it. This API uses stable-diffusion.cpp as the backend.
Note: This endpoint accepts `multipart/form-data` requests (not JSON). Unlike `/images/edits`, a `prompt` parameter is not supported and will be ignored; the model generates a variation based solely on the input image.

Performance: CPU inference takes several minutes per image. GPU (ROCm) is significantly faster.
Parameters
| Parameter | Required | Description |
|---|---|---|
| `model` | Yes | The Stable Diffusion model to use (e.g., Flux-2-Klein-4B, SD-Turbo). |
| `image` | Yes | The source image file (PNG). Sent as a file in multipart/form-data. |
| `size` | No | The size of the output image. Format: WIDTHxHEIGHT (e.g., 512x512). Default: 512x512. |
| `n` | No | Number of variations to generate. Integer between 1 and 10 inclusive. Default: 1. Values outside this range result in a 400 Bad Request error. |
| `response_format` | No | Format of the response. Only b64_json (base64-encoded image) is supported. |
| `user` | No | OpenAI API compatibility field. Accepted but not forwarded to the backend. |
Example request
curl -X POST http://localhost:13305/api/v1/images/variations \
-F "model=Flux-2-Klein-4B" \
-F "size=512x512" \
-F "n=1" \
-F "response_format=b64_json" \
-F "image=@/path/to/source_image.png"
import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:13305/api/v1", api_key="not-needed")

with open("source_image.png", "rb") as image_file:
    response = client.images.create_variation(
        model="Flux-2-Klein-4B",
        image=image_file,
        size="512x512",
        n=1,
    )

image_data = base64.b64decode(response.data[0].b64_json)
open("variation.png", "wb").write(image_data)
POST /api/v1/images/upscale 
Image Upscaling API. You provide a base64-encoded image and a Real-ESRGAN model name, and receive a 4x upscaled image. This API uses the sd-cli binary from stable-diffusion.cpp to perform super-resolution.
Note: Available upscale models are `RealESRGAN-x4plus` (general-purpose, 64 MB) and `RealESRGAN-x4plus-anime` (optimized for anime-style art, 17 MB). Both produce a 4x resolution increase (e.g., 256x256 → 1024x1024).

Note: Unlike `/images/edits` and `/images/variations`, this endpoint accepts a JSON body (not multipart/form-data). The image must be provided as a base64-encoded string.
Parameters
| Parameter | Required | Description |
|---|---|---|
| `image` | Yes | Base64-encoded PNG image to upscale. |
| `model` | Yes | The ESRGAN model to use (e.g., RealESRGAN-x4plus, RealESRGAN-x4plus-anime). |
Example request
A typical workflow is to generate an image first, then upscale it:
# Step 1: Generate an image and save the base64 response
RESPONSE=$(curl -s -X POST http://localhost:13305/api/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"model": "SD-Turbo",
"prompt": "A serene mountain landscape at sunset",
"size": "512x512",
"steps": 4,
"response_format": "b64_json"
}')
# Step 2: Build the upscale JSON payload and pipe it to curl via stdin
# (base64 images are too large for command-line interpolation)
echo "$RESPONSE" | python3 -c "
import sys, json
b64 = json.load(sys.stdin)['data'][0]['b64_json']
print(json.dumps({'image': b64, 'model': 'RealESRGAN-x4plus'}))
" | curl -X POST http://localhost:13305/api/v1/images/upscale \
-H "Content-Type: application/json" \
-d @-
# Step 1: Generate an image
$genResponse = Invoke-WebRequest `
-Uri "http://localhost:13305/api/v1/images/generations" `
-Method POST `
-Headers @{ "Content-Type" = "application/json" } `
-Body '{
"model": "SD-Turbo",
"prompt": "A serene mountain landscape at sunset",
"size": "512x512",
"steps": 4,
"response_format": "b64_json"
}'
# Step 2: Extract the base64 image
$imageB64 = ($genResponse.Content | ConvertFrom-Json).data[0].b64_json
# Step 3: Upscale the image with Real-ESRGAN
$body = @{ image = $imageB64; model = "RealESRGAN-x4plus" } | ConvertTo-Json
Invoke-WebRequest `
-Uri "http://localhost:13305/api/v1/images/upscale" `
-Method POST `
-Headers @{ "Content-Type" = "application/json" } `
-Body $body
import requests
import base64
BASE_URL = "http://localhost:13305/api/v1"
# Step 1: Generate an image
gen_response = requests.post(f"{BASE_URL}/images/generations", json={
"model": "SD-Turbo",
"prompt": "A serene mountain landscape at sunset",
"size": "512x512",
"steps": 4,
"response_format": "b64_json",
})
image_b64 = gen_response.json()["data"][0]["b64_json"]
# Step 2: Upscale the image with Real-ESRGAN (512x512 -> 2048x2048)
upscale_response = requests.post(f"{BASE_URL}/images/upscale", json={
"image": image_b64,
"model": "RealESRGAN-x4plus",
})
# Step 3: Save the upscaled image to a file
upscaled_b64 = upscale_response.json()["data"][0]["b64_json"]
with open("upscaled.png", "wb") as f:
f.write(base64.b64decode(upscaled_b64))
Response format
{
"created": 1742927481,
"data": [
{
"b64_json": "<base64-encoded upscaled PNG>"
}
]
}
Field Descriptions:
- `created` - Unix timestamp of when the upscaled image was generated
- `data` - Array containing the upscaled image
  - `b64_json` - Base64-encoded PNG of the upscaled image
Error responses
| Status Code | Condition | Example |
|---|---|---|
| 400 | Missing image field | `{"error": {"message": "Missing 'image' field (base64 encoded)", "type": "invalid_request_error"}}` |
| 400 | Missing model field | `{"error": {"message": "Missing 'model' field", "type": "invalid_request_error"}}` |
| 404 | Unknown model name | `{"error": {"message": "Upscale model not found: bad-model", "type": "invalid_request_error"}}` |
| 500 | Upscale failed | `{"error": {"message": "ESRGAN upscale failed", "type": "server_error"}}` |
POST /api/v1/audio/speech 
Speech Generation API. You provide a text input and receive an audio file. This API uses Kokoros as the backend.
Note: The model to use is called `kokoro-v1`. No other model is supported at the moment.

Limitations: Only `mp3`, `wav`, `opus`, and `pcm` are supported. Streaming is supported in `audio` (pcm) mode.
Parameters
| Parameter | Required | Description |
|---|---|---|
| `input` | Yes | The text to speak. |
| `model` | Yes | The model to use (e.g., kokoro-v1). |
| `speed` | No | Speaking speed. Default: 1.0. |
| `voice` | No | The voice to use. All OpenAI-defined voices can be used (alloy, ash, ...), as well as those defined by the kokoro model (af_sky, am_echo, ...). Default: shimmer |
| `response_format` | No | Format of the response. mp3, wav, opus, and pcm are supported. Default: mp3 |
| `stream_format` | No | If set, the response will be streamed. Only audio is supported, which will output pcm audio. Default: not set |
Example request
curl -X POST http://localhost:13305/api/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"model": "kokoro-v1",
"input": "Lemonade can speak!",
"speed": 1.0,
"steps": 4,
"response_format": "mp3"
}'
Response format
The generated audio file is returned as-is.
GET /api/v1/models 
Returns a list of models available on the server in an OpenAI-compatible format. Each model object includes extended fields like checkpoint, recipe, size, downloaded, and labels.
By default, only models available locally (downloaded) are shown, matching OpenAI API behavior.
Parameters
| Parameter | Required | Description |
|---|---|---|
| `show_all` | No | If set to true, returns all models from the catalog including those not yet downloaded. Defaults to false. |
Example request
# Show only downloaded models (OpenAI-compatible)
curl http://localhost:13305/api/v1/models
# Show all models including not-yet-downloaded (extended usage)
curl http://localhost:13305/api/v1/models?show_all=true
Response format
{
"object": "list",
"data": [
{
"id": "Qwen3-0.6B-GGUF",
"created": 1744173590,
"object": "model",
"owned_by": "lemonade",
"checkpoint": "unsloth/Qwen3-0.6B-GGUF:Q4_0",
"recipe": "llamacpp",
"size": 0.38,
"downloaded": true,
"suggested": true,
"labels": ["reasoning"]
},
{
"id": "Gemma-3-4b-it-GGUF",
"created": 1744173590,
"object": "model",
"owned_by": "lemonade",
"checkpoint": "ggml-org/gemma-3-4b-it-GGUF:Q4_K_M",
"recipe": "llamacpp",
"size": 3.61,
"downloaded": true,
"suggested": true,
"labels": ["hot", "vision"]
},
{
"id": "SD-Turbo",
"created": 1744173590,
"object": "model",
"owned_by": "lemonade",
"checkpoint": "stabilityai/sd-turbo:sd_turbo.safetensors",
"recipe": "sd-cpp",
"size": 5.2,
"downloaded": true,
"suggested": true,
"labels": ["image"],
"image_defaults": {
"steps": 4,
"cfg_scale": 1.0,
"width": 512,
"height": 512
}
}
]
}
Field Descriptions:
- `object` - Type of response object, always `"list"`
- `data` - Array of model objects with the following fields:
  - `id` - Model identifier (used for loading and inference requests)
  - `created` - Unix timestamp of when the model entry was created
  - `object` - Type of object, always `"model"`
  - `owned_by` - Owner of the model, always `"lemonade"`
  - `checkpoint` - Full checkpoint identifier on Hugging Face
  - `recipe` - Backend/device recipe used to load the model (e.g., `"ryzenai-llm"`, `"llamacpp"`, `"flm"`)
  - `size` - Model size in GB (omitted for models without size information)
  - `downloaded` - Boolean indicating if the model is downloaded and available locally
  - `suggested` - Boolean indicating if the model is recommended for general use
  - `labels` - Array of tags describing the model (e.g., `"hot"`, `"reasoning"`, `"vision"`, `"embeddings"`, `"reranking"`, `"coding"`, `"tool-calling"`, `"image"`)
  - `image_defaults` - (Image models only) Default generation parameters for the model:
    - `steps` - Number of inference steps (e.g., 4 for turbo models, 20 for standard models)
    - `cfg_scale` - Classifier-free guidance scale (e.g., 1.0 for turbo models, 7.5 for standard models)
    - `width` - Default image width in pixels
    - `height` - Default image height in pixels
GET /api/v1/models/{model_id} 
Retrieve a specific model by its ID. Returns the same model object format as the list endpoint above.
Parameters
| Parameter | Required | Description |
|---|---|---|
| `model_id` | Yes | The ID of the model to retrieve. Must match one of the model IDs from the models list. |
Example request
curl http://localhost:13305/api/v1/models/Qwen3-0.6B-GGUF
Response format
Returns a single model object with the same fields as described in the models list endpoint above.
{
"id": "Qwen3-0.6B-GGUF",
"created": 1744173590,
"object": "model",
"owned_by": "lemonade",
"checkpoint": "unsloth/Qwen3-0.6B-GGUF:Q4_0",
"recipe": "llamacpp",
"size": 0.38,
"downloaded": true,
"suggested": true,
"labels": ["reasoning"],
"recipe_options" {
"ctx_size": 8192,
"llamacpp_args": "--no-mmap",
"llamacpp_backend": "rocm"
}
}
Error responses
If the model is not found, the endpoint returns a 404 error:
{
"error": {
"message": "Model Qwen3-0.6B-GGUF has not been found",
"type": "not_found"
}
}
Additional Endpoints
POST /api/v1/pull 
Register and install models for use with Lemonade Server.
Parameters
The Lemonade Server built-in model registry has a collection of model names that can be pulled and loaded. The pull endpoint can install any registered model, and it can also register-then-install any model available on Hugging Face.
Common Parameters
| Parameter | Required | Description |
|---|---|---|
| `stream` | No | If true, returns Server-Sent Events (SSE) with download progress. Defaults to false. |
Install a Model that is Already Registered
| Parameter | Required | Description |
|---|---|---|
| `model_name` | Yes | Lemonade Server model name to install. |
Example request:
curl -X POST http://localhost:13305/api/v1/pull \
-H "Content-Type: application/json" \
-d '{
"model_name": "Qwen2.5-0.5B-Instruct-CPU"
}'
Response format:
{
"status":"success",
"message":"Installed model: Qwen2.5-0.5B-Instruct-CPU"
}
In case of an error, the status will be error and the message will contain the error message.
Register and Install a Model
Registration will place an entry for that model in the user_models.json file, which is located in the user's Lemonade cache (default: ~/.cache/lemonade). Then, the model will be installed. Once the model is registered and installed, it will show up in the models endpoint alongside the built-in models and can be loaded.
The recipe field defines which software framework and device will be used to load and run the model. For more information on OGA and Hugging Face recipes, see the Lemonade API README. For information on GGUF recipes, see llamacpp.
Note: the `model_name` for registering a new model must use the `user` namespace to prevent collisions with built-in models. For example, `user.Phi-4-Mini-GGUF`.
| Parameter | Required | Description |
|---|---|---|
| `model_name` | Yes | Namespaced Lemonade Server model name to register and install. |
| `checkpoint` | Yes | HuggingFace checkpoint to install. |
| `recipe` | Yes | Lemonade API recipe to load the model with. |
| `reasoning` | No | Whether the model is a reasoning model, like DeepSeek (default: false). Adds 'reasoning' label. |
| `vision` | No | Whether the model has vision capabilities for processing images (default: false). Adds 'vision' label. |
| `embedding` | No | Whether the model is an embedding model (default: false). Adds 'embeddings' label. |
| `reranking` | No | Whether the model is a reranking model (default: false). Adds 'reranking' label. |
| `mmproj` | No | Multimodal Projector (mmproj) file to use for vision models. |
Example request:
curl -X POST http://localhost:13305/api/v1/pull \
-H "Content-Type: application/json" \
-d '{
"model_name": "user.Phi-4-Mini-GGUF",
"checkpoint": "unsloth/Phi-4-mini-instruct-GGUF:Q4_K_M",
"recipe": "llamacpp"
}'
Response format:
{
"status":"success",
"message":"Installed model: user.Phi-4-Mini-GGUF"
}
In case of an error, the status will be error and the message will contain the error message.
Streaming Response (stream=true)
When stream=true, the endpoint returns Server-Sent Events with real-time download progress:
event: progress
data: {"file":"model.gguf","file_index":1,"total_files":2,"bytes_downloaded":1073741824,"bytes_total":2684354560,"percent":40}
event: progress
data: {"file":"config.json","file_index":2,"total_files":2,"bytes_downloaded":1024,"bytes_total":1024,"percent":100}
event: complete
data: {"file_index":2,"total_files":2,"percent":100}
Event Types:
| Event | Description |
|---|---|
| `progress` | Sent during download with current file and byte progress |
| `complete` | Sent when all files are downloaded successfully |
| `error` | Sent if download fails, with error field containing the message |
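The following is a minimal Python sketch that consumes the streaming pull response, parsing the Server-Sent Events by hand with the requests library; it assumes the example port (13305) and the Qwen2.5-0.5B-Instruct-CPU model name used above.

import json
import requests

# Sketch: assumes Lemonade Server on localhost:13305.
with requests.post(
    "http://localhost:13305/api/v1/pull",
    json={"model_name": "Qwen2.5-0.5B-Instruct-CPU", "stream": True},
    stream=True,
) as resp:
    event = None
    for line in resp.iter_lines(decode_unicode=True):
        if line.startswith("event:"):
            event = line.split(":", 1)[1].strip()
        elif line.startswith("data:"):
            data = json.loads(line.split(":", 1)[1].strip())
            if event == "progress":
                print(f"{data['file']}: {data['percent']}%")
            elif event == "complete":
                print("Download complete")
            elif event == "error":
                print(f"Error: {data.get('error')}")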
GET /api/v1/pull/variants 
Inspect a Hugging Face GGUF repository and enumerate the variants (quantizations and sharded folder groups) available for installation. Used by the lemonade pull <owner/repo> CLI flow and by the desktop app's model search to auto-populate the install form. The endpoint reads only public Hugging Face metadata; if the HF_TOKEN environment variable is set on the server, it is forwarded as a bearer token to access gated repositories.
Parameters
| Parameter | Required | Description |
|---|---|---|
| `checkpoint` | Yes | Hugging Face repo id, e.g. unsloth/Qwen3-8B-GGUF. Passed as a query string. |
Example request:
curl 'http://localhost:13305/api/v1/pull/variants?checkpoint=unsloth/Qwen3-8B-GGUF'
Response
{
"checkpoint": "unsloth/Qwen3-8B-GGUF",
"recipe": "llamacpp",
"suggested_name": "Qwen3-8B-GGUF",
"suggested_labels": ["vision"],
"mmproj_files": ["mmproj-model-f16.gguf"],
"variants": [
{
"name": "Q4_K_M",
"primary_file": "Qwen3-8B-Q4_K_M.gguf",
"files": ["Qwen3-8B-Q4_K_M.gguf"],
"sharded": false,
"size_bytes": 4920000000
},
{
"name": "Q8_0",
"primary_file": "Q8_0/Qwen3-8B-Q8_0-00001-of-00002.gguf",
"files": ["Q8_0/Qwen3-8B-Q8_0-00001-of-00002.gguf", "Q8_0/Qwen3-8B-Q8_0-00002-of-00002.gguf"],
"sharded": true,
"size_bytes": 8500000000
}
]
}
| Field | Description |
|---|---|
| `checkpoint` | Echoed input. |
| `recipe` | Suggested recipe (always llamacpp today; future expansion may return other values). |
| `suggested_name` | Repo id stripped of the owner/ prefix; suitable for use as the user.&lt;name&gt; model name. |
| `suggested_labels` | Inferred labels: vision if any mmproj-*.gguf files exist, plus embeddings/reranking if those substrings appear in the repo id. |
| `mmproj_files` | Bare filenames of mmproj-*.gguf files in the repo; the first one should be passed as mmproj to /api/v1/pull for vision models. |
| `variants[]` | Top quantizations for the repo, capped at 5. Each entry has name (e.g. Q4_K_M, UD-Q4_K_XL), primary_file, files, sharded, and size_bytes (from the HF ?blobs=true listing). Ranked by frequency of use in server_models.json (Q4_K_M, UD-Q4_K_XL, Q8_0, Q4_0 first, everything else sorted lexicographically). The CLI lemonade pull menu adds a free-text "Other" option for quants outside the top 5. |
Error responses
| Status | Cause |
|---|---|
| 400 | checkpoint query parameter missing or malformed (must contain /). |
| 404 | Hugging Face returned 404 for the checkpoint. |
| 500 | Other transport or parsing failures; the response body contains an error message. |
POST /api/v1/delete 
Delete a model by removing it from local storage. If the model is currently loaded, it will be unloaded first.
Parameters
| Parameter | Required | Description |
|---|---|---|
| `model_name` | Yes | Lemonade Server model name to delete. |
Example request:
curl -X POST http://localhost:13305/api/v1/delete \
-H "Content-Type: application/json" \
-d '{
"model_name": "Qwen2.5-0.5B-Instruct-CPU"
}'
Response format:
{
"status":"success",
"message":"Deleted model: Qwen2.5-0.5B-Instruct-CPU"
}
In case of an error, the status will be error and the message will contain the error message.
POST /api/v1/load 
Explicitly load a registered model into memory. This is useful to ensure that the model is loaded before you make a request. Installs the model if necessary.
Parameters
| Parameter | Required | Applies to | Description |
|---|---|---|---|
| `model_name` | Yes | All | Lemonade Server model name to load. |
| `save_options` | No | All | Boolean. If true, saves recipe options to recipe_options.json. Any previously stored value for model_name is replaced. |
| `ctx_size` | No | llamacpp, flm, ryzenai-llm | Context size for the model. Overrides the default value. |
| `llamacpp_backend` | No | llamacpp | LlamaCpp backend to use (vulkan, rocm, metal or cpu). |
| `llamacpp_args` | No | llamacpp | Custom arguments to pass to llama-server. The following are NOT allowed: -m, --port, --ctx-size, -ngl, --jinja, --mmproj, --embeddings, --reranking. |
| `whispercpp_backend` | No | whispercpp | WhisperCpp backend: npu or cpu on Windows; cpu or vulkan on Linux. Default is npu if supported. |
| `whispercpp_args` | No | whispercpp | Custom arguments to pass to whisper-server. The following are NOT allowed: -m, --model, --port. Example: --convert. |
| `steps` | No | sd-cpp | Number of inference steps for image generation. Default: 20. |
| `cfg_scale` | No | sd-cpp | Classifier-free guidance scale for image generation. Default: 7.0. |
| `width` | No | sd-cpp | Image width in pixels. Default: 512. |
| `height` | No | sd-cpp | Image height in pixels. Default: 512. |
Setting Priority:
When loading a model, settings are applied in this priority order:
1. Values explicitly passed in the load request (highest priority)
2. Per-model values configurable in recipe_options.json (see below for details)
3. Values from environment variables or server startup arguments (see Server Configuration)
4. Default hardcoded values in lemond (lowest priority)
Per-model options
You can configure recipe-specific options on a per-model basis. Lemonade manages a file called recipe_options.json in the user's Lemonade cache (default: ~/.cache/lemonade). The available options depend on the model's recipe:
{
"user.Qwen2.5-Coder-1.5B-Instruct": {
"ctx_size": 16384,
"llamacpp_backend": "vulkan",
"llamacpp_args": "-np 2 -kvu"
},
"Qwen3-Coder-30B-A3B-Instruct-GGUF" : {
"llamacpp_backend": "rocm"
},
"whisper-large-v3-turbo-q8_0.bin": {
"whispercpp_backend": "npu",
"whispercpp_args": "--convert"
}
}
Note that model names include any applicable prefix, such as `user.` and `extra.`.
Example requests
Basic load:
curl -X POST http://localhost:13305/api/v1/load \
-H "Content-Type: application/json" \
-d '{
"model_name": "Qwen2.5-0.5B-Instruct-CPU"
}'
Load with custom settings:
curl -X POST http://localhost:13305/api/v1/load \
-H "Content-Type: application/json" \
-d '{
"model_name": "Qwen3-0.6B-GGUF",
"ctx_size": 8192,
"llamacpp_backend": "rocm",
"llamacpp_args": "--flash-attn on --no-mmap"
}'
Load and save settings:
curl -X POST http://localhost:13305/api/v1/load \
-H "Content-Type: application/json" \
-d '{
"model_name": "Qwen3-0.6B-GGUF",
"ctx_size": 8192,
"llamacpp_backend": "vulkan",
"llamacpp_args": "--no-context-shift --no-mmap",
"save_options": true
}'
Load a Whisper model with NPU backend and conversion enabled:
curl -X POST http://localhost:13305/api/v1/load \
-H "Content-Type: application/json" \
-d '{
"model_name": "whisper-large-v3-turbo-q8_0.bin",
"whispercpp_backend": "npu",
"whispercpp_args": "--convert"
}'
Load an image generation model with custom settings:
curl -X POST http://localhost:13305/api/v1/load \
-H "Content-Type: application/json" \
-d '{
"model_name": "sd-turbo",
"steps": 4,
"cfg_scale": 1.0,
"width": 512,
"height": 512
}'
Response format
{
"status":"success",
"message":"Loaded model: Qwen2.5-0.5B-Instruct-CPU"
}
In case of an error, the status will be error and the message will contain the error message.
POST /api/v1/unload 
Explicitly unload a model from memory. This is useful to free up memory while still leaving the server process running (which takes minimal resources but a few seconds to start).
Parameters
| Parameter | Required | Description |
|---|---|---|
| `model_name` | No | Name of the specific model to unload. If not provided, all loaded models will be unloaded. |
Example requests
Unload a specific model:
curl -X POST http://localhost:13305/api/v1/unload \
-H "Content-Type: application/json" \
-d '{"model_name": "Qwen3-0.6B-GGUF"}'
Unload all models:
curl -X POST http://localhost:13305/api/v1/unload
Response format
Success response:
{
"status": "success",
"message": "Model unloaded successfully"
}
Error response (model not found):
{
"status": "error",
"message": "Model not found: Qwen3-0.6B-GGUF"
}
In case of an error, the status will be error and the message will contain the error message.
GET /api/v1/health 
Check the health of the server. This endpoint returns information about loaded models.
Parameters
This endpoint does not take any parameters.
Example request
curl http://localhost:13305/api/v1/health
Response format
{
"status": "ok",
"version":"9.3.3",
"websocket_port":9000,
"model_loaded": "Llama-3.2-1B-Instruct-Hybrid",
"all_models_loaded": [
{
"model_name": "Llama-3.2-1B-Instruct-Hybrid",
"checkpoint": "amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid",
"last_use": 1732123456.789,
"type": "llm",
"device": "gpu npu",
"recipe": "ryzenai-llm",
"recipe_options": {
"ctx_size": 4096
},
"backend_url": "http://127.0.0.1:8001/v1"
},
{
"model_name": "nomic-embed-text-v1-GGUF",
"checkpoint": "nomic-ai/nomic-embed-text-v1-GGUF:Q4_K_S",
"last_use": 1732123450.123,
"type": "embedding",
"device": "gpu",
"recipe": "llamacpp",
"recipe_options": {
"ctx_size": 8192,
"llamacpp_args": "--no-mmap",
"llamacpp_backend": "rocm"
},
"backend_url": "http://127.0.0.1:8002/v1"
}
],
"max_models": {
"audio":1,
"embedding":1,
"image":1,
"llm":1,
"reranking":1,
"tts":1
}
}
Field Descriptions:
- `status` - Server health status, always `"ok"`
- `version` - Version number of Lemonade Server
- `model_loaded` - Model name of the most recently accessed model
- `all_models_loaded` - Array of all currently loaded models with details:
  - `model_name` - Name of the loaded model
  - `checkpoint` - Full checkpoint identifier
  - `last_use` - Unix timestamp of last access (load or inference)
  - `type` - Model type: `"llm"`, `"embedding"`, or `"reranking"`
  - `device` - Space-separated device list: `"cpu"`, `"gpu"`, `"npu"`, or combinations like `"gpu npu"`
  - `backend_url` - URL of the backend server process handling this model (useful for debugging)
  - `recipe` - Backend/device recipe used to load the model (e.g., `"ryzenai-llm"`, `"llamacpp"`, `"flm"`)
  - `recipe_options` - Options used to load the model (e.g., `"ctx_size"`, `"llamacpp_backend"`, `"llamacpp_args"`, `"whispercpp_args"`)
- `max_models` - Maximum number of models that can be loaded simultaneously per type (set via `max_loaded_models` in Server Configuration):
  - `llm` - Maximum LLM/chat models
  - `embedding` - Maximum embedding models
  - `reranking` - Maximum reranking models
  - `audio` - Maximum speech-to-text models
  - `image` - Maximum image models
  - `tts` - Maximum text-to-speech models
- `websocket_port` - (optional) Port of the WebSocket server for the Realtime Audio Transcription API and Log Streaming API. Only present when the WebSocket server is running. The port is OS-assigned or set via `--websocket-port`.
GET /api/v1/stats 
Performance statistics from the last request.
Parameters
This endpoint does not take any parameters.
Example request
curl http://localhost:13305/api/v1/stats
Response format
{
"time_to_first_token": 2.14,
"tokens_per_second": 33.33,
"input_tokens": 128,
"output_tokens": 5,
"decode_token_times": [0.01, 0.02, 0.03, 0.04, 0.05],
"prompt_tokens": 9
}
Field Descriptions:
- `time_to_first_token` - Time in seconds until the first token was generated
- `tokens_per_second` - Generation speed in tokens per second
- `input_tokens` - Number of tokens processed
- `output_tokens` - Number of tokens generated
- `decode_token_times` - Array of time taken for each generated token
- `prompt_tokens` - Total prompt tokens including cached tokens
GET /api/v1/system-info 
System information endpoint that provides complete hardware details and device enumeration.
Example request
curl "http://localhost:13305/api/v1/system-info"
Response format
{
"OS Version": "Windows-10-10.0.26100-SP0",
"Processor": "AMD Ryzen AI 9 HX 375 w/ Radeon 890M",
"Physical Memory": "32.0 GB",
"OEM System": "ASUS Zenbook S 16",
"BIOS Version": "1.0.0",
"CPU Max Clock": "5100 MHz",
"Windows Power Setting": "Balanced",
"devices": {
"cpu": {
"name": "AMD Ryzen AI 9 HX 375 w/ Radeon 890M",
"cores": 12,
"threads": 24,
"available": true,
"family": "x86_64"
},
"amd_igpu": {
"name": "AMD Radeon(TM) 890M Graphics",
"vram_gb": 0.5,
"available": true,
"family": "gfx1150"
},
"amd_dgpu": [],
"amd_npu": {
"name": "AMD Ryzen AI 9 HX 375 w/ Radeon 890M",
"power_mode": "Default",
"available": true,
"family": "XDNA2"
}
},
"recipes": {
"llamacpp": {
"default_backend": "vulkan",
"backends": {
"vulkan": {
"devices": ["cpu", "amd_igpu"],
"state": "installed",
"message": "",
"action": "",
"version": "b7869"
},
"rocm": {
"devices": ["amd_igpu"],
"state": "installable",
"message": "Backend is supported but not installed.",
"action": "lemonade backends install llamacpp:rocm"
},
"metal": {
"devices": [],
"state": "unsupported",
"message": "Requires macOS",
"action": ""
},
"cpu": {
"devices": ["cpu"],
"state": "update_required",
"message": "Backend update is required before use.",
"action": "lemonade backends install llamacpp:cpu"
}
}
},
"whispercpp": {
"default_backend": "default",
"backends": {
"default": {
"devices": ["cpu"],
"state": "installable",
"message": "Backend is supported but not installed.",
"action": "lemonade backends install whispercpp:default"
}
}
},
"sd-cpp": {
"default_backend": "default",
"backends": {
"default": {
"devices": ["cpu"],
"state": "installable",
"message": "Backend is supported but not installed.",
"action": "lemonade backends install sd-cpp:default"
}
}
},
"flm": {
"default_backend": "default",
"backends": {
"default": {
"devices": ["amd_npu"],
"state": "installed",
"message": "",
"action": "",
"version": "1.2.0"
}
}
},
"ryzenai-llm": {
"default_backend": "default",
"backends": {
"default": {
"devices": ["amd_npu"],
"state": "installed",
"message": "",
"action": ""
}
}
}
}
}
Field Descriptions:
- System fields:
  - `OS Version` - Operating system name and version
  - `Processor` - CPU model name
  - `Physical Memory` - Total RAM
  - `OEM System` - System/laptop model name (Windows only)
  - `BIOS Version` - BIOS information (Windows only)
  - `CPU Max Clock` - Maximum CPU clock speed (Windows only)
  - `Windows Power Setting` - Current power plan (Windows only)
- `devices` - Hardware devices detected on the system (no software/support information)
  - `cpu` - CPU information (name, cores, threads)
  - `amd_igpu` - AMD integrated GPU (if present)
  - `amd_dgpu` - Array of AMD discrete GPUs (if present)
  - `nvidia_dgpu` - Array of NVIDIA discrete GPUs (if present)
  - `amd_npu` - AMD NPU device (if present)
- `recipes` - Software recipes and their backend support status
  - Each recipe (e.g., `llamacpp`, `whispercpp`, `flm`) contains:
    - `default_backend` - Preferred backend selected by server policy for this system (present when at least one backend is not `unsupported`)
    - `backends` - Available backends for this recipe
      - Each backend contains:
        - `devices` - List of devices on this system that support this backend (empty if not supported)
        - `state` - Backend lifecycle state: `unsupported`, `installable`, `update_required`, or `installed`
        - `message` - Human-readable status text for GUI and CLI users. Required for `unsupported`, `installable`, and `update_required`; empty for `installed`.
        - `action` - Actionable user instruction string. For install/update cases this is typically an exact CLI command; for other states it may be empty or another actionable value (for example, a URL).
        - `version` - Installed or configured backend version (when available)
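For example, a setup wizard might read the `recipes` section to show which backends are ready to use and which still need installation. A minimal sketch, assuming the default server address from the examples above:

```python
import requests

# Minimal sketch: enumerate backend support from /api/v1/system-info and
# surface the suggested install/update commands for each recipe.
info = requests.get("http://localhost:13305/api/v1/system-info", timeout=5).json()

for recipe, details in info.get("recipes", {}).items():
    default = details.get("default_backend")
    for name, backend in details.get("backends", {}).items():
        state = backend["state"]
        marker = " (default)" if name == default else ""
        line = f"{recipe}:{name}{marker} -> {state}"
        if state in ("installable", "update_required") and backend.get("action"):
            line += f" | run: {backend['action']}"
        print(line)
```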
POST /api/v1/install 
Install or update a backend for a specific recipe/backend pair. If the backend is already installed but outdated, this endpoint updates it to the configured version.
Parameters
| Parameter | Required | Description |
|---|---|---|
| `recipe` | Yes | Recipe name (for example, `llamacpp`, `flm`, `whispercpp`, `sd-cpp`, `ryzenai-llm`) |
| `backend` | Yes | Backend name within the recipe (for example, `vulkan`, `rocm`, `cpu`, `default`) |
| `stream` | No | If `true`, returns Server-Sent Events with progress. Defaults to `false`. |
| `force` | No | If `true`, bypasses hardware filtering for unsupported backends and attempts installation anyway. Defaults to `false`. |
Example request
curl -X POST http://localhost:13305/api/v1/install \
-H "Content-Type: application/json" \
-d '{
"recipe": "llamacpp",
"backend": "vulkan",
"stream": false
}'
Response format
{
"status":"success",
"recipe":"llamacpp",
"backend":"vulkan"
}
In case of an error, returns an error field with details.
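For example, a setup flow might install the Vulkan llama.cpp backend and report the outcome. A minimal non-streaming sketch (the exact shape of the streamed progress events is not documented here, so `stream` is left at its default), assuming the default server address from the examples:

```python
import requests

# Minimal sketch: install the llamacpp Vulkan backend via the REST API.
# No timeout is set because installation may take several minutes.
payload = {"recipe": "llamacpp", "backend": "vulkan", "stream": False}
resp = requests.post("http://localhost:13305/api/v1/install", json=payload)
result = resp.json()

if result.get("status") == "success":
    print(f"Installed {result['recipe']}:{result['backend']}")
else:
    # On failure the response carries an "error" field with details.
    print(f"Install failed: {result.get('error')}")
```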
POST /api/v1/uninstall 
Uninstall a backend for a specific recipe/backend pair. If loaded models are using that backend, they are unloaded first.
Parameters
| Parameter | Required | Description |
|---|---|---|
| `recipe` | Yes | Recipe name |
| `backend` | Yes | Backend name |
Example request
curl -X POST http://localhost:13305/api/v1/uninstall \
-H "Content-Type: application/json" \
-d '{
"recipe": "llamacpp",
"backend": "vulkan"
}'
Response format
{
"status":"success",
"recipe":"llamacpp",
"backend":"vulkan"
}
In case of an error, returns an error field with details.
Debugging
To control logging verbosity, use `lemonade config set log_level=debug` (see Server Configuration).
Available levels:
- critical: Only critical errors that prevent server operation.
- error: Error conditions that might allow continued operation.
- warning: Warning conditions that should be addressed.
- info: (Default) General informational messages about server operation.
- debug: Detailed diagnostic information for troubleshooting, including metrics such as input/output token counts, Time To First Token (TTFT), and Tokens Per Second (TPS).
- trace: Very detailed tracing information, including everything from debug level plus all input prompts.
GGUF Support
The llama-server backend works with Lemonade's suggested *-GGUF models, as well as any .gguf model from Hugging Face. Windows and Ubuntu Linux are supported. Details:
- Lemonade Server wraps llama-server with support for the lemonade CLI, client web app, and endpoints (e.g., models, pull, load, etc.).
- The chat/completions, completions, embeddings, and reranking endpoints are supported.
- The embeddings endpoint requires embedding-specific models (e.g., nomic-embed-text models).
- The reranking endpoint requires reranker-specific models (e.g., bge-reranker models).
- responses is not supported at this time.
- A single Lemonade Server process can seamlessly switch between GGUF, ONNX, and FastFlowLM models.
- Lemonade Server will attempt to load models onto GPU with Vulkan first, and if that doesn't work it will fall back to CPU.
- From the end-user's perspective, OGA vs. GGUF should be completely transparent: they won't be aware of whether the built-in server or llama-server is serving their model.
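To illustrate the OpenAI-compatible path, the sketch below sends a chat completion request to a GGUF model through the standard OpenAI Python SDK. The model name is illustrative; substitute any model reported by `GET /api/v1/models`. The placeholder API key is only there because the SDK requires a value:

```python
from openai import OpenAI

# Minimal sketch: talk to a GGUF model through Lemonade's OpenAI-compatible API.
# Base URL and model name follow the examples in this document; adjust as needed.
client = OpenAI(base_url="http://localhost:13305/api/v1", api_key="lemonade")

response = client.chat.completions.create(
    model="Qwen3-0.6B-GGUF",  # illustrative; use any locally available GGUF model
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```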
Installing GGUF Models
To install an arbitrary GGUF from Hugging Face, open the Lemonade web app by navigating to http://localhost:13305 in your web browser, click the Model Management tab, and use the Add a Model form.
Platform Support Matrix
| Platform | GPU Acceleration | CPU Architecture |
|---|---|---|
| Windows | ✅ Vulkan, ROCm | ✅ x64 |
| Ubuntu | ✅ Vulkan, ROCm | ✅ x64 |
| Other Linux | ⚠️* Vulkan | ⚠️* x64 |
*Other Linux distributions may work but are not officially supported.
FastFlowLM Support
Similar to the llama-server support, Lemonade can also route OpenAI API requests to a FastFlowLM flm serve backend.
The flm serve backend works with Lemonade's suggested *-FLM models, as well as any model mentioned in flm list. Windows is the only supported operating system. Details:
- Lemonade Server wraps flm serve with support for the lemonade CLI, client web app, and all Lemonade custom endpoints (e.g., pull, load, etc.).
- OpenAI API endpoints supported: models, chat/completions (streaming), and embeddings.
- The embeddings endpoint requires embedding-specific models supported by FLM.
- A single Lemonade Server process can seamlessly switch between FLM, OGA, and GGUF models.
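As with GGUF models, FLM-backed models are reached through the same OpenAI-compatible endpoints. The sketch below streams a chat completion; the model name is hypothetical, so substitute any *-FLM model reported by `GET /api/v1/models`:

```python
from openai import OpenAI

# Minimal sketch: stream a chat completion from an FLM-backed model.
# The model name below is hypothetical; pick a real *-FLM model from /api/v1/models.
client = OpenAI(base_url="http://localhost:13305/api/v1", api_key="lemonade")

stream = client.chat.completions.create(
    model="Llama-3.2-1B-Instruct-FLM",  # hypothetical example name
    messages=[{"role": "user", "content": "Give me a one-line fun fact."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```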
Installing FLM Models
To install an arbitrary FLM model:
1. Run `flm list` to view the supported models.
2. Open the Lemonade web app by navigating to http://localhost:13305 in your web browser, click the Model Management tab, and use the Add a Model form.
3. Use the model name from `flm list` as the "checkpoint name" in the Add a Model form and select "flm" as the recipe.