The Lemonade Server is a standards-compliant server process that provides an HTTP API to enable integration with other applications.
Lemonade Server currently supports these backends:
| Backend | Model Format | Description |
|---|---|---|
| Llama.cpp | .GGUF | Uses llama.cpp's `llama-server` backend. More details here. |
| ONNX Runtime GenAI (OGA) | .ONNX | Uses Lemonade's own `ryzenai-server` backend. |
| FastFlowLM | .q4nx | Uses FLM's `flm serve` backend. More details here. |
The key endpoints of the OpenAI API are available.
We are also actively investigating and developing additional endpoints that will improve the experience of local applications.
- `/api/v1/chat/completions` - Chat Completions (messages -> completion)
- `/api/v1/completions` - Text Completions (prompt -> completion)
- `/api/v1/embeddings` - Embeddings (text -> vector representations)
- `POST /api/v1/responses` - Responses (prompt | messages -> events)
- `/api/v1/models` - List models available locally
- `/api/v1/models/{model_id}` - Retrieve a specific model by ID

These endpoints, defined by llama.cpp, extend the OpenAI-compatible API with additional functionality:

- `/api/v1/reranking` - Reranking (query + documents -> relevance-scored documents)

We have also designed a set of Lemonade-specific endpoints that extend the existing cloud-focused APIs (e.g., OpenAI) to better serve client applications. These extensions allow for a greater degree of UI/UX responsiveness in native applications.
The additional endpoints are:
- `/api/v1/pull` - Install a model
- `/api/v1/load` - Load a model
- `/api/v1/unload` - Unload a model
- `/api/v1/health` - Check server health
- `/api/v1/stats` - Performance statistics from the last request
- `/api/v1/system-info` - System information and device enumeration

Lemonade Server supports loading multiple models simultaneously, allowing you to keep frequently used models in memory for faster switching. The server uses a Least Recently Used (LRU) cache policy to automatically manage model eviction when limits are reached.
Use the --max-loaded-models option to specify how many models to keep loaded:
# Load up to 3 LLMs, 2 embedding models, and 1 reranking model
lemonade-server serve --max-loaded-models 3 2 1
# Load up to 5 LLMs (embeddings and reranking default to 1 each)
lemonade-server serve --max-loaded-models 5
Default: 1 1 1 (one model of each type)
Models are categorized into three types:
- LLMs (chat/completion models)
- Embedding models (identified by the `embeddings` label)
- Reranking models (identified by the `reranking` label)

Each type has its own independent limit and LRU cache.
When a model slot is full, the least recently used model of that type is evicted to make room for the new one. Models currently processing inference requests cannot be evicted until they finish.
Each model can be loaded with custom settings (context size, llamacpp backend, llamacpp args) via the /api/v1/load endpoint. These per-model settings override the default values set via CLI arguments or environment variables. See the /api/v1/load endpoint documentation for details.
Setting Priority Order:
1. `/api/v1/load` request (highest priority)
2. `lemonade-server` CLI arguments or environment variables
3. `lemonade-router` (lowest priority)

NOTE: This server is intended for use on local systems only. Do not expose the server port to the open internet.
See the Lemonade Server getting started instructions.
lemonade-server serve
POST /api/v1/chat/completions Chat Completions API. You provide a list of messages and receive a completion. This API will also load the model if it is not already loaded.
| Parameter | Required | Description | Status |
|---|---|---|---|
| `messages` | Yes | Array of messages in the conversation. Each message should have a `role` ("user" or "assistant") and `content` (the message text). | |
| `model` | Yes | The model to use for the completion. | |
| `stream` | No | If true, tokens will be sent as they are generated. If false, the response will be sent as a single message once complete. Defaults to false. | |
| `stop` | No | Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence. Can be a string or an array of strings. | |
| `logprobs` | No | Include log probabilities of the output tokens. If true, returns the log probability of each output token. Defaults to false. | |
| `temperature` | No | What sampling temperature to use. | |
| `repeat_penalty` | No | Number between 1.0 and 2.0. 1.0 means no penalty. Higher values discourage repetition. | |
| `top_k` | No | Integer that controls the number of top tokens to consider during sampling. | |
| `top_p` | No | Float between 0.0 and 1.0 that controls the cumulative probability of top tokens to consider during nucleus sampling. | |
| `tools` | No | A list of tools the model may call. | |
| `max_tokens` | No | An upper bound for the number of tokens that can be generated for a completion. Mutually exclusive with `max_completion_tokens`. This value is now deprecated by OpenAI in favor of `max_completion_tokens`. | |
| `max_completion_tokens` | No | An upper bound for the number of tokens that can be generated for a completion. Mutually exclusive with `max_tokens`. | |
=== "PowerShell"
```powershell
Invoke-WebRequest `
-Uri "http://localhost:8000/api/v1/chat/completions" `
-Method POST `
-Headers @{ "Content-Type" = "application/json" } `
-Body '{
"model": "Llama-3.2-1B-Instruct-Hybrid",
"messages": [
{
"role": "user",
"content": "What is the population of Paris?"
}
],
"stream": false
}'
```

=== "Bash"
```bash
curl -X POST http://localhost:8000/api/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Llama-3.2-1B-Instruct-Hybrid",
"messages": [
{"role": "user", "content": "What is the population of Paris?"}
],
"stream": false
}'
```
=== "Non-streaming responses"
```json
{
"id": "0",
"object": "chat.completion",
"created": 1742927481,
"model": "Llama-3.2-1B-Instruct-Hybrid",
"choices": [{
"index": 0,
"message": {
"role": "assistant",
"content": "Paris has a population of approximately 2.2 million people in the city proper."
},
"finish_reason": "stop"
}]
}
```

=== "Streaming responses"
For streaming responses, the API returns a stream of server-sent events (note that OpenAI recommends using their client libraries for parsing streaming responses):
```json
{
"id": "0",
"object": "chat.completion.chunk",
"created": 1742927481,
"model": "Llama-3.2-1B-Instruct-Hybrid",
"choices": [{
"index": 0,
"delta": {
"role": "assistant",
"content": "Paris"
}
}]
}
```
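Client applications can also reach these endpoints through the official OpenAI client libraries by pointing them at the local server. The sketch below uses the OpenAI Python package (`pip install openai`); the `api_key` value is a placeholder assumption, since a locally hosted server typically does not validate it.

```python
from openai import OpenAI

# Placeholder api_key (assumption): adjust if your deployment enforces authentication.
client = OpenAI(base_url="http://localhost:8000/api/v1", api_key="lemonade")

# Non-streaming: one complete message comes back
resp = client.chat.completions.create(
    model="Llama-3.2-1B-Instruct-Hybrid",
    messages=[{"role": "user", "content": "What is the population of Paris?"}],
)
print(resp.choices[0].message.content)

# Streaming: print tokens as they arrive
stream = client.chat.completions.create(
    model="Llama-3.2-1B-Instruct-Hybrid",
    messages=[{"role": "user", "content": "What is the population of Paris?"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```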
POST /api/v1/completions Text Completions API. You provide a prompt and receive a completion. This API will also load the model if it is not already loaded.
| Parameter | Required | Description | Status |
|---|---|---|---|
| `prompt` | Yes | The prompt to use for the completion. | |
| `model` | Yes | The model to use for the completion. | |
| `stream` | No | If true, tokens will be sent as they are generated. If false, the response will be sent as a single message once complete. Defaults to false. | |
| `stop` | No | Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence. Can be a string or an array of strings. | |
| `echo` | No | Echo back the prompt in addition to the completion. Only available in non-streaming mode. | |
| `logprobs` | No | Include log probabilities of the output tokens. If true, returns the log probability of each output token. Defaults to false. Only available when `stream=false`. | |
| `temperature` | No | What sampling temperature to use. | |
| `repeat_penalty` | No | Number between 1.0 and 2.0. 1.0 means no penalty. Higher values discourage repetition. | |
| `top_k` | No | Integer that controls the number of top tokens to consider during sampling. | |
| `top_p` | No | Float between 0.0 and 1.0 that controls the cumulative probability of top tokens to consider during nucleus sampling. | |
| `max_tokens` | No | An upper bound for the number of tokens that can be generated for a completion, including input tokens. | |
=== "PowerShell"
```powershell
Invoke-WebRequest -Uri "http://localhost:8000/api/v1/completions" `
-Method POST `
-Headers @{ "Content-Type" = "application/json" } `
-Body '{
"model": "Llama-3.2-1B-Instruct-Hybrid",
"prompt": "What is the population of Paris?",
"stream": false
}'
```
=== "Bash"
```bash
curl -X POST http://localhost:8000/api/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Llama-3.2-1B-Instruct-Hybrid",
"prompt": "What is the population of Paris?",
"stream": false
}'
```
The following format is used for both streaming and non-streaming responses:
{
"id": "0",
"object": "text_completion",
"created": 1742927481,
"model": "Llama-3.2-1B-Instruct-Hybrid",
"choices": [{
"index": 0,
"text": "Paris has a population of approximately 2.2 million people in the city proper.",
"finish_reason": "stop"
  }]
}
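As with Chat Completions, the OpenAI Python client can be pointed at this endpoint. A minimal sketch, with the same placeholder `api_key` assumption as above:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/api/v1", api_key="lemonade")

resp = client.completions.create(
    model="Llama-3.2-1B-Instruct-Hybrid",
    prompt="What is the population of Paris?",
    max_tokens=64,  # optional cap on generated tokens
)
print(resp.choices[0].text)
```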
POST /api/v1/embeddings Embeddings API. You provide input text and receive vector representations (embeddings) that can be used for semantic search, clustering, and similarity comparisons. This API will also load the model if it is not already loaded.
Note: This endpoint is only available for models using the `llamacpp` or `flm` recipes. ONNX models (OGA recipes) do not support embeddings.
| Parameter | Required | Description | Status |
|---|---|---|---|
| `input` | Yes | The input text or array of texts to embed. Can be a string or an array of strings. | |
| `model` | Yes | The model to use for generating embeddings. | |
| `encoding_format` | No | The format to return embeddings in. Supported values: `"float"` (default), `"base64"`. | |
=== "PowerShell"
```powershell
Invoke-WebRequest `
-Uri "http://localhost:8000/api/v1/embeddings" `
-Method POST `
-Headers @{ "Content-Type" = "application/json" } `
-Body '{
"model": "nomic-embed-text-v1-GGUF",
"input": ["Hello, world!", "How are you?"],
"encoding_format": "float"
}'
```
=== "Bash"
```bash
curl -X POST http://localhost:8000/api/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "nomic-embed-text-v1-GGUF",
"input": ["Hello, world!", "How are you?"],
"encoding_format": "float"
}'
```
{
"object": "list",
"data": [
{
"object": "embedding",
"index": 0,
"embedding": [0.0234, -0.0567, 0.0891, ...]
},
{
"object": "embedding",
"index": 1,
"embedding": [0.0456, -0.0678, 0.1234, ...]
}
],
"model": "nomic-embed-text-v1-GGUF",
"usage": {
"prompt_tokens": 12,
"total_tokens": 12
}
}
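As a quick usage sketch (assuming Python with the `requests` package), the two embeddings returned above can be compared with cosine similarity for semantic similarity scoring:

```python
import math
import requests

resp = requests.post(
    "http://localhost:8000/api/v1/embeddings",
    json={
        "model": "nomic-embed-text-v1-GGUF",
        "input": ["Hello, world!", "How are you?"],
        "encoding_format": "float",
    },
)
# Extract the two embedding vectors in request order
a, b = (item["embedding"] for item in resp.json()["data"])

def cosine(x, y):
    dot = sum(p * q for p, q in zip(x, y))
    nx = math.sqrt(sum(p * p for p in x))
    ny = math.sqrt(sum(q * q for q in y))
    return dot / (nx * ny)

print(f"cosine similarity: {cosine(a, b):.4f}")
```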
Field Descriptions:
- `object` - Type of response object, always `"list"`
- `data` - Array of embedding objects
    - `object` - Type of embedding object, always `"embedding"`
    - `index` - Index position of the input text in the request
    - `embedding` - Vector representation as an array of floats
- `model` - Model identifier used to generate the embeddings
- `usage` - Token usage statistics
    - `prompt_tokens` - Number of tokens in the input
    - `total_tokens` - Total tokens processed

POST /api/v1/reranking Reranking API. You provide a query and a list of documents, and receive a relevance score for each document indicating how well it matches the query. This is useful for improving search result quality. This API will also load the model if it is not already loaded.
Note: This endpoint follows API conventions similar to OpenAI’s format but is not part of the official OpenAI API. It is inspired by llama.cpp and other inference server implementations.
Note: This endpoint is only available for models using the `llamacpp` recipe. It is not available for FLM or ONNX models.
| Parameter | Required | Description | Status |
|---|---|---|---|
| `query` | Yes | The search query text. | |
| `documents` | Yes | Array of document strings to be reranked. | |
| `model` | Yes | The model to use for reranking. | |
=== "PowerShell"
```powershell
Invoke-WebRequest `
-Uri "http://localhost:8000/api/v1/reranking" `
-Method POST `
-Headers @{ "Content-Type" = "application/json" } `
-Body '{
"model": "bge-reranker-v2-m3-GGUF",
"query": "What is the capital of France?",
"documents": [
"Paris is the capital of France.",
"Berlin is the capital of Germany.",
"Madrid is the capital of Spain."
]
}'
```
=== "Bash"
```bash
curl -X POST http://localhost:8000/api/v1/reranking \
-H "Content-Type: application/json" \
-d '{
"model": "bge-reranker-v2-m3-GGUF",
"query": "What is the capital of France?",
"documents": [
"Paris is the capital of France.",
"Berlin is the capital of Germany.",
"Madrid is the capital of Spain."
]
}'
```
{
"model": "bge-reranker-v2-m3-GGUF",
"object": "list",
"results": [
{
"index": 0,
"relevance_score": 8.60673713684082
},
{
"index": 1,
"relevance_score": -5.3886260986328125
},
{
"index": 2,
"relevance_score": -3.555561065673828
}
],
"usage": {
"prompt_tokens": 51,
"total_tokens": 51
}
}
Field Descriptions:
- `model` - Model identifier used for reranking
- `object` - Type of response object, always `"list"`
- `results` - Array of all documents with relevance scores
    - `index` - Original index of the document in the input array
    - `relevance_score` - Relevance score assigned by the model (higher = more relevant)
- `usage` - Token usage statistics
    - `prompt_tokens` - Number of tokens in the input
    - `total_tokens` - Total tokens processed

Note: The results are returned in their original input order, not sorted by relevance score. To get documents ranked by relevance, you need to sort the results by `relevance_score` in descending order on the client side, as shown below.
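A small client-side sketch (assuming Python with the `requests` package) that performs this sort:

```python
import requests

documents = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "Madrid is the capital of Spain.",
]
resp = requests.post(
    "http://localhost:8000/api/v1/reranking",
    json={
        "model": "bge-reranker-v2-m3-GGUF",
        "query": "What is the capital of France?",
        "documents": documents,
    },
)
# Results arrive in input order; sort by relevance_score, highest first
ranked = sorted(resp.json()["results"], key=lambda r: r["relevance_score"], reverse=True)
for r in ranked:
    print(f"{r['relevance_score']:+.3f}  {documents[r['index']]}")
```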
POST /api/v1/responses Responses API. You provide an input and receive a response. This API will also load the model if it is not already loaded.
| Parameter | Required | Description | Status |
|---|---|---|---|
| `input` | Yes | A list of dictionaries or a string input for the model to respond to. | |
| `model` | Yes | The model to use for the response. | |
| `max_output_tokens` | No | The maximum number of output tokens to generate. | |
| `temperature` | No | What sampling temperature to use. | |
| `repeat_penalty` | No | Number between 1.0 and 2.0. 1.0 means no penalty. Higher values discourage repetition. | |
| `top_k` | No | Integer that controls the number of top tokens to consider during sampling. | |
| `top_p` | No | Float between 0.0 and 1.0 that controls the cumulative probability of top tokens to consider during nucleus sampling. | |
| `stream` | No | If true, tokens will be sent as they are generated. If false, the response will be sent as a single message once complete. Defaults to false. | |
The Responses API uses semantic events for streaming. Each event is typed with a predefined schema, so you can listen for the events you care about. Our initial implementation only supports:

- `response.created`
- `response.output_text.delta`
- `response.completed`

For a full list of event types, see the API reference for streaming.
=== "PowerShell"
```powershell
Invoke-WebRequest -Uri "http://localhost:8000/api/v1/responses" `
-Method POST `
-Headers @{ "Content-Type" = "application/json" } `
-Body '{
"model": "Llama-3.2-1B-Instruct-Hybrid",
"input": "What is the population of Paris?",
"stream": false
}'
```
=== "Bash"
```bash
curl -X POST http://localhost:8000/api/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "Llama-3.2-1B-Instruct-Hybrid",
"input": "What is the population of Paris?",
"stream": false
}'
```
=== "Non-streaming responses"
```json
{
"id": "0",
"created_at": 1746225832.0,
"model": "Llama-3.2-1B-Instruct-Hybrid",
"object": "response",
"output": [{
"id": "0",
"content": [{
"annotations": [],
"text": "Paris has a population of approximately 2.2 million people in the city proper."
}]
}]
}
```
=== "Streaming responses"

For streaming responses, the API returns a series of events. Refer to the OpenAI streaming guide for details.
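For reference, here is a sketch of consuming the streamed events over raw SSE with Python's `requests` package. It assumes OpenAI-style framing in which each `data:` line carries a JSON object whose `type` field names the event and delta events carry new text in `delta`; verify the exact framing against the server's output.

```python
import json
import requests

with requests.post(
    "http://localhost:8000/api/v1/responses",
    json={
        "model": "Llama-3.2-1B-Instruct-Hybrid",
        "input": "What is the population of Paris?",
        "stream": True,
    },
    stream=True,
) as resp:
    for line in resp.iter_lines():
        # Only process "data:" lines of the SSE stream
        if not line or not line.startswith(b"data:"):
            continue
        payload = line[len(b"data:"):].strip()
        if payload == b"[DONE]":
            break
        event = json.loads(payload)
        if event.get("type") == "response.output_text.delta":
            print(event.get("delta", ""), end="", flush=True)
        elif event.get("type") == "response.completed":
            print()
```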
GET /api/v1/models Returns a list of models available on the server in an OpenAI-compatible format. Each model object includes extended fields like checkpoint, recipe, size, downloaded, and labels.
By default, only models available locally (downloaded) are shown, matching OpenAI API behavior.
| Parameter | Required | Description |
|---|---|---|
| `show_all` | No | If set to true, returns all models from the catalog including those not yet downloaded. Defaults to false. |
# Show only downloaded models (OpenAI-compatible)
curl http://localhost:8000/api/v1/models
# Show all models including not-yet-downloaded (extended usage)
curl http://localhost:8000/api/v1/models?show_all=true
{
"object": "list",
"data": [
{
"id": "Qwen3-0.6B-GGUF",
"created": 1744173590,
"object": "model",
"owned_by": "lemonade",
"checkpoint": "unsloth/Qwen3-0.6B-GGUF:Q4_0",
"recipe": "llamacpp",
"size": 0.38,
"downloaded": true,
"suggested": true,
"labels": ["reasoning"]
},
{
"id": "Gemma-3-4b-it-GGUF",
"created": 1744173590,
"object": "model",
"owned_by": "lemonade",
"checkpoint": "ggml-org/gemma-3-4b-it-GGUF:Q4_K_M",
"recipe": "llamacpp",
"size": 3.61,
"downloaded": true,
"suggested": true,
"labels": ["hot", "vision"]
}
]
}
Field Descriptions:
- `object` - Type of response object, always `"list"`
- `data` - Array of model objects with the following fields:
    - `id` - Model identifier (used for loading and inference requests)
    - `created` - Unix timestamp of when the model entry was created
    - `object` - Type of object, always `"model"`
    - `owned_by` - Owner of the model, always `"lemonade"`
    - `checkpoint` - Full checkpoint identifier on Hugging Face
    - `recipe` - Backend/device recipe used to load the model (e.g., `"oga-cpu"`, `"oga-hybrid"`, `"llamacpp"`, `"flm"`)
    - `size` - Model size in GB (omitted for models without size information)
    - `downloaded` - Boolean indicating if the model is downloaded and available locally
    - `suggested` - Boolean indicating if the model is recommended for general use
    - `labels` - Array of tags describing the model (e.g., `"hot"`, `"reasoning"`, `"vision"`, `"embeddings"`, `"reranking"`, `"coding"`, `"tool-calling"`)

GET /api/v1/models/{model_id} Retrieve a specific model by its ID. Returns the same model object format as the list endpoint above.
| Parameter | Required | Description |
|---|---|---|
| `model_id` | Yes | The ID of the model to retrieve. Must match one of the model IDs from the models list. |
curl http://localhost:8000/api/v1/models/Qwen3-0.6B-GGUF
Returns a single model object with the same fields as described in the models list endpoint above.
{
"id": "Qwen3-0.6B-GGUF",
"created": 1744173590,
"object": "model",
"owned_by": "lemonade",
"checkpoint": "unsloth/Qwen3-0.6B-GGUF:Q4_0",
"recipe": "llamacpp",
"size": 0.38,
"downloaded": true,
"suggested": true,
"labels": ["reasoning"]
}
If the model is not found, the endpoint returns a 404 error:
{
"error": {
"message": "Model Llama-3.2-1B-Instruct-Hybrid has not been found",
"type": "not_found"
}
}
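A sketch (Python with the `requests` package) that combines this endpoint with the pull endpoint described next: check whether a model is available locally and install it if not.

```python
import requests

BASE = "http://localhost:8000/api/v1"
MODEL = "Qwen3-0.6B-GGUF"

resp = requests.get(f"{BASE}/models/{MODEL}")
if resp.status_code == 404 or not resp.json().get("downloaded", False):
    # Not installed yet: ask the server to download it
    pull = requests.post(f"{BASE}/pull", json={"model_name": MODEL})
    print(pull.json())
else:
    print(f"{MODEL} is already available locally")
```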
POST /api/v1/pull Register and install models for use with Lemonade Server.
The Lemonade Server built-in model registry has a collection of model names that can be pulled and loaded. The pull endpoint can install any registered model, and it can also register-then-install any model available on Hugging Face.
Common Parameters
| Parameter | Required | Description |
|---|---|---|
| `stream` | No | If true, returns Server-Sent Events (SSE) with download progress. Defaults to false. |
Install a Model that is Already Registered
| Parameter | Required | Description |
|---|---|---|
| `model_name` | Yes | Lemonade Server model name to install. |
Example request:
curl -X POST http://localhost:8000/api/v1/pull \
-H "Content-Type: application/json" \
-d '{
"model_name": "Qwen2.5-0.5B-Instruct-CPU"
}'
Response format:
{
"status":"success",
"message":"Installed model: Qwen2.5-0.5B-Instruct-CPU"
}
In case of an error, the status will be error and the message will contain the error message.
Register and Install a Model
Registration will place an entry for that model in the user_models.json file, which is located in the user’s Lemonade cache (default: ~/.cache/lemonade). Then, the model will be installed. Once the model is registered and installed, it will show up in the models endpoint alongside the built-in models and can be loaded.
The recipe field defines which software framework and device will be used to load and run the model. For more information on OGA and Hugging Face recipes, see the Lemonade API README. For information on GGUF recipes, see llamacpp.
Note: the `model_name` for registering a new model must use the `user` namespace to prevent collisions with built-in models. For example, `user.Phi-4-Mini-GGUF`.
| Parameter | Required | Description |
|---|---|---|
| `model_name` | Yes | Namespaced Lemonade Server model name to register and install. |
| `checkpoint` | Yes | Hugging Face checkpoint to install. |
| `recipe` | Yes | Lemonade API recipe to load the model with. |
| `reasoning` | No | Whether the model is a reasoning model, like DeepSeek (default: false). Adds the 'reasoning' label. |
| `vision` | No | Whether the model has vision capabilities for processing images (default: false). Adds the 'vision' label. |
| `embedding` | No | Whether the model is an embedding model (default: false). Adds the 'embeddings' label. |
| `reranking` | No | Whether the model is a reranking model (default: false). Adds the 'reranking' label. |
| `mmproj` | No | Multimodal Projector (mmproj) file to use for vision models. |
Example request:
curl -X POST http://localhost:8000/api/v1/pull \
-H "Content-Type: application/json" \
-d '{
"model_name": "user.Phi-4-Mini-GGUF",
"checkpoint": "unsloth/Phi-4-mini-instruct-GGUF:Q4_K_M",
"recipe": "llamacpp"
}'
Response format:
{
"status":"success",
"message":"Installed model: user.Phi-4-Mini-GGUF"
}
In case of an error, the status will be error and the message will contain the error message.
When stream=true, the endpoint returns Server-Sent Events with real-time download progress:
event: progress
data: {"file":"model.gguf","file_index":1,"total_files":2,"bytes_downloaded":1073741824,"bytes_total":2684354560,"percent":40}
event: progress
data: {"file":"config.json","file_index":2,"total_files":2,"bytes_downloaded":1024,"bytes_total":1024,"percent":100}
event: complete
data: {"file_index":2,"total_files":2,"percent":100}
Event Types:
| Event | Description |
|---|---|
| `progress` | Sent during download with current file and byte progress |
| `complete` | Sent when all files are downloaded successfully |
| `error` | Sent if download fails, with an `error` field containing the message |
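A sketch (Python with the `requests` package) that follows these events and prints download progress; it assumes the `event:`/`data:` SSE framing shown above:

```python
import json
import requests

with requests.post(
    "http://localhost:8000/api/v1/pull",
    json={"model_name": "Qwen2.5-0.5B-Instruct-CPU", "stream": True},
    stream=True,
) as resp:
    event_type = None
    for line in resp.iter_lines():
        if line.startswith(b"event:"):
            # Remember the event name for the data line that follows
            event_type = line[len(b"event:"):].strip().decode()
        elif line.startswith(b"data:"):
            data = json.loads(line[len(b"data:"):])
            if event_type == "progress":
                print(f"{data.get('file', '')}: {data.get('percent', 0)}%")
            elif event_type == "complete":
                print("download complete")
            elif event_type == "error":
                print("error:", data.get("error"))
```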
POST /api/v1/delete Delete a model by removing it from local storage. If the model is currently loaded, it will be unloaded first.
| Parameter | Required | Description |
|---|---|---|
| `model_name` | Yes | Lemonade Server model name to delete. |
Example request:
curl -X POST http://localhost:8000/api/v1/delete \
-H "Content-Type: application/json" \
-d '{
"model_name": "Qwen2.5-0.5B-Instruct-CPU"
}'
Response format:
{
"status":"success",
"message":"Deleted model: Qwen2.5-0.5B-Instruct-CPU"
}
In case of an error, the status will be error and the message will contain the error message.
POST /api/v1/load Explicitly load a registered model into memory. This is useful to ensure that the model is loaded before you make a request. Installs the model if necessary.
| Parameter | Required | Description |
|---|---|---|
| `model_name` | Yes | Lemonade Server model name to load. |
| `ctx_size` | No | Context size for the model. Overrides the default value for this model. |
| `llamacpp_backend` | No | LlamaCpp backend to use (vulkan, rocm, or metal). Only applies to llamacpp models. Overrides the default value for this model. |
| `llamacpp_args` | No | Custom arguments to pass to llama-server. Must not conflict with arguments managed by Lemonade (e.g., -m, --port, --ctx-size, -ngl). Overrides the default value for this model. |
Setting Priority:
When loading a model, settings are applied in this priority order:
1. `load` request (highest priority)
2. `lemonade-server` CLI arguments or environment variables
3. `lemonade-router` (lowest priority)

Basic load:
curl -X POST http://localhost:8000/api/v1/load \
-H "Content-Type: application/json" \
-d '{
"model_name": "Qwen2.5-0.5B-Instruct-CPU"
}'
Load with custom settings:
curl -X POST http://localhost:8000/api/v1/load \
-H "Content-Type: application/json" \
-d '{
"model_name": "llama-3.2-3b-instruct-GGUF",
"ctx_size": 8192,
"llamacpp_backend": "rocm",
"llamacpp_args": "--flash-attn on --no-mmap"
}'
{
"status":"success",
"message":"Loaded model: Qwen2.5-0.5B-Instruct-CPU"
}
In case of an error, the status will be error and the message will contain the error message.
POST /api/v1/unload Explicitly unload a model from memory. This is useful to free up memory while still leaving the server process running (which takes minimal resources but a few seconds to start).
| Parameter | Required | Description |
|---|---|---|
| `model_name` | No | Name of the specific model to unload. If not provided, all loaded models will be unloaded. |
Unload a specific model:
curl -X POST http://localhost:8000/api/v1/unload \
-H "Content-Type: application/json" \
-d '{"model_name": "Llama-3.2-1B-Instruct-Hybrid"}'
Unload all models:
curl -X POST http://localhost:8000/api/v1/unload
Success response:
{
"status": "success",
"message": "Model unloaded successfully"
}
Error response (model not found):
{
"status": "error",
"message": "Model not found: Llama-3.2-1B-Instruct-Hybrid"
}
In case of an error, the status will be error and the message will contain the error message.
GET /api/v1/health Check the health of the server. This endpoint returns information about loaded models.
This endpoint does not take any parameters.
curl http://localhost:8000/api/v1/health
{
"status": "ok",
"checkpoint_loaded": "amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid",
"model_loaded": "Llama-3.2-1B-Instruct-Hybrid",
"all_models_loaded": [
{
"model_name": "Llama-3.2-1B-Instruct-Hybrid",
"checkpoint": "amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid",
"last_use": 1732123456.789,
"type": "llm",
"device": "gpu npu",
"backend_url": "http://127.0.0.1:8001/v1"
},
{
"model_name": "nomic-embed-text-v1-GGUF",
"checkpoint": "nomic-ai/nomic-embed-text-v1-GGUF:Q4_K_S",
"last_use": 1732123450.123,
"type": "embedding",
"device": "gpu",
"backend_url": "http://127.0.0.1:8002/v1"
}
]
}
Field Descriptions:
- `status` - Server health status, always `"ok"`
- `checkpoint_loaded` - Checkpoint identifier of the most recently accessed model
- `model_loaded` - Model name of the most recently accessed model
- `all_models_loaded` - Array of all currently loaded models with details:
    - `model_name` - Name of the loaded model
    - `checkpoint` - Full checkpoint identifier
    - `last_use` - Unix timestamp of last access (load or inference)
    - `type` - Model type: `"llm"`, `"embedding"`, or `"reranking"`
    - `device` - Space-separated device list: `"cpu"`, `"gpu"`, `"npu"`, or combinations like `"gpu npu"`
    - `backend_url` - URL of the backend server process handling this model (useful for debugging)

GET /api/v1/stats Performance statistics from the last request.
This endpoint does not take any parameters.
curl http://localhost:8000/api/v1/stats
{
"time_to_first_token": 2.14,
"tokens_per_second": 33.33,
"input_tokens": 128,
"output_tokens": 5,
"decode_token_times": [0.01, 0.02, 0.03, 0.04, 0.05],
"prompt_tokens": 9
}
Field Descriptions:
- `time_to_first_token` - Time in seconds until the first token was generated
- `tokens_per_second` - Generation speed in tokens per second
- `input_tokens` - Number of tokens processed
- `output_tokens` - Number of tokens generated
- `decode_token_times` - Array of time taken for each generated token
- `prompt_tokens` - Total prompt tokens including cached tokens

GET /api/v1/system-info System information endpoint that provides complete hardware details and device enumeration.
| Parameter | Required | Description | Status |
|---|---|---|---|
| `verbose` | No | Include detailed system information. When false (default), returns essential information (OS, processor, memory, devices). When true, includes additional details like Python packages and extended system information. | |
=== "Basic system information"
```bash
curl "http://localhost:8000/api/v1/system-info"
```
=== "Detailed system information"
```bash
curl "http://localhost:8000/api/v1/system-info?verbose=true"
```
=== "Basic response (verbose=false)"
```json
{
"OS Version": "Windows-10-10.0.26100-SP0",
"Processor": "AMD Ryzen AI 9 HX 375 w/ Radeon 890M",
"Physical Memory": "32.0 GB",
"devices": {
"cpu": {
"name": "AMD Ryzen AI 9 HX 375 w/ Radeon 890M",
"cores": 12,
"threads": 24,
"available": true
},
"amd_igpu": {
"name": "AMD Radeon(TM) 890M Graphics",
"memory_mb": 512,
"driver_version": 32.0.12010.10001,
"available": true
},
"amd_dgpu": [],
"npu": {
"name": "AMD NPU",
"driver_version": "32.0.203.257",
"power_mode": "Default",
"available": true
}
}
}
```
To help debug the Lemonade server, you can use the --log-level parameter to control the verbosity of logging information. The server supports multiple logging levels that provide increasing amounts of detail about server operations.
lemonade-server serve --log-level [level]
Where `[level]` is one of the supported logging levels.
The llama-server backend works with Lemonade’s suggested *-GGUF models, as well as any .gguf model from Hugging Face. Windows, Ubuntu Linux, and macOS are supported. Details:
- Integrates llama-server with support for the lemonade-server CLI, client web app, and endpoints (e.g., models, pull, load, etc.).
- The chat/completions, completions, embeddings, and reranking endpoints are supported.
    - The embeddings endpoint requires embedding-specific models (e.g., nomic-embed-text models).
    - The reranking endpoint requires reranker-specific models (e.g., bge-reranker models).
- responses is not supported at this time.
- llama-server is serving their model.

To install an arbitrary GGUF from Hugging Face, open the Lemonade web app by navigating to http://localhost:8000 in your web browser, click the Model Management tab, and use the Add a Model form.
| Platform | GPU Acceleration | CPU Architecture |
|---|---|---|
| Windows | ✅ Vulkan, ROCm | ✅ x64 |
| Ubuntu | ✅ Vulkan, ROCm | ✅ x64 |
| macOS | ✅ Metal | ✅ Apple Silicon |
| Other Linux | ⚠️* Vulkan | ⚠️* x64 |
*Other Linux distributions may work but are not officially supported.
Similar to the llama-server support, Lemonade can also route OpenAI API requests to a FastFlowLM flm serve backend.
The flm serve backend works with Lemonade’s suggested *-FLM models, as well as any model mentioned in flm list. Windows is the only supported operating system. Details:
- Integrates flm serve with support for the lemonade-server CLI, client web app, and all Lemonade custom endpoints (e.g., pull, load, etc.).
- The models, chat/completions (streaming), and embeddings endpoints are supported.
    - The embeddings endpoint requires embedding-specific models supported by FLM.

To install an arbitrary FLM model:

1. Run flm list to view the supported models.
2. Use the model name from flm list as the "checkpoint name" in the Add a Model form and select "flm" as the recipe.