Lemonade Server Spec
The Lemonade SDK provides a standards-compliant server process that exposes a REST API for communication with other applications.
Lemonade Server currently supports two backends:
| Backend | Model Format | Description |
|---|---|---|
| ONNX Runtime GenAI (OGA) | .ONNX | Lemonade's built-in server, recommended for standard use on AMD platforms. |
| Llama.cpp | .GGUF | Uses llama.cpp's llama-server backend. More details here. |
| FastFlowLM | .q4nx | Uses FLM's flm serve backend. More details here. |
OGA Endpoints Overview
Right now, the key endpoints of the OpenAI API are available.
We are also actively investigating and developing additional endpoints that will improve the experience of local applications.
OpenAI-Compatible Endpoints
- POST /api/v1/chat/completions - Chat Completions (messages -> completion)
- POST /api/v1/completions - Text Completions (prompt -> completion)
- POST /api/v1/responses - Responses (prompt|messages -> events)
- GET /api/v1/models - List models available locally
- GET /api/v1/models/{model_id} - Retrieve a specific model by ID
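Because these endpoints follow the OpenAI API, any OpenAI-compatible client library can talk to Lemonade Server. A minimal sketch (assuming the openai Python package is installed and the server is running on the default port 8000; the api_key value is an arbitrary placeholder):

```python
# Minimal sketch: point the OpenAI Python client at a local Lemonade Server.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/api/v1",  # Lemonade Server's OpenAI-compatible base URL
    api_key="lemonade",                       # placeholder; assumed to be unused by the local server
)

completion = client.chat.completions.create(
    model="Llama-3.2-1B-Instruct-Hybrid",  # a Lemonade Server model name
    messages=[{"role": "user", "content": "What is the population of Paris?"}],
)
print(completion.choices[0].message.content)
```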
Additional Endpoints
🚧 These additional endpoints are a preview that is under active development. The API specification is subject to change.
These additional endpoints were inspired by the LM Studio REST API, Ollama API, and OpenAI API.
They focus on enabling client applications by extending existing cloud-focused APIs (e.g., OpenAI) to also include the ability to load and unload models before completion requests are made. These extensions allow for a greater degree of UI/UX responsiveness in native applications by allowing applications to:
- Pre-load models at UI-loading-time, as opposed to completion-request time.
- Load models from the local system that were downloaded by other applications (i.e., a common system-wide models cache).
- Unload models to save memory space.
The additional endpoints under development are:
- POST /api/v1/pull - Install a model
- POST /api/v1/load - Load a model
- POST /api/v1/unload - Unload a model
- POST /api/v1/delete - Delete a model
- POST /api/v1/params - Set generation parameters
- GET /api/v1/health - Check server health
- GET /api/v1/stats - Performance statistics from the last request
- GET /api/v1/system-info - System information and device enumeration
🚧 We are in the process of developing this interface. Let us know what's important to you on GitHub or by email (lemonade at amd dot com).
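To illustrate the pattern these endpoints enable, here is a minimal sketch of the pre-load workflow using Python's requests package (illustrative only; the endpoints are documented in detail below and the API specification is subject to change):

```python
# Illustrative sketch of the pre-load pattern built on the additional endpoints.
import requests

BASE = "http://localhost:8000/api/v1"
MODEL = "Qwen2.5-0.5B-Instruct-CPU"  # a Lemonade Server model name

# 1. At application start-up: install (if needed) and load the model.
requests.post(f"{BASE}/pull", json={"model_name": MODEL}).raise_for_status()
requests.post(f"{BASE}/load", json={"model_name": MODEL}).raise_for_status()

# 2. Later, completion requests hit an already-loaded model, avoiding load latency.
resp = requests.post(
    f"{BASE}/chat/completions",
    json={"model": MODEL, "messages": [{"role": "user", "content": "Hello!"}]},
)
print(resp.json()["choices"][0]["message"]["content"])

# 3. When the application no longer needs the model, free the memory.
requests.post(f"{BASE}/unload")
```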
Start the REST API Server
NOTE: This server is intended for use on local systems only. Do not expose the server port to the open internet.
Windows Installer
See the Lemonade Server getting started instructions.
Python Environment
If you have Lemonade installed in a Python environment, simply activate it and run the following command to start the server:
lemonade-server-dev serve
OpenAI-Compatible Endpoints
POST /api/v1/chat/completions 
Chat Completions API. You provide a list of messages and receive a completion. This API will also load the model if it is not already loaded.
Parameters
| Parameter | Required | Description | Status |
|---|---|---|---|
| messages | Yes | Array of messages in the conversation. Each message should have a role ("user" or "assistant") and content (the message text). | |
| model | Yes | The model to use for the completion. | |
| stream | No | If true, tokens will be sent as they are generated. If false, the response will be sent as a single message once complete. Defaults to false. | |
| stop | No | Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence. Can be a string or an array of strings. | |
| logprobs | No | Include log probabilities of the output tokens. If true, returns the log probability of each output token. Defaults to false. | |
| temperature | No | What sampling temperature to use. | |
| repeat_penalty | No | Number between 1.0 and 2.0. 1.0 means no penalty. Higher values discourage repetition. | |
| top_k | No | Integer that controls the number of top tokens to consider during sampling. | |
| top_p | No | Float between 0.0 and 1.0 that controls the cumulative probability of top tokens to consider during nucleus sampling. | |
| tools | No | A list of tools the model may call. | |
| max_tokens | No | An upper bound for the number of tokens that can be generated for a completion. Mutually exclusive with max_completion_tokens. This value is now deprecated by OpenAI in favor of max_completion_tokens. | |
| max_completion_tokens | No | An upper bound for the number of tokens that can be generated for a completion. Mutually exclusive with max_tokens. | |
Note: The value for model is either a Lemonade Server model name, or a checkpoint that has been pre-loaded using the load endpoint.
Example request
Invoke-WebRequest `
-Uri "http://localhost:8000/api/v1/chat/completions" `
-Method POST `
-Headers @{ "Content-Type" = "application/json" } `
-Body '{
"model": "Llama-3.2-1B-Instruct-Hybrid",
"messages": [
{
"role": "user",
"content": "What is the population of Paris?"
}
],
"stream": false
}'
curl -X POST http://localhost:8000/api/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Llama-3.2-1B-Instruct-Hybrid",
"messages": [
{"role": "user", "content": "What is the population of Paris?"}
],
"stream": false
}'
Response format
{
"id": "0",
"object": "chat.completion",
"created": 1742927481,
"model": "Llama-3.2-1B-Instruct-Hybrid",
"choices": [{
"index": 0,
"message": {
"role": "assistant",
"content": "Paris has a population of approximately 2.2 million people in the city proper."
},
"finish_reason": "stop"
}]
}
For streaming responses, the API returns a stream of server-sent events (however, OpenAI recommends using their streaming libraries for parsing streaming responses):
{
"id": "0",
"object": "chat.completion.chunk",
"created": 1742927481,
"model": "Llama-3.2-1B-Instruct-Hybrid",
"choices": [{
"index": 0,
"delta": {
"role": "assistant",
"content": "Paris"
}
}]
}
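Rather than parsing the server-sent events by hand, a streaming request can be consumed with the OpenAI Python client (a sketch, assuming the openai package and a server on the default port):

```python
# Sketch: consume a streaming chat completion with the OpenAI Python client.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/api/v1", api_key="lemonade")  # placeholder key

stream = client.chat.completions.create(
    model="Llama-3.2-1B-Instruct-Hybrid",
    messages=[{"role": "user", "content": "What is the population of Paris?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()
```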
POST /api/v1/completions 
Text Completions API. You provide a prompt and receive a completion. This API will also load the model if it is not already loaded.
Parameters
| Parameter | Required | Description | Status |
|---|---|---|---|
| prompt | Yes | The prompt to use for the completion. | |
| model | Yes | The model to use for the completion. | |
| stream | No | If true, tokens will be sent as they are generated. If false, the response will be sent as a single message once complete. Defaults to false. | |
| stop | No | Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence. Can be a string or an array of strings. | |
| echo | No | Echo back the prompt in addition to the completion. Only available in non-streaming mode. | |
| logprobs | No | Include log probabilities of the output tokens. If true, returns the log probability of each output token. Defaults to false. Only available when stream=false. | |
| temperature | No | What sampling temperature to use. | |
| repeat_penalty | No | Number between 1.0 and 2.0. 1.0 means no penalty. Higher values discourage repetition. | |
| top_k | No | Integer that controls the number of top tokens to consider during sampling. | |
| top_p | No | Float between 0.0 and 1.0 that controls the cumulative probability of top tokens to consider during nucleus sampling. | |
| max_tokens | No | An upper bound for the number of tokens that can be generated for a completion, including input tokens. | |
Note: The value for model is either a Lemonade Server model name, or a checkpoint that has been pre-loaded using the load endpoint.
Example request
Invoke-WebRequest -Uri "http://localhost:8000/api/v1/completions" `
-Method POST `
-Headers @{ "Content-Type" = "application/json" } `
-Body '{
"model": "Llama-3.2-1B-Instruct-Hybrid",
"prompt": "What is the population of Paris?",
"stream": false
}'
curl -X POST http://localhost:8000/api/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Llama-3.2-1B-Instruct-Hybrid",
"prompt": "What is the population of Paris?",
"stream": false
}'
Response format
The following format is used for both streaming and non-streaming responses:
{
"id": "0",
"object": "text_completion",
"created": 1742927481,
"model": "Llama-3.2-1B-Instruct-Hybrid",
"choices": [{
"index": 0,
"text": "Paris has a population of approximately 2.2 million people in the city proper.",
"finish_reason": "stop"
}]
}
POST /api/v1/responses 
Responses API. You provide an input and receive a response. This API will also load the model if it is not already loaded.
Parameters
| Parameter | Required | Description | Status |
|---|---|---|---|
| input | Yes | A list of dictionaries or a string input for the model to respond to. | |
| model | Yes | The model to use for the response. | |
| max_output_tokens | No | The maximum number of output tokens to generate. | |
| temperature | No | What sampling temperature to use. | |
| repeat_penalty | No | Number between 1.0 and 2.0. 1.0 means no penalty. Higher values discourage repetition. | |
| top_k | No | Integer that controls the number of top tokens to consider during sampling. | |
| top_p | No | Float between 0.0 and 1.0 that controls the cumulative probability of top tokens to consider during nucleus sampling. | |
| stream | No | If true, tokens will be sent as they are generated. If false, the response will be sent as a single message once complete. Defaults to false. | |
Note: The value for model is either a Lemonade Server model name, or a checkpoint that has been pre-loaded using the load endpoint.
Streaming Events
The Responses API uses semantic events for streaming. Each event is typed with a predefined schema, so you can listen for the events you care about. Our initial implementation only supports the following event types:
- response.created
- response.output_text.delta
- response.completed
For a full list of event types, see the API reference for streaming.
Example request
Invoke-WebRequest -Uri "http://localhost:8000/api/v1/responses" `
-Method POST `
-Headers @{ "Content-Type" = "application/json" } `
-Body '{
"model": "Llama-3.2-1B-Instruct-Hybrid",
"input": "What is the population of Paris?",
"stream": false
}'
curl -X POST http://localhost:8000/api/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "Llama-3.2-1B-Instruct-Hybrid",
"input": "What is the population of Paris?",
"stream": false
}'
Response format
{
"id": "0",
"created_at": 1746225832.0,
"model": "Llama-3.2-1B-Instruct-Hybrid",
"object": "response",
"output": [{
"id": "0",
"content": [{
"annotations": [],
"text": "Paris has a population of approximately 2.2 million people in the city proper."
}]
}]
}
For streaming responses, the API returns a series of events. Refer to the OpenAI streaming guide for details.
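As a sketch of how the supported event types listed above can be consumed with the OpenAI Python client (assuming the openai package and a server on the default port):

```python
# Sketch: listen for the Responses API streaming events supported by Lemonade Server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/api/v1", api_key="lemonade")  # placeholder key

stream = client.responses.create(
    model="Llama-3.2-1B-Instruct-Hybrid",
    input="What is the population of Paris?",
    stream=True,
)
for event in stream:
    if event.type == "response.created":
        print("response started")
    elif event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)
    elif event.type == "response.completed":
        print("\nresponse finished")
```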
GET /api/v1/models 
Returns a list of key models available on the server in an OpenAI-compatible format. We also expanded each model object with the checkpoint and recipe fields, which may be used to load a model using the load endpoint.
By default, only models available locally (downloaded) are shown, matching OpenAI API behavior.
Parameters
| Parameter | Required | Description |
|---|---|---|
| show_all | No | If set to true, returns all models from the catalog with additional fields (name, downloaded, labels). Used by the CLI list command. Defaults to false. |
Example request
# Show only downloaded models (OpenAI-compatible)
curl http://localhost:8000/api/v1/models
# Show all models with download status (CLI usage)
curl http://localhost:8000/api/v1/models?show_all=true
Response format
Default response (only downloaded models):
{
"object": "list",
"data": [
{
"id": "Qwen2.5-0.5B-Instruct-CPU",
"created": 1744173590,
"object": "model",
"owned_by": "lemonade",
"checkpoint": "amd/Qwen2.5-0.5B-Instruct-quantized_int4-float16-cpu-onnx",
"recipe": "oga-cpu"
},
{
"id": "Llama-3.2-1B-Instruct-Hybrid",
"created": 1744173590,
"object": "model",
"owned_by": "lemonade",
"checkpoint": "amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid",
"recipe": "oga-hybrid"
}
]
}
With show_all=true (includes all models with additional fields):
{
"object": "list",
"data": [
{
"id": "Qwen2.5-0.5B-Instruct-CPU",
"object": "model",
"created": 1744173590,
"owned_by": "lemonade",
"name": "Qwen2.5-0.5B-Instruct-CPU",
"checkpoint": "amd/Qwen2.5-0.5B-Instruct-quantized_int4-float16-cpu-onnx",
"recipe": "oga-cpu",
"downloaded": true,
"labels": ["hot", "cpu"]
},
{
"id": "Llama-3.2-1B-Instruct-Hybrid",
"object": "model",
"created": 1744173590,
"owned_by": "lemonade",
"name": "Llama-3.2-1B-Instruct-Hybrid",
"checkpoint": "amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid",
"recipe": "oga-hybrid",
"downloaded": false,
"labels": ["hot", "hybrid"]
}
]
}
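A client application might enumerate the local models and read the checkpoint and recipe extensions like this (a minimal sketch using Python's requests package):

```python
# Sketch: list locally available models and read Lemonade's extra fields.
import requests

models = requests.get("http://localhost:8000/api/v1/models").json()
for model in models["data"]:
    print(model["id"], model["checkpoint"], model["recipe"])
```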
GET /api/v1/models/{model_id} 
Retrieve a specific model by its ID in an OpenAI-compatible format. Returns detailed information about a single model including the checkpoint and recipe fields.
Parameters
| Parameter | Required | Description |
|---|---|---|
| model_id | Yes | The ID of the model to retrieve. Must match one of the model IDs from the models list. |
Example request
curl http://localhost:8000/api/v1/models/Llama-3.2-1B-Instruct-Hybrid
Response format
{
"id": "Llama-3.2-1B-Instruct-Hybrid",
"created": 1744173590,
"object": "model",
"owned_by": "lemonade",
"checkpoint": "amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid",
"recipe": "oga-hybrid"
}
Error responses
If the model is not found, the endpoint returns a 404 error:
{
"error": {
"message": "Model Llama-3.2-1B-Instruct-Hybrid has not been found",
"type": "not_found"
}
}
Additional Endpoints
POST /api/v1/pull 
Register and install models for use with Lemonade Server.
Parameters
The Lemonade Server built-in model registry has a collection of model names that can be pulled and loaded. The pull endpoint can install any registered model, and it can also register-then-install any model available on Hugging Face.
Install a Model that is Already Registered
| Parameter | Required | Description |
|---|---|---|
| model_name | Yes | Lemonade Server model name to install. |
Example request:
curl -X POST http://localhost:8000/api/v1/pull \
-H "Content-Type: application/json" \
-d '{
"model_name": "Qwen2.5-0.5B-Instruct-CPU"
}'
Response format:
{
"status":"success",
"message":"Installed model: Qwen2.5-0.5B-Instruct-CPU"
}
In case of an error, the status will be error and the message will contain the error message.
Register and Install a Model
Registration will place an entry for that model in the user_models.json file, which is located in the user's Lemonade cache (default: ~/.cache/lemonade). Then, the model will be installed. Once the model is registered and installed, it will show up in the models endpoint alongside the built-in models and can be loaded.
The recipe field defines which software framework and device will be used to load and run the model. For more information on OGA and Hugging Face recipes, see the Lemonade API README. For information on GGUF recipes, see llamacpp.
Note: the model_name for registering a new model must use the user namespace, to prevent collisions with built-in models. For example, user.Phi-4-Mini-GGUF.
| Parameter | Required | Description |
|---|---|---|
| model_name | Yes | Namespaced Lemonade Server model name to register and install. |
| checkpoint | Yes | Hugging Face checkpoint to install. |
| recipe | Yes | Lemonade API recipe to load the model with. |
| reasoning | No | Whether the model is a reasoning model, like DeepSeek (default: false). |
| vision | No | Whether the model has vision capabilities for processing images (default: false). |
| mmproj | No | Multimodal Projector (mmproj) file to use for vision models. |
Example request:
curl -X POST http://localhost:8000/api/v1/pull \
-H "Content-Type: application/json" \
-d '{
"model_name": "user.Phi-4-Mini-GGUF",
"checkpoint": "unsloth/Phi-4-mini-instruct-GGUF:Q4_K_M",
"recipe": "llamacpp"
}'
Response format:
{
"status":"success",
"message":"Installed model: user.Phi-4-Mini-GGUF"
}
In case of an error, the status will be error and the message will contain the error message.
POST /api/v1/delete 
Delete a model by removing it from local storage. If the model is currently loaded, it will be unloaded first.
Parameters
| Parameter | Required | Description |
|---|---|---|
| model_name | Yes | Lemonade Server model name to delete. |
Example request:
curl -X POST http://localhost:8000/api/v1/delete \
-H "Content-Type: application/json" \
-d '{
"model_name": "Qwen2.5-0.5B-Instruct-CPU"
}'
Response format:
{
"status":"success",
"message":"Deleted model: Qwen2.5-0.5B-Instruct-CPU"
}
In case of an error, the status will be error and the message will contain the error message.
POST /api/v1/load 
Explicitly load a registered model into memory. This is useful to ensure that the model is loaded before you make a request. Installs the model if necessary.
Parameters
| Parameter | Required | Description |
|---|---|---|
| model_name | Yes | Lemonade Server model name to load. |
Example request:
curl -X POST http://localhost:8000/api/v1/load \
-H "Content-Type: application/json" \
-d '{
"model_name": "Qwen2.5-0.5B-Instruct-CPU"
}'
Response format:
{
"status":"success",
"message":"Loaded model: Qwen2.5-0.5B-Instruct-CPU"
}
In case of an error, the status will be error and the message will contain the error message.
POST /api/v1/unload 
Explicitly unload a model from memory. This is useful to free up memory while still leaving the server process running (which takes minimal resources but a few seconds to start).
Parameters
This endpoint does not take any parameters.
Example request
curl -X POST http://localhost:8000/api/v1/unload
Response format
{
"status": "success",
"message": "Model unloaded successfully"
}
In case of an error, the status will be error and the message will contain the error message.
POST /api/v1/params 
Set the generation parameters for text completion. These parameters will persist across requests until changed.
Parameters
| Parameter | Required | Description |
|---|---|---|
| temperature | No | Controls randomness in the output. Higher values (e.g. 0.8) make the output more random, lower values (e.g. 0.2) make it more focused and deterministic. Defaults to 0.7. |
| top_p | No | Controls diversity via nucleus sampling. Keeps the cumulative probability of tokens above this value. Defaults to 0.95. |
| top_k | No | Controls diversity by limiting to the k most likely next tokens. Defaults to 50. |
| min_length | No | The minimum length of the generated text in tokens. Defaults to 0. |
| max_length | No | The maximum length of the generated text in tokens. Defaults to 2048. |
| do_sample | No | Whether to use sampling (true) or greedy decoding (false). Defaults to true. |
Example request
curl -X POST http://localhost:8000/api/v1/params \
-H "Content-Type: application/json" \
-d '{
"temperature": 0.8,
"top_p": 0.95,
"max_length": 1000
}'
Response format
{
"status": "success",
"message": "Generation parameters set successfully",
"params": {
"temperature": 0.8,
"top_p": 0.95,
"top_k": 40,
"min_length": 0,
"max_length": 1000,
"do_sample": true
}
}
In case of an error, the status will be error and the message will contain the error message.
GET /api/v1/health 
Check the health of the server. This endpoint will also return the currently loaded model.
Parameters
This endpoint does not take any parameters.
Example request
curl http://localhost:8000/api/v1/health
Response format
{
"status": "ok",
"checkpoint_loaded": "amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid",
"model_loaded": "Llama-3.2-1B-Instruct-Hybrid",
}
GET /api/v1/stats 
Performance statistics from the last request.
Parameters
This endpoint does not take any parameters.
Example request
curl http://localhost:8000/api/v1/stats
Response format
{
"time_to_first_token": 2.14,
"tokens_per_second": 33.33,
"input_tokens": 128,
"output_tokens": 5,
"decode_token_times": [0.01, 0.02, 0.03, 0.04, 0.05],
"prompt_tokens": 9
}
Field Descriptions:
- time_to_first_token - Time in seconds until the first token was generated
- tokens_per_second - Generation speed in tokens per second
- input_tokens - Number of tokens processed
- output_tokens - Number of tokens generated
- decode_token_times - Array of time taken for each generated token
- prompt_tokens - Total prompt tokens including cached tokens
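A client might fetch these statistics right after a completion to display performance in its UI (a minimal sketch using Python's requests package; the field names are the ones listed above):

```python
# Sketch: read performance statistics for the most recent request.
import requests

stats = requests.get("http://localhost:8000/api/v1/stats").json()
print(f"TTFT: {stats['time_to_first_token']:.2f} s")
print(f"Throughput: {stats['tokens_per_second']:.1f} tokens/s")
print(f"Output tokens: {stats['output_tokens']}")
```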
GET /api/v1/system-info 
System information endpoint that provides complete hardware details and device enumeration.
Parameters
| Parameter | Required | Description | Status |
|---|---|---|---|
| verbose | No | Include detailed system information. When false (default), returns essential information (OS, processor, memory, devices). When true, includes additional details like Python packages and extended system information. | |
Example request
curl "http://localhost:8000/api/v1/system-info"
curl "http://localhost:8000/api/v1/system-info?verbose=true"
Response format
{
"OS Version": "Windows-10-10.0.26100-SP0",
"Processor": "AMD Ryzen AI 9 HX 375 w/ Radeon 890M",
"Physical Memory": "32.0 GB",
"devices": {
"cpu": {
"name": "AMD Ryzen AI 9 HX 375 w/ Radeon 890M",
"cores": 12,
"threads": 24,
"available": true
},
"amd_igpu": {
"name": "AMD Radeon(TM) 890M Graphics",
"memory_mb": 512,
"driver_version": 32.0.12010.10001,
"available": true
},
"amd_dgpu": [],
"npu": {
"name": "AMD NPU",
"driver_version": "32.0.203.257",
"power_mode": "Default",
"available": true
}
}
}
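For example, a client could enumerate the reported devices before deciding which models to offer (a minimal sketch using Python's requests package; the keys match the response above):

```python
# Sketch: enumerate the devices reported by the server.
import requests

info = requests.get("http://localhost:8000/api/v1/system-info").json()
for name, device in info["devices"].items():
    # amd_dgpu is a list in the example response above, so only dict entries are inspected here
    if isinstance(device, dict):
        status = "available" if device.get("available") else "unavailable"
        print(f"{name}: {status}")
```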
Debugging
To help debug the Lemonade server, you can use the --log-level parameter to control the verbosity of logging information. The server supports multiple logging levels that provide increasing amounts of detail about server operations.
lemonade-server serve --log-level [level]
Where [level] can be one of:
- critical: Only critical errors that prevent server operation.
- error: Error conditions that might allow continued operation.
- warning: Warning conditions that should be addressed.
- info: (Default) General informational messages about server operation.
- debug: Detailed diagnostic information for troubleshooting, including metrics such as input/output token counts, Time To First Token (TTFT), and Tokens Per Second (TPS).
- trace: Very detailed tracing information, including everything from debug level plus all input prompts.
GGUF Support
The OGA models (*-CPU, *-Hybrid) available in Lemonade Server use Lemonade's built-in server implementation. However, Lemonade SDK v7.0.1 introduced support for llama.cpp's llama-server as an alternative backend for CPU and GPU.
The llama-server backend works with Lemonade's suggested *-GGUF models, as well as any .gguf model from Hugging Face. Windows, Ubuntu Linux, and macOS are supported. Details:
- Lemonade Server wraps llama-server with support for the lemonade-server CLI, client web app, and endpoints (e.g., models, pull, load, etc.).
- The chat/completions, completions, embeddings, and reranking endpoints are supported.
- responses is not supported at this time.
- A single Lemonade Server process can seamlessly switch between OGA and GGUF models.
- Lemonade Server will attempt to load models onto GPU with Vulkan first, and if that doesn't work it will fall back to CPU.
- From the end-user's perspective, OGA vs. GGUF should be completely transparent: they won't be aware of whether the built-in server or llama-server is serving their model.
Installing GGUF Models
To install an arbitrary GGUF from Hugging Face, open the Lemonade web app by navigating to http://localhost:8000 in your web browser, click the Model Management tab, and use the Add a Model form.
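Alternatively, the same registration can be done programmatically through the pull endpoint described earlier, using the llamacpp recipe (a sketch with Python's requests package; the model name and checkpoint are the ones from the pull example above):

```python
# Sketch: register and install a GGUF model through the pull endpoint.
import requests

requests.post(
    "http://localhost:8000/api/v1/pull",
    json={
        "model_name": "user.Phi-4-Mini-GGUF",  # user.* namespace is required for custom models
        "checkpoint": "unsloth/Phi-4-mini-instruct-GGUF:Q4_K_M",
        "recipe": "llamacpp",
    },
).raise_for_status()
```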
Platform Support Matrix
| Platform | GPU Acceleration | CPU Architecture |
|---|---|---|
| Windows | ✅ Vulkan, ROCm | ✅ x64 |
| Ubuntu | ✅ Vulkan, ROCm | ✅ x64 |
| macOS | ✅ Metal | ✅ Apple Silicon |
| Other Linux | ⚠️* Vulkan | ⚠️* x64 |
*Other Linux distributions may work but are not officially supported.
FastFlowLM Support
Similar to the llama-server support, Lemonade can also route OpenAI API requests to a FastFlowLM flm serve backend.
The flm serve backend works with Lemonade's suggested *-FLM models, as well as any model mentioned in flm list. Windows is the only supported operating system. Details:
- Lemonade Server wraps flm serve with support for the lemonade-server CLI, client web app, and all Lemonade custom endpoints (e.g., pull, load, etc.).
- The only OpenAI API endpoints supported are models and chat/completions with stream=true.
- A single Lemonade Server process can seamlessly switch between FLM, OGA, and GGUF models.
Installing FLM Models
To install an arbitrary FLM model:
1. Run flm list to view the supported models.
2. Open the Lemonade web app by navigating to http://localhost:8000 in your web browser, click the Model Management tab, and use the Add a Model form.
3. Use the model name from flm list as the "checkpoint name" in the Add a Model form and select "flm" as the recipe.
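The same form fields map onto the pull endpoint, so an FLM model can also be registered programmatically (a sketch with Python's requests package; the model name user.My-FLM-Model and the checkpoint value are hypothetical placeholders, substitute a name from flm list):

```python
# Sketch: register an FLM model through the pull endpoint.
import requests

requests.post(
    "http://localhost:8000/api/v1/pull",
    json={
        "model_name": "user.My-FLM-Model",           # hypothetical user.* name
        "checkpoint": "<model-name-from-flm-list>",  # placeholder: use a name from `flm list`
        "recipe": "flm",
    },
).raise_for_status()
```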