The Lemonade SDK provides a server process that exposes a standards-compliant REST API, enabling communication with other applications.
Lemonade Server currently supports two backends:
| Backend | Model Format | Description |
|---|---|---|
| ONNX Runtime GenAI (OGA) | `.onnx` | Lemonade’s built-in server, recommended for standard use on AMD platforms. |
| Llama.cpp (Experimental) | `.gguf` | Uses llama.cpp’s Vulkan-powered `llama-server` backend. More details here. |
Right now, the key endpoints of the OpenAI API are available.
We are also actively investigating and developing additional endpoints that will improve the experience of local applications.
The OpenAI-compatible endpoints are:

- `POST /api/v1/chat/completions` - Chat Completions (messages -> completion)
- `POST /api/v1/completions` - Text Completions (prompt -> completion)
- `POST /api/v1/responses` - Responses (prompt | messages -> event)
- `GET /api/v1/models` - List models available locally

🚧 These additional endpoints are a preview that is under active development. The API specification is subject to change.
These additional endpoints were inspired by the LM Studio REST API, Ollama API, and OpenAI API.
They focus on enabling client applications by extending existing cloud-focused APIs (e.g., OpenAI) to also include the ability to load and unload models before completion requests are made. These extensions allow for a greater degree of UI/UX responsiveness in native applications.
The additional endpoints under development are:
- `/api/v1/pull` - Install a model
- `/api/v1/load` - Load a model
- `/api/v1/unload` - Unload a model
- `/api/v1/params` - Set generation parameters
- `/api/v1/health` - Check server health
- `/api/v1/stats` - Performance statistics from the last request

🚧 We are in the process of developing this interface. Let us know what’s important to you on GitHub or by email (lemonade at amd dot com).
NOTE: This server is intended for use on local systems only. Do not expose the server port to the open internet.
See the Lemonade Server getting started instructions.
If you have Lemonade installed in a Python environment, simply activate it and run the following command to start the server:
```bash
lemonade-server-dev serve
```
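Once the server is running, you can confirm it is reachable before making completion requests. The sketch below assumes the default port 8000 used throughout this document; the `health` endpoint is described in detail later in this section.

```bash
# Quick check that the server is up and reports its loaded model (assumes port 8000)
curl http://localhost:8000/api/v1/health
```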
POST /api/v1/chat/completions
Chat Completions API. You provide a list of messages and receive a completion. This API will also load the model if it is not already loaded.
| Parameter | Required | Description | Status |
|---|---|---|---|
| `messages` | Yes | Array of messages in the conversation. Each message should have a role (“user” or “assistant”) and content (the message text). | |
| `model` | Yes | The model to use for the completion. | |
| `stream` | No | If true, tokens will be sent as they are generated. If false, the response will be sent as a single message once complete. Defaults to false. | |
| `stop` | No | Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence. Can be a string or an array of strings. | |
| `logprobs` | No | Include log probabilities of the output tokens. If true, returns the log probability of each output token. Defaults to false. | |
| `temperature` | No | What sampling temperature to use. | |
| `tools` | No | A list of tools the model may call. | |
| `max_tokens` | No | An upper bound for the number of tokens that can be generated for a completion. Mutually exclusive with `max_completion_tokens`. This value is now deprecated by OpenAI in favor of `max_completion_tokens`. | |
| `max_completion_tokens` | No | An upper bound for the number of tokens that can be generated for a completion. Mutually exclusive with `max_tokens`. | |
Note: The value for `model` is either a Lemonade Server model name, or a checkpoint that has been pre-loaded using the load endpoint.
=== "PowerShell"
```powershell
Invoke-WebRequest `
  -Uri "http://localhost:8000/api/v1/chat/completions" `
  -Method POST `
  -Headers @{ "Content-Type" = "application/json" } `
  -Body '{
    "model": "Llama-3.2-1B-Instruct-Hybrid",
    "messages": [
      {
        "role": "user",
        "content": "What is the population of Paris?"
      }
    ],
    "stream": false
  }'
```

=== "Bash"
```bash
curl -X POST http://localhost:8000/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama-3.2-1B-Instruct-Hybrid",
    "messages": [
      {"role": "user", "content": "What is the population of Paris?"}
    ],
    "stream": false
  }'
```
=== "Non-streaming responses"
```json
{
  "id": "0",
  "object": "chat.completion",
  "created": 1742927481,
  "model": "Llama-3.2-1B-Instruct-Hybrid",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Paris has a population of approximately 2.2 million people in the city proper."
    },
    "finish_reason": "stop"
  }]
}
```

=== "Streaming responses"
For streaming responses, the API returns a stream of server-sent events (however, OpenAI recommends using their streaming libraries for parsing streaming responses):
```json
{
  "id": "0",
  "object": "chat.completion.chunk",
  "created": 1742927481,
  "model": "Llama-3.2-1B-Instruct-Hybrid",
  "choices": [{
    "index": 0,
    "delta": {
      "role": "assistant",
      "content": "Paris"
    }
  }]
}
```
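To experiment with streaming from the command line, the sketch below requests a streamed chat completion with curl; the `-N` flag disables output buffering so each server-sent event prints as it arrives. It assumes the same local server and model used in the examples above.

```bash
# Stream tokens as server-sent events; -N turns off curl's output buffering
curl -N -X POST http://localhost:8000/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama-3.2-1B-Instruct-Hybrid",
    "messages": [
      {"role": "user", "content": "What is the population of Paris?"}
    ],
    "stream": true
  }'
```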
POST /api/v1/completions
Text Completions API. You provide a prompt and receive a completion. This API will also load the model if it is not already loaded.
| Parameter | Required | Description | Status |
|---|---|---|---|
| `prompt` | Yes | The prompt to use for the completion. | |
| `model` | Yes | The model to use for the completion. | |
| `stream` | No | If true, tokens will be sent as they are generated. If false, the response will be sent as a single message once complete. Defaults to false. | |
| `stop` | No | Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence. Can be a string or an array of strings. | |
| `echo` | No | Echo back the prompt in addition to the completion. Only available in non-streaming mode. | |
| `logprobs` | No | Include log probabilities of the output tokens. If true, returns the log probability of each output token. Defaults to false. Only available when `stream=False`. | |
| `temperature` | No | What sampling temperature to use. | |
| `max_tokens` | No | An upper bound for the number of tokens that can be generated for a completion, including input tokens. | |
Note: The value for `model` is either a Lemonade Server model name, or a checkpoint that has been pre-loaded using the load endpoint.
=== "PowerShell"
```powershell
Invoke-WebRequest -Uri "http://localhost:8000/api/v1/completions" `
  -Method POST `
  -Headers @{ "Content-Type" = "application/json" } `
  -Body '{
    "model": "Llama-3.2-1B-Instruct-Hybrid",
    "prompt": "What is the population of Paris?",
    "stream": false
  }'
```
=== "Bash"
```bash
curl -X POST http://localhost:8000/api/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama-3.2-1B-Instruct-Hybrid",
    "prompt": "What is the population of Paris?",
    "stream": false
  }'
```
The following format is used for both streaming and non-streaming responses:
```json
{
  "id": "0",
  "object": "text_completion",
  "created": 1742927481,
  "model": "Llama-3.2-1B-Instruct-Hybrid",
  "choices": [{
    "index": 0,
    "text": "Paris has a population of approximately 2.2 million people in the city proper.",
    "finish_reason": "stop"
  }]
}
```
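As a sketch of how the optional parameters described above combine, the request below sets `stop`, `echo`, and `max_tokens`; the prompt and values are illustrative and can be adjusted for your model.

```bash
# Non-streaming text completion exercising the optional stop, echo, and max_tokens parameters
curl -X POST http://localhost:8000/api/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama-3.2-1B-Instruct-Hybrid",
    "prompt": "List three facts about Paris:",
    "stop": ["\n\n"],
    "echo": true,
    "max_tokens": 100,
    "stream": false
  }'
```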
POST /api/v1/responses
Responses API. You provide an input and receive a response. This API will also load the model if it is not already loaded.
| Parameter | Required | Description | Status |
|---|---|---|---|
| `input` | Yes | A list of dictionaries or a string input for the model to respond to. | |
| `model` | Yes | The model to use for the response. | |
| `max_output_tokens` | No | The maximum number of output tokens to generate. | |
| `temperature` | No | What sampling temperature to use. | |
| `stream` | No | If true, tokens will be sent as they are generated. If false, the response will be sent as a single message once complete. Defaults to false. | |
Note: The value for `model` is either a Lemonade Server model name, or a checkpoint that has been pre-loaded using the load endpoint.
The Responses API uses semantic events for streaming. Each event is typed with a predefined schema, so you can listen for the events you care about. Our initial implementation only supports the following event types:

- `response.created`
- `response.output_text.delta`
- `response.completed`
For a full list of event types, see the API reference for streaming.
=== "PowerShell"
```powershell
Invoke-WebRequest -Uri "http://localhost:8000/api/v1/responses" `
  -Method POST `
  -Headers @{ "Content-Type" = "application/json" } `
  -Body '{
    "model": "Llama-3.2-1B-Instruct-Hybrid",
    "input": "What is the population of Paris?",
    "stream": false
  }'
```
=== "Bash"
```bash
curl -X POST http://localhost:8000/api/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama-3.2-1B-Instruct-Hybrid",
    "input": "What is the population of Paris?",
    "stream": false
  }'
```
=== "Non-streaming responses"
```json
{
  "id": "0",
  "created_at": 1746225832.0,
  "model": "Llama-3.2-1B-Instruct-Hybrid",
  "object": "response",
  "output": [{
    "id": "0",
    "content": [{
      "annotations": [],
      "text": "Paris has a population of approximately 2.2 million people in the city proper."
    }]
  }]
}
```
=== "Streaming responses"

For streaming responses, the API returns a series of events. Refer to the OpenAI streaming guide for details.
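As with chat completions, you can observe the event stream directly from the command line; the sketch below assumes the same local server and model as the examples above and uses `-N` so events print as they arrive.

```bash
# Stream semantic events (response.created, response.output_text.delta, response.completed)
curl -N -X POST http://localhost:8000/api/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama-3.2-1B-Instruct-Hybrid",
    "input": "What is the population of Paris?",
    "stream": true
  }'
```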
GET /api/v1/models
Returns a list of key models available on the server in an OpenAI-compatible format. Each model object is also extended with the `checkpoint` and `recipe` fields, which may be used to load a model using the `load` endpoint.
This list is curated based on what works best for Ryzen AI Hybrid. Only models available locally are shown.
This endpoint does not take any parameters.
```bash
curl http://localhost:8000/api/v1/models
```
```json
{
  "object": "list",
  "data": [
    {
      "id": "Qwen2.5-0.5B-Instruct-CPU",
      "created": 1744173590,
      "object": "model",
      "owned_by": "lemonade",
      "checkpoint": "amd/Qwen2.5-0.5B-Instruct-quantized_int4-float16-cpu-onnx",
      "recipe": "oga-cpu"
    },
    {
      "id": "Llama-3.2-1B-Instruct-Hybrid",
      "created": 1744173590,
      "object": "model",
      "owned_by": "lemonade",
      "checkpoint": "amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid",
      "recipe": "oga-hybrid"
    }
  ]
}
```
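For scripting, the response can be filtered with standard JSON tooling. A sketch, assuming `jq` is installed:

```bash
# Print only the IDs of the locally available models (assumes jq is installed)
curl -s http://localhost:8000/api/v1/models | jq -r '.data[].id'
```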
GET /api/v1/pull
Install a model by downloading it and registering it with Lemonade Server.
| Parameter | Required | Description |
|---|---|---|
| `model_name` | Yes | Lemonade Server model name to install. |
Example request:
```bash
curl http://localhost:8000/api/v1/pull \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "Qwen2.5-0.5B-Instruct-CPU"
  }'
```
Response format:
```json
{
  "status": "success",
  "message": "Installed model: Qwen2.5-0.5B-Instruct-CPU"
}
```
In case of an error, the status will be `error` and the message will contain the error message.
GET /api/v1/load
Explicitly load a model into memory. This is useful to ensure that the model is loaded before you make a request. Installs the model if necessary.
There are two distinct ways to load a model:

- By Lemonade Server model name (recommended)
- By Hugging Face checkpoint and Lemonade recipe

The parameters for these two ways of loading are mutually exclusive. We intend load-by-name to be used in the general case, since it references a curated set of models in a concise way. Load-by-checkpoint can be used when a user or developer wants to try a model that isn’t in the curated list.
Load by Lemonade Server Model Name (Recommended)
| Parameter | Required | Description |
|---|---|---|
| `model_name` | Yes | Lemonade Server model name to load. |
Example request:
```bash
curl http://localhost:8000/api/v1/load \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "Qwen2.5-0.5B-Instruct-CPU"
  }'
```
Response format:
```json
{
  "status": "success",
  "message": "Loaded model: Qwen2.5-0.5B-Instruct-CPU"
}
```
In case of an error, the status will be `error` and the message will contain the error message.
Load by Hugging Face Checkpoint and Lemonade Recipe
Note: load-by-checkpoint will download that checkpoint if it is not already available in your Hugging Face cache.
| Parameter | Required | Description |
|---|---|---|
| `checkpoint` | Yes | Hugging Face checkpoint to load. |
| `recipe` | Yes | Lemonade API recipe to load the model on. |
| `reasoning` | No | Whether the model is a reasoning model, like DeepSeek (default: false). |
| `mmproj` | No | Multimodal projector (mmproj) file to use for vision models. |
Example request:
```bash
curl http://localhost:8000/api/v1/load \
  -H "Content-Type: application/json" \
  -d '{
    "checkpoint": "amd/Qwen2.5-0.5B-Instruct-quantized_int4-float16-cpu-onnx",
    "recipe": "oga-cpu"
  }'
```
Response format:
```json
{
  "status": "success",
  "message": "Loaded model: amd/Qwen2.5-0.5B-Instruct-quantized_int4-float16-cpu-onnx"
}
```
In case of an error, the status will be `error` and the message will contain the error message.
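As noted in the completions sections above, a checkpoint loaded this way can then be referenced directly as the `model` value in subsequent requests. A sketch, reusing the checkpoint from the example above:

```bash
# Chat completion against the checkpoint loaded in the previous example
curl -X POST http://localhost:8000/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "amd/Qwen2.5-0.5B-Instruct-quantized_int4-float16-cpu-onnx",
    "messages": [
      {"role": "user", "content": "What is the population of Paris?"}
    ],
    "stream": false
  }'
```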
POST /api/v1/unload
Explicitly unload a model from memory. This is useful to free up memory while leaving the server process running (the idle server uses minimal resources, but takes a few seconds to start).
This endpoint does not take any parameters.
```bash
curl -X POST http://localhost:8000/api/v1/unload
```
```json
{
  "status": "success",
  "message": "Model unloaded successfully"
}
```
In case of an error, the status will be `error` and the message will contain the error message.
POST /api/v1/params
Set the generation parameters for text completion. These parameters will persist across requests until changed.
| Parameter | Required | Description |
|---|---|---|
| `temperature` | No | Controls randomness in the output. Higher values (e.g. 0.8) make the output more random; lower values (e.g. 0.2) make it more focused and deterministic. Defaults to 0.7. |
| `top_p` | No | Controls diversity via nucleus sampling. Keeps the cumulative probability of tokens above this value. Defaults to 0.95. |
| `top_k` | No | Controls diversity by limiting to the k most likely next tokens. Defaults to 50. |
| `min_length` | No | The minimum length of the generated text in tokens. Defaults to 0. |
| `max_length` | No | The maximum length of the generated text in tokens. Defaults to 2048. |
| `do_sample` | No | Whether to use sampling (true) or greedy decoding (false). Defaults to true. |
```bash
curl http://localhost:8000/api/v1/params \
  -H "Content-Type: application/json" \
  -d '{
    "temperature": 0.8,
    "top_p": 0.95,
    "max_length": 1000
  }'
```
```json
{
  "status": "success",
  "message": "Generation parameters set successfully",
  "params": {
    "temperature": 0.8,
    "top_p": 0.95,
    "top_k": 40,
    "min_length": 0,
    "max_length": 1000,
    "do_sample": true
  }
}
```
In case of an error, the status will be `error` and the message will contain the error message.
GET /api/v1/health
Check the health of the server. This endpoint will also return the currently loaded model.
This endpoint does not take any parameters.
```bash
curl http://localhost:8000/api/v1/health
```
```json
{
  "status": "ok",
  "checkpoint_loaded": "amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid",
  "model_loaded": "Llama-3.2-1B-Instruct-Hybrid"
}
```
GET /api/v1/stats
Performance statistics from the last request.
This endpoint does not take any parameters.
```bash
curl http://localhost:8000/api/v1/stats
```
```json
{
  "time_to_first_token": 2.14,
  "tokens_per_second": 33.33,
  "input_tokens": 128,
  "output_tokens": 5,
  "decode_token_times": [0.01, 0.02, 0.03, 0.04, 0.05]
}
```
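Because `stats` reports on the last request, a typical flow is to issue a completion and then query the endpoint immediately afterward. A sketch:

```bash
# Run a completion (output discarded), then read the performance statistics it produced
curl -s -X POST http://localhost:8000/api/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Llama-3.2-1B-Instruct-Hybrid", "prompt": "Hello", "stream": false}' > /dev/null

curl http://localhost:8000/api/v1/stats
```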
To help debug the Lemonade server, you can use the `--log-level` parameter to control the verbosity of logging information. The server supports multiple logging levels that provide increasing amounts of detail about server operations.

```bash
lemonade serve --log-level [level]
```

where `[level]` is one of the supported logging levels.
The OGA models (`*-CPU`, `*-Hybrid`) available in Lemonade Server use Lemonade’s built-in server implementation. However, Lemonade SDK v7.0.1 introduced experimental support for llama.cpp’s Vulkan `llama-server` as an alternative backend for CPU and GPU.

The `llama-server` backend works with Lemonade’s suggested `*-GGUF` models, as well as any .gguf model from Hugging Face. Details:
- `llama-server` is integrated with support for the `lemonade-server` CLI, client web app, and endpoints (e.g., `models`, `pull`, `load`, etc.).
- The `chat/completions` endpoint, in streaming mode, is the only completions/responses endpoint supported; `completions` and `responses` are not supported at this time.
- Client applications connect to Lemonade Server as usual while `llama-server` is serving their model.

To load an arbitrary GGUF from Hugging Face, use the load endpoint with the recipe set to `llamacpp`:
```bash
curl http://localhost:8000/api/v1/load \
  -H "Content-Type: application/json" \
  -d '{
    "checkpoint": "unsloth/Qwen3-0.6B-GGUF:Q4_0",
    "recipe": "llamacpp"
  }'
```
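Once the load succeeds, the GGUF checkpoint can be used for chat completions in streaming mode (the only completions-style endpoint supported by the llama-server backend, as noted above). A sketch, assuming the checkpoint string is passed as the `model` value per the note in the chat completions section:

```bash
# Streaming chat completion against the GGUF checkpoint loaded above
curl -N -X POST http://localhost:8000/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "unsloth/Qwen3-0.6B-GGUF:Q4_0",
    "messages": [
      {"role": "user", "content": "What is the population of Paris?"}
    ],
    "stream": true
  }'
```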