This spec defines Lemonade’s implementation of the OpenAI API.
| Method | Endpoint | Description | Modality |
|---|---|---|---|
| POST | /v1/chat/completions | Chat Completions | messages -> completion |
| POST | /v1/completions | Text Completions | prompt -> completion |
| POST | /v1/embeddings | Embeddings | text -> vector representations |
| POST | /v1/responses | Responses API | prompt/messages -> event |
| POST | /v1/audio/transcriptions | Audio Transcription | audio file -> text |
| POST | /v1/audio/speech | Text to speech | text -> audio |
| WS | /realtime | Realtime Audio Transcription, OpenAI SDK compatible | streaming audio -> text |
| POST | /v1/images/generations | Image Generation | prompt -> image |
| POST | /v1/images/edits | Image Editing | image + prompt -> edited image |
| POST | /v1/images/variations | Image Variations | image -> varied image |
| POST | /v1/images/upscale | Image Upscaling | image + ESRGAN model -> upscaled image |
| GET | /v1/models | List models available locally | n/a |
| GET | /v1/models/{model_id} | Retrieve a specific model by ID | n/a |
POST /v1/chat/completions

Chat Completions API. You provide a list of messages and receive a completion. This API will also load the model if it is not already loaded.
| Parameter | Required | Description | Status |
|---|---|---|---|
| messages | Yes | Array of messages in the conversation. Each message should have a role (“user” or “assistant”) and content (the message text). | |
| model | Yes | The model to use for the completion. | |
| stream | No | If true, tokens will be sent as they are generated. If false, the response will be sent as a single message once complete. Defaults to false. | |
| stop | No | Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence. Can be a string or an array of strings. | |
| logprobs | No | Include log probabilities of the output tokens. If true, returns the log probability of each output token. Defaults to false. | |
| temperature | No | What sampling temperature to use. | |
| repeat_penalty | No | Number between 1.0 and 2.0. 1.0 means no penalty. Higher values discourage repetition. | |
| top_k | No | Integer that controls the number of top tokens to consider during sampling. | |
| top_p | No | Float between 0.0 and 1.0 that controls the cumulative probability of top tokens to consider during nucleus sampling. | |
| tools | No | A list of tools the model may call. | |
| max_tokens | No | An upper bound for the number of tokens that can be generated for a completion. Mutually exclusive with max_completion_tokens. This value is now deprecated by OpenAI in favor of max_completion_tokens. | |
| max_completion_tokens | No | An upper bound for the number of tokens that can be generated for a completion. Mutually exclusive with max_tokens. | |
=== "PowerShell"
```powershell
Invoke-WebRequest `
-Uri "http://localhost:13305/v1/chat/completions" `
-Method POST `
-Headers @{ "Content-Type" = "application/json" } `
-Body '{
"model": "Qwen3-0.6B-GGUF",
"messages": [
{
"role": "user",
"content": "What is the population of Paris?"
}
],
"stream": false
}'
```
=== "Bash"
```bash
curl -X POST http://localhost:13305/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3-0.6B-GGUF",
"messages": [
{"role": "user", "content": "What is the population of Paris?"}
],
"stream": false
}'
```
To send images to chat/completions, pass a messages[*].content array that mixes text and image_url items. The image can be provided as a base64 data URL (for example, from FileReader.readAsDataURL(...) in web apps).
```bash
curl -X POST http://localhost:13305/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen2.5-VL-7B-Instruct",
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "What is in this image?"},
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD..."}}
]
}
],
"stream": false
}'
```
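The data URL in the request above can be produced outside a browser as well. The following is a minimal Python sketch, not part of Lemonade itself: `image_to_data_url` and `build_vision_request` are hypothetical helper names, and the MIME type is guessed from the file name.

```python
import base64
import mimetypes


def image_to_data_url(image_bytes: bytes, filename: str) -> str:
    """Encode raw image bytes as a base64 data URL."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type from {filename!r}")
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_vision_request(model: str, question: str, data_url: str) -> dict:
    """Build a chat/completions body mixing a text item and an image_url item."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": data_url}},
                ],
            }
        ],
        "stream": False,
    }
```

The resulting dictionary can be serialized with `json.dumps` and sent as the request body to /v1/chat/completions.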
=== "Non-streaming responses"
```json
{
"id": "0",
"object": "chat.completion",
"created": 1742927481,
"model": "Qwen3-0.6B-GGUF",
"choices": [{
"index": 0,
"message": {
"role": "assistant",
"content": "Paris has a population of approximately 2.2 million people in the city proper."
},
"finish_reason": "stop"
}]
}
```
=== "Streaming responses"
For streaming responses, the API returns a stream of server-sent events (OpenAI recommends using its streaming libraries to parse them):
```json
{
"id": "0",
"object": "chat.completion.chunk",
"created": 1742927481,
"model": "Qwen3-0.6B-GGUF",
"choices": [{
"index": 0,
"delta": {
"role": "assistant",
"content": "Paris"
}
}]
}
```
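If an SDK is not available, the chunks can be reassembled by hand. This is a minimal Python sketch under two assumptions that this document does not confirm for Lemonade: each event arrives as a `data: <json>` line, and the stream ends with a `data: [DONE]` sentinel, as in OpenAI's server-sent event stream. The function names are illustrative.

```python
import json
from typing import Iterable, Iterator


def iter_chunks(lines: Iterable[str]) -> Iterator[dict]:
    """Yield parsed chat.completion.chunk objects from server-sent event lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # assumed end-of-stream sentinel
            break
        yield json.loads(payload)


def collect_text(lines: Iterable[str]) -> str:
    """Concatenate the delta.content fragments of every chunk."""
    parts = []
    for chunk in iter_chunks(lines):
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)
```

`collect_text` accepts any iterable of decoded lines, for example the lines of an HTTP response body read with `"stream": true`.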
POST /v1/completions

Text Completions API. You provide a prompt and receive a completion. This API will also load the model if it is not already loaded.
| Parameter | Required | Description | Status |
|---|---|---|---|
| prompt | Yes | The prompt to use for the completion. | |
| model | Yes | The model to use for the completion. | |
| stream | No | If true, tokens will be sent as they are generated. If false, the response will be sent as a single message once complete. Defaults to false. | |
| stop | No | Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence. Can be a string or an array of strings. | |
| echo | No | Echo back the prompt in addition to the completion. Only available in non-streaming mode. | |
| logprobs | No | Include log probabilities of the output tokens. If true, returns the log probability of each output token. Defaults to false. Only available when stream is false. | |
| temperature | No | What sampling temperature to use. | |
| repeat_penalty | No | Number between 1.0 and 2.0. 1.0 means no penalty. Higher values discourage repetition. | |
| top_k | No | Integer that controls the number of top tokens to consider during sampling. | |
| top_p | No | Float between 0.0 and 1.0 that controls the cumulative probability of top tokens to consider during nucleus sampling. | |
| max_tokens | No | An upper bound for the number of tokens that can be generated for a completion, including input tokens. | |
=== "PowerShell"
```powershell
Invoke-WebRequest -Uri "http://localhost:13305/v1/completions" `
-Method POST `
-Headers @{ "Content-Type" = "application/json" } `
-Body '{
"model": "Qwen3-0.6B-GGUF",
"prompt": "What is the population of Paris?",
"stream": false
}'
```
=== "Bash"
```bash
curl -X POST http://localhost:13305/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3-0.6B-GGUF",
"prompt": "What is the population of Paris?",
"stream": false
}'
```
The following format is used for both streaming and non-streaming responses:
```json
{
  "id": "0",
  "object": "text_completion",
  "created": 1742927481,
  "model": "Qwen3-0.6B-GGUF",
  "choices": [{
    "index": 0,
    "text": "Paris has a population of approximately 2.2 million people in the city proper.",
    "finish_reason": "stop"
  }]
}
```
POST /v1/embeddings

Embeddings API. You provide input text and receive vector representations (embeddings) that can be used for semantic search, clustering, and similarity comparisons. This API will also load the model if it is not already loaded.

Note: This endpoint is only available for models using the llamacpp or flm recipes. ONNX models (OGA recipes) do not support embeddings.
| Parameter | Required | Description | Status |
|---|---|---|---|
| input | Yes | The input text or array of texts to embed. Can be a string or an array of strings. | |
| model | Yes | The model to use for generating embeddings. | |
| encoding_format | No | The format to return embeddings in. Supported values: "float" (default), "base64". | |
=== "PowerShell"
```powershell
Invoke-WebRequest `
-Uri "http://localhost:13305/v1/embeddings" `
-Method POST `
-Headers @{ "Content-Type" = "application/json" } `
-Body '{
"model": "nomic-embed-text-v1-GGUF",
"input": ["Hello, world!", "How are you?"],
"encoding_format": "float"
}'
```
=== "Bash"
```bash
curl -X POST http://localhost:13305/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "nomic-embed-text-v1-GGUF",
"input": ["Hello, world!", "How are you?"],
"encoding_format": "float"
}'
```
```json
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.0234, -0.0567, 0.0891, ...]
    },
    {
      "object": "embedding",
      "index": 1,
      "embedding": [0.0456, -0.0678, 0.1234, ...]
    }
  ],
  "model": "nomic-embed-text-v1-GGUF",
  "usage": {
    "prompt_tokens": 12,
    "total_tokens": 12
  }
}
```
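As an illustration of the similarity comparisons mentioned earlier, here is a small Python sketch that reads a response in the shape shown above (with `encoding_format` set to "float") and compares two embeddings with cosine similarity. The helper names are hypothetical and the metric is a common choice, not something this API prescribes.

```python
import math


def embeddings_in_input_order(response: dict) -> list:
    """Return the embedding vectors sorted by their index field."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Values close to 1.0 indicate vectors pointing in nearly the same direction; values near 0.0 indicate unrelated directions.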
Field Descriptions:

- object - Type of response object, always "list"
- data - Array of embedding objects
    - object - Type of embedding object, always "embedding"
    - index - Index position of the input text in the request
    - embedding - Vector representation as an array of floats
- model - Model identifier used to generate the embeddings
- usage - Token usage statistics
    - prompt_tokens - Number of tokens in the input
    - total_tokens - Total tokens processed

POST /v1/responses

Responses API. You provide an input and receive a response. This API will also load the model if it is not already loaded.
| Parameter | Required | Description | Status |
|---|---|---|---|
| input | Yes | A list of dictionaries or a string input for the model to respond to. | |
| model | Yes | The model to use for the response. | |
| max_output_tokens | No | The maximum number of output tokens to generate. | |
| temperature | No | What sampling temperature to use. | |
| repeat_penalty | No | Number between 1.0 and 2.0. 1.0 means no penalty. Higher values discourage repetition. | |
| top_k | No | Integer that controls the number of top tokens to consider during sampling. | |
| top_p | No | Float between 0.0 and 1.0 that controls the cumulative probability of top tokens to consider during nucleus sampling. | |
| stream | No | If true, tokens will be sent as they are generated. If false, the response will be sent as a single message once complete. Defaults to false. | |
The Responses API uses semantic events for streaming. Each event is typed with a predefined schema, so you can listen for the events you care about. The initial implementation supports only the following events:

- response.created
- response.output_text.delta
- response.completed

For a full list of event types, see the API reference for streaming.
=== "PowerShell"
```powershell
Invoke-WebRequest -Uri "http://localhost:13305/v1/responses" `
-Method POST `
-Headers @{ "Content-Type" = "application/json" } `
-Body '{
"model": "Llama-3.2-1B-Instruct-Hybrid",
"input": "What is the population of Paris?",
"stream": false
}'
```
=== "Bash"
```bash
curl -X POST http://localhost:13305/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "Llama-3.2-1B-Instruct-Hybrid",
"input": "What is the population of Paris?",
"stream": false
}'
```
=== "Non-streaming responses"
```json
{
"id": "0",
"created_at": 1746225832.0,
"model": "Llama-3.2-1B-Instruct-Hybrid",
"object": "response",
"output": [{
"id": "0",
"content": [{
"annotations": [],
"text": "Paris has a population of approximately 2.2 million people in the city proper."
}]
}]
}
```
=== "Streaming Responses"
For streaming responses, the API returns a series of events. Refer to the OpenAI streaming guide for details.
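The three supported event types can be folded into a final result with a small dispatcher. The sketch below is illustrative Python, assuming each event has already been parsed into a dictionary and that text fragments arrive in a `delta` field, as in OpenAI's event schema; `handle_events` is a hypothetical name.

```python
from typing import Iterable


def handle_events(events: Iterable[dict]) -> dict:
    """Fold a sequence of Responses API events into a small summary."""
    state = {"started": False, "text": "", "completed": False, "ignored": []}
    for event in events:
        kind = event.get("type")
        if kind == "response.created":
            state["started"] = True
        elif kind == "response.output_text.delta":
            # Assumed field name: the text fragment is carried in "delta"
            state["text"] += event.get("delta", "")
        elif kind == "response.completed":
            state["completed"] = True
        else:
            # Event types outside the supported set are recorded, not processed
            state["ignored"].append(kind)
    return state
```

Recording unknown event types instead of raising keeps the client tolerant if more events are added later.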
POST /v1/audio/transcriptions

Audio Transcription API. You provide an audio file and receive a text transcription. This API will also load the model if it is not already loaded.

Note: This endpoint uses whisper.cpp as the backend. Whisper models are automatically downloaded when first used.

Limitations: Only wav audio format and json response format are currently supported.
| Parameter | Required | Description | Status |
|---|---|---|---|
| file | Yes | The audio file to transcribe. Supported formats: wav. | |
| model | Yes | The Whisper model to use for transcription (e.g., Whisper-Tiny, Whisper-Base, Whisper-Small). | |
| language | No | The language of the audio (ISO 639-1 code, e.g., en, es, fr). If not specified, Whisper will auto-detect the language. | |
| response_format | No | The format of the response. Currently only json is supported. | |
=== "Windows"
```bash
curl -X POST http://localhost:13305/v1/audio/transcriptions ^
-F "file=@C:\path\to\audio.wav" ^
-F "model=Whisper-Tiny"
```
=== "Linux"
```bash
curl -X POST http://localhost:13305/v1/audio/transcriptions \
-F "file=@/path/to/audio.wav" \
-F "model=Whisper-Tiny"
```
```json
{
  "text": "Hello, this is a sample transcription of the audio file."
}
```
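The curl examples above send the audio as multipart/form-data. Python's standard library has no built-in multipart encoder, so the following sketch builds the body by hand. It is an illustration only: the helper names, the fixed `audio.wav` file name, and the `audio/wav` part type are assumptions, and the request is constructed but not sent.

```python
import urllib.request
import uuid


def encode_multipart(fields: dict, file_field: str, filename: str,
                     file_bytes: bytes, file_type: str = "audio/wav"):
    """Return (body, content_type) for a multipart/form-data request."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"'.encode(),
            b"",
            str(value).encode("utf-8"),
        ]
    parts += [
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"'.encode(),
        f"Content-Type: {file_type}".encode(),
        b"",
        file_bytes,
        f"--{boundary}--".encode(),
        b"",
    ]
    return b"\r\n".join(parts), f"multipart/form-data; boundary={boundary}"


def build_transcription_request(base_url: str, wav_bytes: bytes, model: str,
                                language: str = None) -> urllib.request.Request:
    """Build (but do not send) a POST request for /audio/transcriptions."""
    fields = {"model": model}
    if language:
        fields["language"] = language
    body, content_type = encode_multipart(fields, "file", "audio.wav", wav_bytes)
    return urllib.request.Request(
        f"{base_url}/audio/transcriptions",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
```

Passing the returned request to `urllib.request.urlopen` and decoding the body as JSON should yield the `text` field shown above.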
Field Descriptions:

- text - The transcribed text from the audio file

WS /realtime

Realtime Audio Transcription API via WebSocket (OpenAI SDK compatible). Stream audio from a microphone and receive transcriptions in real time with Voice Activity Detection (VAD).
Limitations: Only 16kHz mono PCM16 audio format is supported. Uses the same Whisper models as the HTTP transcription endpoint.
The WebSocket server runs on a dynamically assigned port. Discover the port via the /v1/health endpoint (websocket_port field), then connect with the model name:
ws://localhost:<websocket_port>/realtime?model=Whisper-Tiny
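The discovery step can be sketched in Python as follows. Only the `websocket_port` field and the URL pattern come from this document; the function names and the default host are assumptions, and no other field of the health response is relied on.

```python
import json
import urllib.request
from urllib.parse import urlencode


def fetch_health(base_url: str = "http://localhost:13305/v1") -> dict:
    """GET the health endpoint and return its parsed JSON body."""
    with urllib.request.urlopen(f"{base_url}/health") as resp:
        return json.load(resp)


def realtime_url(health: dict, model: str, host: str = "localhost") -> str:
    """Build the WebSocket URL from a health response and a model name."""
    port = health.get("websocket_port")
    if not isinstance(port, int):
        raise ValueError("health response has no usable websocket_port")
    return f"ws://{host}:{port}/realtime?{urlencode({'model': model})}"
```

A client would call `realtime_url(fetch_health(), "Whisper-Tiny")` and pass the result to the WebSocket library of its choice.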
Upon connection, the server sends a session.created message with a session ID.
Messages sent by the client:

| Message Type | Description |
|---|---|
| session.update | Configure the session (set model, VAD settings, or disable turn detection) |
| input_audio_buffer.append | Send audio data (base64-encoded PCM16) |
| input_audio_buffer.commit | Force transcription of buffered audio |
| input_audio_buffer.clear | Clear audio buffer without transcribing |

Messages sent by the server:

| Message Type | Description |
|---|---|
| session.created | Session established, contains session ID |
| session.updated | Session configuration updated |
| input_audio_buffer.speech_started | VAD detected speech start |
| input_audio_buffer.speech_stopped | VAD detected speech end, transcription triggered |
| input_audio_buffer.committed | Audio buffer committed for transcription |
| input_audio_buffer.cleared | Audio buffer cleared |
| conversation.item.input_audio_transcription.delta | Interim/partial transcription (replaceable) |
| conversation.item.input_audio_transcription.completed | Final transcription result |
| error | Error message |
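To set the model, send a session.update message:

```json
{
  "type": "session.update",
  "session": {
    "model": "Whisper-Tiny"
  }
}
```

To send audio, use input_audio_buffer.append:

```json
{
  "type": "input_audio_buffer.append",
  "audio": "<base64-encoded PCM16 audio>"
}
```

Audio should be 16 kHz mono PCM16, base64-encoded.

The final transcription arrives as a conversation.item.input_audio_transcription.completed message:

```json
{
  "type": "conversation.item.input_audio_transcription.completed",
  "transcript": "Hello, this is a test transcription."
}
```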
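The append and completed messages can be produced and consumed with a few lines of Python. This sketch makes assumptions the document does not state: samples are little-endian, and audio is sent in 3200-byte chunks (100 ms at 16 kHz mono, 2 bytes per sample), which is an arbitrary choice. All function names are hypothetical.

```python
import base64
import json
import struct


def floats_to_pcm16(samples) -> bytes:
    """Convert floats in [-1.0, 1.0] to little-endian signed 16-bit PCM."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    return struct.pack(f"<{len(clipped)}h", *(int(s * 32767) for s in clipped))


def append_messages(pcm: bytes, chunk_bytes: int = 3200):
    """Yield one input_audio_buffer.append message per chunk of PCM16 audio."""
    if chunk_bytes <= 0 or chunk_bytes % 2:
        raise ValueError("chunk_bytes must be a positive, even number")
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })


def final_transcript(message: str):
    """Return the transcript of a completed event, or None for other messages."""
    event = json.loads(message)
    if event.get("type") == "conversation.item.input_audio_transcription.completed":
        return event.get("transcript")
    return None
```

Each string from `append_messages` is sent as one WebSocket text frame, and every incoming frame can be passed to `final_transcript`.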
VAD settings can be configured via session.update:
```json
{
  "type": "session.update",
  "session": {
    "model": "Whisper-Tiny",
    "turn_detection": {
      "threshold": 0.01,
      "silence_duration_ms": 800,
      "prefix_padding_ms": 250
    }
  }
}
```
| Parameter | Default | Description |
|---|---|---|
| threshold | 0.01 | RMS energy threshold for speech detection |
| silence_duration_ms | 800 | Silence duration to trigger speech end |
| prefix_padding_ms | 250 | Minimum speech duration before triggering |
Set turn_detection to null to disable server-side VAD and use explicit commits instead:
```json
{
  "type": "session.update",
  "session": {
    "model": "Whisper-Tiny",
    "turn_detection": null
  }
}
```
See the examples/ directory for a complete, runnable example:

- realtime_transcription.py - Python CLI for microphone streaming

```bash
# Stream from microphone
python examples/realtime_transcription.py --model Whisper-Tiny
```

- Set turn_detection to null, then use input_audio_buffer.commit to force transcription. In this mode the server buffers audio but does not emit VAD or interim transcription events.
- Use input_audio_buffer.clear to discard audio without transcribing.

POST /v1/images/generations

Image Generation API. You provide a text prompt and receive a generated image. This API uses stable-diffusion.cpp as the backend.
Note: Image generation uses Stable Diffusion models. Available models include SD-Turbo (fast, ~4 steps), SDXL-Turbo, SD-1.5, and SDXL-Base-1.0.

Performance: CPU inference takes ~4-5 minutes per image. GPU (Vulkan) is faster but may have compatibility issues with some hardware.
| Parameter | Required | Description | Status |
|---|---|---|---|
| prompt | Yes | The text description of the image to generate. | |
| model | Yes | The Stable Diffusion model to use (e.g., SD-Turbo, SDXL-Turbo). | |
| size | No | The size of the generated image. Format: WIDTHxHEIGHT (e.g., 512x512, 256x256). Default: 512x512. | |
| n | No | Number of images to generate. Currently only 1 is supported. | |
| response_format | No | Format of the response. Only b64_json (base64-encoded image) is supported. | |
| steps | No | Number of inference steps. SD-Turbo works well with 4 steps. Default varies by model. | |
| cfg_scale | No | Classifier-free guidance scale. SD-Turbo uses low values (~1.0). Default varies by model. | |
| seed | No | Random seed for reproducibility. If not specified, a random seed is used. | |
=== "Bash"
```bash
curl -X POST http://localhost:13305/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"model": "SD-Turbo",
"prompt": "A serene mountain landscape at sunset",
"size": "512x512",
"steps": 4,
"response_format": "b64_json"
}'
```
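Because only b64_json is supported, the client has to decode the image itself. A minimal Python sketch, assuming the response carries the image under `data[*].b64_json` (the shape the upscale workflow later in this document reads from a generation response); the helper names are illustrative.

```python
import base64


def extract_images(response: dict) -> list:
    """Decode every data[*].b64_json entry into raw image bytes."""
    return [base64.b64decode(item["b64_json"]) for item in response.get("data", [])]


def save_first_image(response: dict, path: str) -> int:
    """Write the first decoded image to path and return its size in bytes."""
    images = extract_images(response)
    if not images:
        raise ValueError("response contains no images")
    with open(path, "wb") as out_file:
        out_file.write(images[0])
    return len(images[0])
```

For example, `save_first_image(json.loads(body), "landscape.png")` would write the generated image to disk, where `body` is the response text from the curl call above.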
POST /v1/images/edits

Image Editing API. You provide a source image and a text prompt describing the desired change, and receive an edited image. This API uses stable-diffusion.cpp as the backend.

Note: This endpoint accepts multipart/form-data requests (not JSON). Use editing-capable models such as Flux-2-Klein-4B or SD-Turbo.

Performance: CPU inference takes several minutes per image. GPU (ROCm) is significantly faster.
| Parameter | Required | Description | Status |
|---|---|---|---|
| model | Yes | The Stable Diffusion model to use (e.g., Flux-2-Klein-4B, SD-Turbo). | |
| image | Yes | The source image file to edit (PNG). Sent as a file in multipart/form-data. | |
| prompt | Yes | A text description of the desired edit. | |
| mask | No | An optional mask image (PNG). White areas indicate regions to edit; black areas are preserved. | |
| size | No | The size of the output image. Format: WIDTHxHEIGHT (e.g., 512x512). Default: 512x512. | |
| n | No | Number of images to generate. Allowed range: 1–10. Default: 1. Values outside this range are rejected with 400 Bad Request. | |
| response_format | No | Format of the response. Only b64_json (base64-encoded image) is supported. | |
| steps | No | Number of inference steps. Default varies by model. | |
| cfg_scale | No | Classifier-free guidance scale. Default varies by model. | |
| seed | No | Random seed for reproducibility. | |
| user | No | OpenAI API compatibility field. Accepted but not forwarded to the backend. | |
| background | No | OpenAI API compatibility field. Accepted but not forwarded to the backend. | |
| quality | No | OpenAI API compatibility field. Accepted but not forwarded to the backend. | |
| input_fidelity | No | OpenAI API compatibility field. Accepted but not forwarded to the backend. | |
| output_compression | No | OpenAI API compatibility field. Accepted; silently ignored by the backend. | |
=== "Bash"
```bash
curl -X POST http://localhost:13305/v1/images/edits \
-F "model=Flux-2-Klein-4B" \
-F "prompt=Add a red barn and mountains in the background, photorealistic" \
-F "size=512x512" \
-F "n=1" \
-F "response_format=b64_json" \
-F "image=@/path/to/source_image.png"
```
=== "Python (OpenAI client)"
```python
import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:13305/api/v1", api_key="not-needed")

with open("source_image.png", "rb") as image_file:
    response = client.images.edit(
        model="Flux-2-Klein-4B",
        image=image_file,
        prompt="Add a red barn and mountains in the background, photorealistic",
        size="512x512",
    )

# Decode the base64 payload and write the edited image to disk
image_data = base64.b64decode(response.data[0].b64_json)
with open("edited_image.png", "wb") as out_file:
    out_file.write(image_data)
```
POST /v1/images/variations

Image Variations API. You provide a source image and receive a variation of it. This API uses stable-diffusion.cpp as the backend.

Note: This endpoint accepts multipart/form-data requests (not JSON). Unlike /images/edits, a prompt parameter is not supported and will be ignored; the model generates a variation based solely on the input image.

Performance: CPU inference takes several minutes per image. GPU (ROCm) is significantly faster.
| Parameter | Required | Description | Status |
|---|---|---|---|
| model | Yes | The Stable Diffusion model to use (e.g., Flux-2-Klein-4B, SD-Turbo). | |
| image | Yes | The source image file (PNG). Sent as a file in multipart/form-data. | |
| size | No | The size of the output image. Format: WIDTHxHEIGHT (e.g., 512x512). Default: 512x512. | |
| n | No | Number of variations to generate. Integer between 1 and 10 inclusive. Default: 1. Values outside this range result in a 400 Bad Request error. | |
| response_format | No | Format of the response. Only b64_json (base64-encoded image) is supported. | |
| user | No | OpenAI API compatibility field. Accepted but not forwarded to the backend. | |
=== "Bash"
```bash
curl -X POST http://localhost:13305/v1/images/variations \
-F "model=Flux-2-Klein-4B" \
-F "size=512x512" \
-F "n=1" \
-F "response_format=b64_json" \
-F "image=@/path/to/source_image.png"
```
=== "Python (OpenAI client)"
```python
import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:13305/api/v1", api_key="not-needed")

with open("source_image.png", "rb") as image_file:
    response = client.images.create_variation(
        model="Flux-2-Klein-4B",
        image=image_file,
        size="512x512",
        n=1,
    )

# Decode the base64 payload and write the variation to disk
image_data = base64.b64decode(response.data[0].b64_json)
with open("variation.png", "wb") as out_file:
    out_file.write(image_data)
```
POST /v1/images/upscale

Image Upscaling API. You provide a base64-encoded image and a Real-ESRGAN model name, and receive a 4x upscaled image. This API uses the sd-cli binary from stable-diffusion.cpp to perform super-resolution.

Note: Available upscale models are RealESRGAN-x4plus (general-purpose, 64 MB) and RealESRGAN-x4plus-anime (optimized for anime-style art, 17 MB). Both produce a 4x resolution increase (e.g., 256x256 → 1024x1024).

Note: Unlike /images/edits and /images/variations, this endpoint accepts a JSON body (not multipart/form-data). The image must be provided as a base64-encoded string.
| Parameter | Required | Description | Status |
|---|---|---|---|
| image | Yes | Base64-encoded PNG image to upscale. | |
| model | Yes | The ESRGAN model to use (e.g., RealESRGAN-x4plus, RealESRGAN-x4plus-anime). | |
A typical workflow is to generate an image first, then upscale it:
=== "Bash"
```bash
# Step 1: Generate an image and save the base64 response
RESPONSE=$(curl -s -X POST http://localhost:13305/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"model": "SD-Turbo",
"prompt": "A serene mountain landscape at sunset",
"size": "512x512",
"steps": 4,
"response_format": "b64_json"
}')
# Step 2: Build the upscale JSON payload and pipe it to curl via stdin
# (base64 images are too large for command-line interpolation)
echo "$RESPONSE" | python3 -c "
import sys, json
b64 = json.load(sys.stdin)['data'][0]['b64_json']
print(json.dumps({'image': b64, 'model': 'RealESRGAN-x4plus'}))
" | curl -X POST http://localhost:13305/v1/images/upscale \
-H "Content-Type: application/json" \
-d @-
```
=== "PowerShell"
```powershell
# Step 1: Generate an image
$genResponse = Invoke-WebRequest `
-Uri "http://localhost:13305/v1/images/generations" `
-Method POST `
-Headers @{ "Content-Type" = "application/json" } `
-Body '{
"model": "SD-Turbo",
"prompt": "A serene mountain landscape at sunset",
"size": "512x512",
"steps": 4,
"response_format": "b64_json"
}'
# Step 2: Extract the base64 image
$imageB64 = ($genResponse.Content | ConvertFrom-Json).data[0].b64_json
# Step 3: Upscale the image with Real-ESRGAN
$body = @{ image = $imageB64; model = "RealESRGAN-x4plus" } | ConvertTo-Json
Invoke-WebRequest `
-Uri "http://localhost:13305/v1/images/upscale" `
-Method POST `
-Headers @{ "Content-Type" = "application/json" } `
-Body $body
```
=== "Python (requests)"
```python
import requests
import base64
BASE_URL = "http://localhost:13305/api/v1"
# Step 1: Generate an image
gen_response = requests.post(f"{BASE_URL}/images/generations", json={
"model": "SD-Turbo",
"prompt": "A serene mountain landscape at sunset",
"size": "512x512",
"steps": 4,
"response_format": "b64_json",
})
image_b64 = gen_response.json()["data"][0]["b64_json"]
# Step 2: Upscale the image with Real-ESRGAN (512x512 -> 2048x2048)
upscale_response = requests.post(f"{BASE_URL}/images/upscale", json={
"image": image_b64,
"model": "RealESRGAN-x4plus",
})
# Step 3: Save the upscaled image to a file
upscaled_b64 = upscale_response.json()["data"][0]["b64_json"]
with open("upscaled.png", "wb") as f:
    f.write(base64.b64decode(upscaled_b64))
```
```json
{
  "created": 1742927481,
  "data": [
    {
      "b64_json": "<base64-encoded upscaled PNG>"
    }
  ]
}
```
Field Descriptions:

- created - Unix timestamp of when the upscaled image was generated
- data - Array containing the upscaled image
    - b64_json - Base64-encoded PNG of the upscaled image

| Status Code | Condition | Example |
|---|---|---|
| 400 | Missing image field | {"error": {"message": "Missing 'image' field (base64 encoded)", "type": "invalid_request_error"}} |
| 400 | Missing model field | {"error": {"message": "Missing 'model' field", "type": "invalid_request_error"}} |
| 404 | Unknown model name | {"error": {"message": "Upscale model not found: bad-model", "type": "invalid_request_error"}} |
| 500 | Upscale failed | {"error": {"message": "ESRGAN upscale failed", "type": "server_error"}} |
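A client can turn these error responses into readable messages. The sketch below is illustrative Python that relies only on the status codes and the `error.message` and `error.type` fields shown in the table; the function name and the hint texts are made up for this example.

```python
import json


def describe_upscale_error(status: int, body: str) -> str:
    """Summarize an error response from /images/upscale in one line."""
    try:
        parsed = json.loads(body)
    except ValueError:
        parsed = {}
    error = parsed.get("error", {}) if isinstance(parsed, dict) else {}
    if not isinstance(error, dict):
        error = {}
    message = error.get("message", "unknown error")
    kind = error.get("type", "unknown")
    if status == 400:
        hint = "check that both 'image' and 'model' are present"
    elif status == 404:
        hint = "use a known upscale model name"
    elif status >= 500:
        hint = "the upscale itself failed on the server"
    else:
        hint = "unexpected status"
    return f"{status} {kind}: {message} ({hint})"
```

Bodies that are not valid JSON fall back to a generic message instead of raising.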
POST /v1/audio/speech

Speech Generation API. You provide a text input and receive an audio file. This API uses Kokoros as the backend.

Note: The model to use is called kokoro-v1. No other model is supported at the moment.

Limitations: Only mp3, wav, opus, and pcm are supported. Streaming is supported in audio (pcm) mode.
| Parameter | Required | Description | Status |
|---|---|---|---|
| input | Yes | The text to speak. | |
| model | Yes | The model to use (e.g., kokoro-v1). | |
| speed | No | Speaking speed. Default: 1.0. | |
| voice | No | The voice to use. All OpenAI-defined voices can be used (alloy, ash, …), as well as those defined by the kokoro model (af_sky, am_echo, …). Default: shimmer | |
| response_format | No | Format of the response. mp3, wav, opus, and pcm are supported. Default: mp3 | |
| stream_format | No | If set, the response will be streamed. Only audio is supported, which will output pcm audio. Default: not set | |
=== "Bash"
```bash
curl -X POST http://localhost:13305/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"model": "kokoro-v1",
"input": "Lemonade can speak!",
"speed": 1.0,
"response_format": "mp3"
}'
```
The generated audio file is returned as-is.
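Since the response body is the audio itself, a client only needs to write the bytes to a file. A minimal Python sketch using the standard library; the function names are hypothetical, the defaults mirror the parameter table above, and `synthesize_to_file` needs a running server.

```python
import json
import urllib.request

SUPPORTED_FORMATS = {"mp3", "wav", "opus", "pcm"}


def build_speech_request(text: str, voice: str = "shimmer",
                         response_format: str = "mp3", speed: float = 1.0,
                         base_url: str = "http://localhost:13305/v1"):
    """Build (but do not send) a POST request for /audio/speech."""
    if response_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format}")
    payload = {
        "model": "kokoro-v1",
        "input": text,
        "voice": voice,
        "speed": speed,
        "response_format": response_format,
    }
    return urllib.request.Request(
        f"{base_url}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text: str, path: str, **options) -> None:
    """Send the request and write the returned audio bytes to path."""
    request = build_speech_request(text, **options)
    with urllib.request.urlopen(request) as resp, open(path, "wb") as out_file:
        out_file.write(resp.read())
```

For example, `synthesize_to_file("Lemonade can speak!", "speech.wav", response_format="wav")` would save a WAV file.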
GET /v1/models

Returns a list of models available on the server in an OpenAI-compatible format. Each model object includes extended fields like checkpoint, recipe, size, downloaded, and labels.
By default, only models available locally (downloaded) are shown, matching OpenAI API behavior.
| Parameter | Required | Description |
|---|---|---|
| show_all | No | If set to true, returns all models from the catalog including those not yet downloaded. Defaults to false. |
```bash
# Show only downloaded models (OpenAI-compatible)
curl http://localhost:13305/v1/models
# Show all models including not-yet-downloaded (extended usage)
curl "http://localhost:13305/v1/models?show_all=true"
```
```json
{
  "object": "list",
  "data": [
    {
      "id": "Qwen3-0.6B-GGUF",
      "created": 1744173590,
      "object": "model",
      "owned_by": "lemonade",
      "checkpoint": "unsloth/Qwen3-0.6B-GGUF:Q4_0",
      "recipe": "llamacpp",
      "size": 0.38,
      "downloaded": true,
      "suggested": true,
      "labels": ["reasoning"]
    },
    {
      "id": "Gemma-3-4b-it-GGUF",
      "created": 1744173590,
      "object": "model",
      "owned_by": "lemonade",
      "checkpoint": "ggml-org/gemma-3-4b-it-GGUF:Q4_K_M",
      "recipe": "llamacpp",
      "size": 3.61,
      "downloaded": true,
      "suggested": true,
      "labels": ["hot", "vision"]
    },
    {
      "id": "SD-Turbo",
      "created": 1744173590,
      "object": "model",
      "owned_by": "lemonade",
      "checkpoint": "stabilityai/sd-turbo:sd_turbo.safetensors",
      "recipe": "sd-cpp",
      "size": 5.2,
      "downloaded": true,
      "suggested": true,
      "labels": ["image"],
      "image_defaults": {
        "steps": 4,
        "cfg_scale": 1.0,
        "width": 512,
        "height": 512
      }
    }
  ]
}
```
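The extended fields make it easy to pick a model programmatically. The following Python sketch works on a parsed response in the shape shown above; `models_with_label` and `image_defaults_for` are hypothetical helper names.

```python
def models_with_label(listing: dict, label: str) -> list:
    """Return the IDs of all listed models that carry the given label."""
    return [
        model["id"]
        for model in listing.get("data", [])
        if label in model.get("labels", [])
    ]


def image_defaults_for(listing: dict, model_id: str):
    """Return the image_defaults of a model, or None if absent or not listed."""
    for model in listing.get("data", []):
        if model["id"] == model_id:
            return model.get("image_defaults")
    return None
```

For instance, `models_with_label(listing, "vision")` selects candidates for image input, and `image_defaults_for(listing, "SD-Turbo")` supplies default values for steps and cfg_scale.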
Field Descriptions:

- object - Type of response object, always "list"
- data - Array of model objects with the following fields:
    - id - Model identifier (used for loading and inference requests)
    - created - Unix timestamp of when the model entry was created
    - object - Type of object, always "model"
    - owned_by - Owner of the model, always "lemonade"
    - checkpoint - Full checkpoint identifier on Hugging Face
    - recipe - Backend/device recipe used to load the model (e.g., "ryzenai-llm", "llamacpp", "flm")
    - size - Model size in GB (omitted for models without size information)
    - downloaded - Boolean indicating if the model is downloaded and available locally
    - suggested - Boolean indicating if the model is recommended for general use
    - labels - Array of tags describing the model (e.g., "hot", "reasoning", "vision", "embeddings", "reranking", "coding", "tool-calling", "image")
    - image_defaults - (Image models only) Default generation parameters for the model:
        - steps - Number of inference steps (e.g., 4 for turbo models, 20 for standard models)
        - cfg_scale - Classifier-free guidance scale (e.g., 1.0 for turbo models, 7.5 for standard models)
        - width - Default image width in pixels
        - height - Default image height in pixels

GET /v1/models/{model_id}

Retrieve a specific model by its ID. Returns the same model object format as the list endpoint above.
| Parameter | Required | Description |
|---|---|---|
| model_id | Yes | The ID of the model to retrieve. Must match one of the model IDs from the models list. |
```bash
curl http://localhost:13305/v1/models/Qwen3-0.6B-GGUF
```
Returns a single model object with the same fields as described in the models list endpoint above.
```json
{
  "id": "Qwen3-0.6B-GGUF",
  "created": 1744173590,
  "object": "model",
  "owned_by": "lemonade",
  "checkpoint": "unsloth/Qwen3-0.6B-GGUF:Q4_0",
  "recipe": "llamacpp",
  "size": 0.38,
  "downloaded": true,
  "suggested": true,
  "labels": ["reasoning"],
  "recipe_options": {
    "ctx_size": 8192,
    "llamacpp_args": "--no-mmap",
    "llamacpp_backend": "rocm"
  }
}
```
If the model is not found, the endpoint returns a 404 error:
```json
{
  "error": {
    "message": "Model Qwen3-0.6B-GGUF has not been found",
    "type": "not_found"
  }
}
```