OpenAI-Compatible API
This spec defines Lemonade's implementation of the OpenAI API.
| Method | Endpoint | Description | Modality |
|---|---|---|---|
| POST | /v1/chat/completions | Chat Completions | messages -> completion |
| POST | /v1/completions | Text Completions | prompt -> completion |
| POST | /v1/embeddings | Embeddings | text -> vector representations |
| POST | /v1/responses | Responses API | prompt/messages -> event |
| POST | /v1/audio/transcriptions | Audio Transcription | audio file -> text |
| POST | /v1/audio/speech | Text to Speech | text -> audio |
| WS | /realtime | Realtime Audio Transcription, OpenAI SDK compatible | streaming audio -> text |
| POST | /v1/images/generations | Image Generation | prompt -> image |
| POST | /v1/images/edits | Image Editing | image + prompt -> edited image |
| POST | /v1/images/variations | Image Variations | image -> varied image |
| POST | /v1/images/upscale | Image Upscaling | image + ESRGAN model -> upscaled image |
| GET | /v1/models | List models available locally | n/a |
| GET | /v1/models/{model_id} | Retrieve a specific model by ID | n/a |
POST /v1/chat/completions
Chat Completions API. You provide a list of messages and receive a completion. This API will also load the model if it is not already loaded.
Parameters
| Parameter | Required | Description | Status |
|---|---|---|---|
| messages | Yes | Array of messages in the conversation. Each message should have a role ("user" or "assistant") and content (the message text). | |
| model | Yes | The model to use for the completion. | |
| stream | No | If true, tokens will be sent as they are generated. If false, the response will be sent as a single message once complete. Defaults to false. | |
| stop | No | Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence. Can be a string or an array of strings. | |
| logprobs | No | Include log probabilities of the output tokens. If true, returns the log probability of each output token. Defaults to false. | |
| temperature | No | What sampling temperature to use. | |
| repeat_penalty | No | Number between 1.0 and 2.0. 1.0 means no penalty. Higher values discourage repetition. | |
| top_k | No | Integer that controls the number of top tokens to consider during sampling. | |
| top_p | No | Float between 0.0 and 1.0 that controls the cumulative probability of top tokens to consider during nucleus sampling. | |
| tools | No | A list of tools the model may call. | |
| max_tokens | No | An upper bound for the number of tokens that can be generated for a completion. Mutually exclusive with max_completion_tokens. This value is now deprecated by OpenAI in favor of max_completion_tokens. | |
| max_completion_tokens | No | An upper bound for the number of tokens that can be generated for a completion. Mutually exclusive with max_tokens. | |
Example request
Invoke-WebRequest `
-Uri "http://localhost:13305/v1/chat/completions" `
-Method POST `
-Headers @{ "Content-Type" = "application/json" } `
-Body '{
"model": "Qwen3-0.6B-GGUF",
"messages": [
{
"role": "user",
"content": "What is the population of Paris?"
}
],
"stream": false
}'
curl -X POST http://localhost:13305/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3-0.6B-GGUF",
"messages": [
{"role": "user", "content": "What is the population of Paris?"}
],
"stream": false
}'
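The same request can be made with the official OpenAI Python SDK by pointing it at the local server. A minimal sketch, following the base_url and api_key conventions used by the Python examples later in this document:

from openai import OpenAI

# Point the OpenAI client at the local Lemonade server (no API key is required).
client = OpenAI(base_url="http://localhost:13305/api/v1", api_key="not-needed")

completion = client.chat.completions.create(
    model="Qwen3-0.6B-GGUF",
    messages=[{"role": "user", "content": "What is the population of Paris?"}],
)
print(completion.choices[0].message.content)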
Image understanding input format (OpenAI-compatible)
To send images to chat/completions, pass a messages[*].content array that mixes text and image_url items. The image can be provided as a base64 data URL (for example, from FileReader.readAsDataURL(...) in web apps).
curl -X POST http://localhost:13305/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen2.5-VL-7B-Instruct",
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "What is in this image?"},
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD..."}}
]
}
],
"stream": false
}'
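The data URL can also be built programmatically. A minimal sketch using the OpenAI Python SDK; the image path is illustrative:

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:13305/api/v1", api_key="not-needed")

# Encode a local JPEG as a base64 data URL.
with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

completion = client.chat.completions.create(
    model="Qwen2.5-VL-7B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(completion.choices[0].message.content)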
Response format
{
"id": "0",
"object": "chat.completion",
"created": 1742927481,
"model": "Qwen3-0.6B-GGUF",
"choices": [{
"index": 0,
"message": {
"role": "assistant",
"content": "Paris has a population of approximately 2.2 million people in the city proper."
},
"finish_reason": "stop"
}]
}
For streaming responses, the API returns a stream of server-sent events (note that OpenAI recommends using their streaming libraries for parsing streaming responses):
{
"id": "0",
"object": "chat.completion.chunk",
"created": 1742927481,
"model": "Qwen3-0.6B-GGUF",
"choices": [{
"index": 0,
"delta": {
"role": "assistant",
"content": "Paris"
}
}]
}
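With the OpenAI Python SDK, the stream can be consumed chunk by chunk. A sketch, assuming the same local server settings as above:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:13305/api/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="Qwen3-0.6B-GGUF",
    messages=[{"role": "user", "content": "What is the population of Paris?"}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries a delta; content may be absent on some chunks.
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)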
POST /v1/completions
Text Completions API. You provide a prompt and receive a completion. This API will also load the model if it is not already loaded.
Parameters
| Parameter | Required | Description | Status |
|---|---|---|---|
| prompt | Yes | The prompt to use for the completion. | |
| model | Yes | The model to use for the completion. | |
| stream | No | If true, tokens will be sent as they are generated. If false, the response will be sent as a single message once complete. Defaults to false. | |
| stop | No | Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence. Can be a string or an array of strings. | |
| echo | No | Echo back the prompt in addition to the completion. Only available in non-streaming mode. | |
| logprobs | No | Include log probabilities of the output tokens. If true, returns the log probability of each output token. Defaults to false. Only available when stream=False. | |
| temperature | No | What sampling temperature to use. | |
| repeat_penalty | No | Number between 1.0 and 2.0. 1.0 means no penalty. Higher values discourage repetition. | |
| top_k | No | Integer that controls the number of top tokens to consider during sampling. | |
| top_p | No | Float between 0.0 and 1.0 that controls the cumulative probability of top tokens to consider during nucleus sampling. | |
| max_tokens | No | An upper bound for the number of tokens that can be generated for a completion, including input tokens. | |
Example request
Invoke-WebRequest -Uri "http://localhost:13305/v1/completions" `
-Method POST `
-Headers @{ "Content-Type" = "application/json" } `
-Body '{
"model": "Qwen3-0.6B-GGUF",
"prompt": "What is the population of Paris?",
"stream": false
}'
curl -X POST http://localhost:13305/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3-0.6B-GGUF",
"prompt": "What is the population of Paris?",
"stream": false
}'
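The same request through the OpenAI Python SDK, as a sketch:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:13305/api/v1", api_key="not-needed")

completion = client.completions.create(
    model="Qwen3-0.6B-GGUF",
    prompt="What is the population of Paris?",
)
print(completion.choices[0].text)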
Response format
The following format is used for both streaming and non-streaming responses:
{
"id": "0",
"object": "text_completion",
"created": 1742927481,
"model": "Qwen3-0.6B-GGUF",
"choices": [{
"index": 0,
"text": "Paris has a population of approximately 2.2 million people in the city proper.",
"finish_reason": "stop"
}]
}
POST /v1/embeddings
Embeddings API. You provide input text and receive vector representations (embeddings) that can be used for semantic search, clustering, and similarity comparisons. This API will also load the model if it is not already loaded.
Note: This endpoint is only available for models using the `llamacpp` or `flm` recipes. ONNX models (OGA recipes) do not support embeddings.
Parameters
| Parameter | Required | Description | Status |
|---|---|---|---|
| input | Yes | The input text or array of texts to embed. Can be a string or an array of strings. | |
| model | Yes | The model to use for generating embeddings. | |
| encoding_format | No | The format to return embeddings in. Supported values: "float" (default), "base64". | |
Example request
Invoke-WebRequest `
-Uri "http://localhost:13305/v1/embeddings" `
-Method POST `
-Headers @{ "Content-Type" = "application/json" } `
-Body '{
"model": "nomic-embed-text-v1-GGUF",
"input": ["Hello, world!", "How are you?"],
"encoding_format": "float"
}'
curl -X POST http://localhost:13305/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "nomic-embed-text-v1-GGUF",
"input": ["Hello, world!", "How are you?"],
"encoding_format": "float"
}'
Response format
{
"object": "list",
"data": [
{
"object": "embedding",
"index": 0,
"embedding": [0.0234, -0.0567, 0.0891, ...]
},
{
"object": "embedding",
"index": 1,
"embedding": [0.0456, -0.0678, 0.1234, ...]
}
],
"model": "nomic-embed-text-v1-GGUF",
"usage": {
"prompt_tokens": 12,
"total_tokens": 12
}
}
Field Descriptions:
- `object` - Type of response object, always `"list"`
- `data` - Array of embedding objects
  - `object` - Type of embedding object, always `"embedding"`
  - `index` - Index position of the input text in the request
  - `embedding` - Vector representation as an array of floats
- `model` - Model identifier used to generate the embeddings
- `usage` - Token usage statistics
  - `prompt_tokens` - Number of tokens in the input
  - `total_tokens` - Total tokens processed
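As a usage sketch, the returned vectors can be compared directly for semantic similarity. The example below computes cosine similarity between two embeddings with the OpenAI Python SDK; the inputs are illustrative:

import math
from openai import OpenAI

client = OpenAI(base_url="http://localhost:13305/api/v1", api_key="not-needed")

response = client.embeddings.create(
    model="nomic-embed-text-v1-GGUF",
    input=["Hello, world!", "How are you?"],
)
a = response.data[0].embedding
b = response.data[1].embedding

# Cosine similarity: dot product divided by the product of the vector norms.
dot = sum(x * y for x, y in zip(a, b))
norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
print(dot / norm)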
POST /v1/responses
Responses API. You provide an input and receive a response. This API will also load the model if it is not already loaded.
Parameters
| Parameter | Required | Description | Status |
|---|---|---|---|
| input | Yes | A list of dictionaries or a string input for the model to respond to. | |
| model | Yes | The model to use for the response. | |
| max_output_tokens | No | The maximum number of output tokens to generate. | |
| temperature | No | What sampling temperature to use. | |
| repeat_penalty | No | Number between 1.0 and 2.0. 1.0 means no penalty. Higher values discourage repetition. | |
| top_k | No | Integer that controls the number of top tokens to consider during sampling. | |
| top_p | No | Float between 0.0 and 1.0 that controls the cumulative probability of top tokens to consider during nucleus sampling. | |
| stream | No | If true, tokens will be sent as they are generated. If false, the response will be sent as a single message once complete. Defaults to false. | |
Streaming Events
The Responses API uses semantic events for streaming. Each event is typed with a predefined schema, so you can listen for the events you care about. Our initial implementation only supports:
- `response.created`
- `response.output_text.delta`
- `response.completed`
For a full list of event types, see the API reference for streaming.
Example request
Invoke-WebRequest -Uri "http://localhost:13305/v1/responses" `
-Method POST `
-Headers @{ "Content-Type" = "application/json" } `
-Body '{
"model": "Llama-3.2-1B-Instruct-Hybrid",
"input": "What is the population of Paris?",
"stream": false
}'
curl -X POST http://localhost:13305/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "Llama-3.2-1B-Instruct-Hybrid",
"input": "What is the population of Paris?",
"stream": false
}'
Response format
{
"id": "0",
"created_at": 1746225832.0,
"model": "Llama-3.2-1B-Instruct-Hybrid",
"object": "response",
"output": [{
"id": "0",
"content": [{
"annotations": [],
"text": "Paris has a population of approximately 2.2 million people in the city proper."
}]
}]
}
For streaming responses, the API returns a series of events. Refer to the OpenAI streaming guide for details.
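A sketch of consuming the supported streaming events with the OpenAI Python SDK:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:13305/api/v1", api_key="not-needed")

stream = client.responses.create(
    model="Llama-3.2-1B-Instruct-Hybrid",
    input="What is the population of Paris?",
    stream=True,
)
for event in stream:
    # Only response.created, response.output_text.delta, and response.completed
    # are emitted by the current implementation.
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)
    elif event.type == "response.completed":
        print()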
POST /v1/audio/transcriptions
Audio Transcription API. You provide an audio file and receive a text transcription. This API will also load the model if it is not already loaded.
Note: This endpoint uses whisper.cpp as the backend. Whisper models are automatically downloaded when first used.
Limitations: Only the `wav` audio format and the `json` response format are currently supported.
Parameters
| Parameter | Required | Description | Status |
|---|---|---|---|
| file | Yes | The audio file to transcribe. Supported formats: wav. | |
| model | Yes | The Whisper model to use for transcription (e.g., Whisper-Tiny, Whisper-Base, Whisper-Small). | |
| language | No | The language of the audio (ISO 639-1 code, e.g., en, es, fr). If not specified, Whisper will auto-detect the language. | |
| response_format | No | The format of the response. Currently only json is supported. | |
Example request
curl -X POST http://localhost:13305/v1/audio/transcriptions ^
-F "file=@C:\path\to\audio.wav" ^
-F "model=Whisper-Tiny"
curl -X POST http://localhost:13305/v1/audio/transcriptions \
-F "file=@/path/to/audio.wav" \
-F "model=Whisper-Tiny"
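The same request through the OpenAI Python SDK, as a sketch (the audio path is illustrative):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:13305/api/v1", api_key="not-needed")

with open("audio.wav", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="Whisper-Tiny",
        file=audio_file,
    )
print(transcription.text)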
Response format
{
"text": "Hello, this is a sample transcription of the audio file."
}
Field Descriptions:
- `text` - The transcribed text from the audio file
WS /realtime
Realtime Audio Transcription API via WebSocket (OpenAI SDK compatible). Stream audio from a microphone and receive transcriptions in real-time with Voice Activity Detection (VAD).
Limitations: Only 16kHz mono PCM16 audio format is supported. Uses the same Whisper models as the HTTP transcription endpoint.
Connection
The WebSocket server runs on a dynamically assigned port. Discover the port via the /v1/health endpoint (websocket_port field), then connect with the model name:
ws://localhost:<websocket_port>/realtime?model=Whisper-Tiny
Upon connection, the server sends a session.created message with a session ID.
Client → Server Messages
| Message Type | Description |
|---|---|
| session.update | Configure the session (set model, VAD settings, or disable turn detection) |
| input_audio_buffer.append | Send audio data (base64-encoded PCM16) |
| input_audio_buffer.commit | Force transcription of buffered audio |
| input_audio_buffer.clear | Clear audio buffer without transcribing |
Server → Client Messages
| Message Type | Description |
|---|---|
| session.created | Session established, contains session ID |
| session.updated | Session configuration updated |
| input_audio_buffer.speech_started | VAD detected speech start |
| input_audio_buffer.speech_stopped | VAD detected speech end, transcription triggered |
| input_audio_buffer.committed | Audio buffer committed for transcription |
| input_audio_buffer.cleared | Audio buffer cleared |
| conversation.item.input_audio_transcription.delta | Interim/partial transcription (replaceable) |
| conversation.item.input_audio_transcription.completed | Final transcription result |
| error | Error message |
Example: Configure Session
{
"type": "session.update",
"session": {
"model": "Whisper-Tiny"
}
}
Example: Send Audio
{
"type": "input_audio_buffer.append",
"audio": "<base64-encoded PCM16 audio>"
}
Audio should be:
- 16kHz sample rate
- Mono (single channel)
- 16-bit signed integer (PCM16)
- Base64 encoded
- Sent in chunks (~85ms recommended)
Example: Transcription Result
{
"type": "conversation.item.input_audio_transcription.completed",
"transcript": "Hello, this is a test transcription."
}
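A minimal client sketch using the websockets Python package; the WAV path is illustrative, and the port is discovered via /v1/health as described above. See examples/realtime_transcription.py for the complete version:

import asyncio
import base64
import json
import wave

import requests
import websockets

async def main():
    # Discover the WebSocket port from the health endpoint.
    port = requests.get("http://localhost:13305/v1/health").json()["websocket_port"]
    uri = f"ws://localhost:{port}/realtime?model=Whisper-Tiny"

    async with websockets.connect(uri) as ws:
        # The first message from the server is session.created.
        print(json.loads(await ws.recv())["type"])

        # Stream a 16kHz mono PCM16 WAV file in ~85ms chunks (1360 frames).
        with wave.open("speech_16k_mono.wav", "rb") as wav:
            while True:
                frames = wav.readframes(1360)
                if not frames:
                    break
                await ws.send(json.dumps({
                    "type": "input_audio_buffer.append",
                    "audio": base64.b64encode(frames).decode("ascii"),
                }))
                await asyncio.sleep(0.085)

        # Wait for the final transcription triggered by VAD on speech end.
        while True:
            event = json.loads(await ws.recv())
            if event["type"] == "conversation.item.input_audio_transcription.completed":
                print(event["transcript"])
                break

asyncio.run(main())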
VAD Configuration
VAD settings can be configured via session.update:
{
"type": "session.update",
"session": {
"model": "Whisper-Tiny",
"turn_detection": {
"threshold": 0.01,
"silence_duration_ms": 800,
"prefix_padding_ms": 250
}
}
}
| Parameter | Default | Description |
|---|---|---|
| threshold | 0.01 | RMS energy threshold for speech detection |
| silence_duration_ms | 800 | Silence duration to trigger speech end |
| prefix_padding_ms | 250 | Minimum speech duration before triggering |
Set turn_detection to null to disable server-side VAD and use explicit commits instead:
{
"type": "session.update",
"session": {
"model": "Whisper-Tiny",
"turn_detection": null
}
}
Code Examples
See the examples/ directory for a complete, runnable example:
- `realtime_transcription.py` - Python CLI for microphone streaming
# Stream from microphone
python examples/realtime_transcription.py --model Whisper-Tiny
Integration Notes
- Audio Format: Server expects 16kHz mono PCM16. Higher sample rates must be downsampled client-side.
- Chunk Size: Send audio in ~85-256ms chunks for optimal latency/efficiency.
- VAD Behavior: Server automatically detects speech boundaries and triggers transcription on speech end.
- Manual Commit: Set `turn_detection` to `null`, then use `input_audio_buffer.commit` to force transcription. In this mode the server buffers audio but does not emit VAD or interim transcription events.
- Clear Buffer: Use `input_audio_buffer.clear` to discard audio without transcribing.
- Chunking: We are still tuning the chunking to balance latency vs. accuracy.
POST /v1/images/generations
Image Generation API. You provide a text prompt and receive a generated image. This API uses stable-diffusion.cpp as the backend.
Note: Image generation uses Stable Diffusion models. Available models include `SD-Turbo` (fast, ~4 steps), `SDXL-Turbo`, `SD-1.5`, and `SDXL-Base-1.0`.
Performance: CPU inference takes ~4-5 minutes per image. GPU (Vulkan) is faster but may have compatibility issues with some hardware.
Parameters
| Parameter | Required | Description | Status |
|---|---|---|---|
| prompt | Yes | The text description of the image to generate. | |
| model | Yes | The Stable Diffusion model to use (e.g., SD-Turbo, SDXL-Turbo). | |
| size | No | The size of the generated image. Format: WIDTHxHEIGHT (e.g., 512x512, 256x256). Default: 512x512. | |
| n | No | Number of images to generate. Currently only 1 is supported. | |
| response_format | No | Format of the response. Only b64_json (base64-encoded image) is supported. | |
| steps | No | Number of inference steps. SD-Turbo works well with 4 steps. Default varies by model. | |
| cfg_scale | No | Classifier-free guidance scale. SD-Turbo uses low values (~1.0). Default varies by model. | |
| seed | No | Random seed for reproducibility. If not specified, a random seed is used. | |
Example request
curl -X POST http://localhost:13305/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"model": "SD-Turbo",
"prompt": "A serene mountain landscape at sunset",
"size": "512x512",
"steps": 4,
"response_format": "b64_json"
}'
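A sketch of decoding the response in Python. The generated image is returned base64-encoded under data[0].b64_json, the same layout used by the /v1/images/upscale response shown later:

import base64

import requests

response = requests.post("http://localhost:13305/v1/images/generations", json={
    "model": "SD-Turbo",
    "prompt": "A serene mountain landscape at sunset",
    "size": "512x512",
    "steps": 4,
    "response_format": "b64_json",
})

# Decode the base64 PNG and save it to disk.
image_b64 = response.json()["data"][0]["b64_json"]
with open("generated.png", "wb") as f:
    f.write(base64.b64decode(image_b64))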
POST /v1/images/edits
Image Editing API. You provide a source image and a text prompt describing the desired change, and receive an edited image. This API uses stable-diffusion.cpp as the backend.
Note: This endpoint accepts `multipart/form-data` requests (not JSON). Use editing-capable models such as `Flux-2-Klein-4B` or `SD-Turbo`.
Performance: CPU inference takes several minutes per image. GPU (ROCm) is significantly faster.
Parameters
| Parameter | Required | Description | Status |
|---|---|---|---|
| model | Yes | The Stable Diffusion model to use (e.g., Flux-2-Klein-4B, SD-Turbo). | |
| image | Yes | The source image file to edit (PNG). Sent as a file in multipart/form-data. | |
| prompt | Yes | A text description of the desired edit. | |
| mask | No | An optional mask image (PNG). White areas indicate regions to edit; black areas are preserved. | |
| size | No | The size of the output image. Format: WIDTHxHEIGHT (e.g., 512x512). Default: 512x512. | |
| n | No | Number of images to generate. Allowed range: 1–10. Default: 1. Values outside this range are rejected with 400 Bad Request. | |
| response_format | No | Format of the response. Only b64_json (base64-encoded image) is supported. | |
| steps | No | Number of inference steps. Default varies by model. | |
| cfg_scale | No | Classifier-free guidance scale. Default varies by model. | |
| seed | No | Random seed for reproducibility. | |
| user | No | OpenAI API compatibility field. Accepted but not forwarded to the backend. | |
| background | No | OpenAI API compatibility field. Accepted but not forwarded to the backend. | |
| quality | No | OpenAI API compatibility field. Accepted but not forwarded to the backend. | |
| input_fidelity | No | OpenAI API compatibility field. Accepted but not forwarded to the backend. | |
| output_compression | No | OpenAI API compatibility field. Accepted; silently ignored by the backend. | |
Example request
curl -X POST http://localhost:13305/v1/images/edits \
-F "model=Flux-2-Klein-4B" \
-F "prompt=Add a red barn and mountains in the background, photorealistic" \
-F "size=512x512" \
-F "n=1" \
-F "response_format=b64_json" \
-F "image=@/path/to/source_image.png"
from openai import OpenAI
client = OpenAI(base_url="http://localhost:13305/api/v1", api_key="not-needed")
with open("source_image.png", "rb") as image_file:
response = client.images.edit(
model="Flux-2-Klein-4B",
image=image_file,
prompt="Add a red barn and mountains in the background, photorealistic",
size="512x512",
)
import base64
image_data = base64.b64decode(response.data[0].b64_json)
open("edited_image.png", "wb").write(image_data)
POST /v1/images/variations
Image Variations API. You provide a source image and receive a variation of it. This API uses stable-diffusion.cpp as the backend.
Note: This endpoint accepts `multipart/form-data` requests (not JSON). Unlike `/images/edits`, a `prompt` parameter is not supported and will be ignored — the model generates a variation based solely on the input image.
Performance: CPU inference takes several minutes per image. GPU (ROCm) is significantly faster.
Parameters
| Parameter | Required | Description | Status |
|---|---|---|---|
| model | Yes | The Stable Diffusion model to use (e.g., Flux-2-Klein-4B, SD-Turbo). | |
| image | Yes | The source image file (PNG). Sent as a file in multipart/form-data. | |
| size | No | The size of the output image. Format: WIDTHxHEIGHT (e.g., 512x512). Default: 512x512. | |
| n | No | Number of variations to generate. Integer between 1 and 10 inclusive. Default: 1. Values outside this range result in a 400 Bad Request error. | |
| response_format | No | Format of the response. Only b64_json (base64-encoded image) is supported. | |
| user | No | OpenAI API compatibility field. Accepted but not forwarded to the backend. | |
Example request
curl -X POST http://localhost:13305/v1/images/variations \
-F "model=Flux-2-Klein-4B" \
-F "size=512x512" \
-F "n=1" \
-F "response_format=b64_json" \
-F "image=@/path/to/source_image.png"
from openai import OpenAI
client = OpenAI(base_url="http://localhost:13305/api/v1", api_key="not-needed")
with open("source_image.png", "rb") as image_file:
response = client.images.create_variation(
model="Flux-2-Klein-4B",
image=image_file,
size="512x512",
n=1,
)
import base64
image_data = base64.b64decode(response.data[0].b64_json)
open("variation.png", "wb").write(image_data)
POST /v1/images/upscale
Image Upscaling API. You provide a base64-encoded image and a Real-ESRGAN model name, and receive a 4x upscaled image. This API uses the sd-cli binary from stable-diffusion.cpp to perform super-resolution.
Note: Available upscale models are `RealESRGAN-x4plus` (general-purpose, 64 MB) and `RealESRGAN-x4plus-anime` (optimized for anime-style art, 17 MB). Both produce a 4x resolution increase (e.g., 256x256 → 1024x1024).
Note: Unlike `/images/edits` and `/images/variations`, this endpoint accepts a JSON body (not multipart/form-data). The image must be provided as a base64-encoded string.
Parameters
| Parameter | Required | Description | Status |
|---|---|---|---|
| image | Yes | Base64-encoded PNG image to upscale. | |
| model | Yes | The ESRGAN model to use (e.g., RealESRGAN-x4plus, RealESRGAN-x4plus-anime). | |
Example request
A typical workflow is to generate an image first, then upscale it:
# Step 1: Generate an image and save the base64 response
RESPONSE=$(curl -s -X POST http://localhost:13305/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"model": "SD-Turbo",
"prompt": "A serene mountain landscape at sunset",
"size": "512x512",
"steps": 4,
"response_format": "b64_json"
}')
# Step 2: Build the upscale JSON payload and pipe it to curl via stdin
# (base64 images are too large for command-line interpolation)
echo "$RESPONSE" | python3 -c "
import sys, json
b64 = json.load(sys.stdin)['data'][0]['b64_json']
print(json.dumps({'image': b64, 'model': 'RealESRGAN-x4plus'}))
" | curl -X POST http://localhost:13305/v1/images/upscale \
-H "Content-Type: application/json" \
-d @-
# Step 1: Generate an image
$genResponse = Invoke-WebRequest `
-Uri "http://localhost:13305/v1/images/generations" `
-Method POST `
-Headers @{ "Content-Type" = "application/json" } `
-Body '{
"model": "SD-Turbo",
"prompt": "A serene mountain landscape at sunset",
"size": "512x512",
"steps": 4,
"response_format": "b64_json"
}'
# Step 2: Extract the base64 image
$imageB64 = ($genResponse.Content | ConvertFrom-Json).data[0].b64_json
# Step 3: Upscale the image with Real-ESRGAN
$body = @{ image = $imageB64; model = "RealESRGAN-x4plus" } | ConvertTo-Json
Invoke-WebRequest `
-Uri "http://localhost:13305/v1/images/upscale" `
-Method POST `
-Headers @{ "Content-Type" = "application/json" } `
-Body $body
import requests
import base64
BASE_URL = "http://localhost:13305/api/v1"
# Step 1: Generate an image
gen_response = requests.post(f"{BASE_URL}/images/generations", json={
"model": "SD-Turbo",
"prompt": "A serene mountain landscape at sunset",
"size": "512x512",
"steps": 4,
"response_format": "b64_json",
})
image_b64 = gen_response.json()["data"][0]["b64_json"]
# Step 2: Upscale the image with Real-ESRGAN (512x512 -> 2048x2048)
upscale_response = requests.post(f"{BASE_URL}/images/upscale", json={
"image": image_b64,
"model": "RealESRGAN-x4plus",
})
# Step 3: Save the upscaled image to a file
upscaled_b64 = upscale_response.json()["data"][0]["b64_json"]
with open("upscaled.png", "wb") as f:
f.write(base64.b64decode(upscaled_b64))
Response format
{
"created": 1742927481,
"data": [
{
"b64_json": "<base64-encoded upscaled PNG>"
}
]
}
Field Descriptions:
- `created` - Unix timestamp of when the upscaled image was generated
- `data` - Array containing the upscaled image
  - `b64_json` - Base64-encoded PNG of the upscaled image
Error responses
| Status Code | Condition | Example |
|---|---|---|
| 400 | Missing image field | {"error": {"message": "Missing 'image' field (base64 encoded)", "type": "invalid_request_error"}} |
| 400 | Missing model field | {"error": {"message": "Missing 'model' field", "type": "invalid_request_error"}} |
| 404 | Unknown model name | {"error": {"message": "Upscale model not found: bad-model", "type": "invalid_request_error"}} |
| 500 | Upscale failed | {"error": {"message": "ESRGAN upscale failed", "type": "server_error"}} |
POST /v1/audio/speech
Speech Generation API. You provide a text input and receive an audio file. This API uses Kokoros as the backend.
Note: The model to use is called `kokoro-v1`. No other model is supported at the moment.
Limitations: Only `mp3`, `wav`, `opus`, and `pcm` are supported. Streaming is supported in `audio` (pcm) mode.
Parameters
| Parameter | Required | Description | Status |
|---|---|---|---|
| input | Yes | The text to speak. | |
| model | Yes | The model to use (e.g., kokoro-v1). | |
| speed | No | Speaking speed. Default: 1.0. | |
| voice | No | The voice to use. All OpenAI-defined voices can be used (alloy, ash, ...), as well as those defined by the kokoro model (af_sky, am_echo, ...). Default: shimmer. | |
| response_format | No | Format of the response. mp3, wav, opus, and pcm are supported. Default: mp3. | |
| stream_format | No | If set, the response will be streamed. Only audio is supported, which outputs pcm audio. Default: not set. | |
Example request
curl -X POST http://localhost:13305/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"model": "kokoro-v1",
"input": "Lemonade can speak!",
"speed": 1.0,
"response_format": "mp3"
}'
Response format
The generated audio file is returned as-is.
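A sketch using the OpenAI Python SDK to write the generated audio to a file; the voice choice is illustrative:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:13305/api/v1", api_key="not-needed")

# Generate speech and stream the audio bytes straight to a file.
with client.audio.speech.with_streaming_response.create(
    model="kokoro-v1",
    voice="af_sky",
    input="Lemonade can speak!",
    response_format="mp3",
) as response:
    response.stream_to_file("speech.mp3")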
GET /v1/models
Returns a list of models available on the server in an OpenAI-compatible format. Each model object includes extended fields like checkpoint, recipe, size, downloaded, and labels.
By default, only models available locally (downloaded) are shown, matching OpenAI API behavior.
Parameters
| Parameter | Required | Description |
|---|---|---|
| show_all | No | If set to true, returns all models from the catalog, including those not yet downloaded. Defaults to false. |
Example request
# Show only downloaded models (OpenAI-compatible)
curl http://localhost:13305/v1/models
# Show all models including not-yet-downloaded (extended usage)
curl http://localhost:13305/v1/models?show_all=true
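Listing models through the OpenAI Python SDK, as a sketch:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:13305/api/v1", api_key="not-needed")

# Iterate over locally available models.
for model in client.models.list():
    print(model.id)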
Response format
{
"object": "list",
"data": [
{
"id": "Qwen3-0.6B-GGUF",
"created": 1744173590,
"object": "model",
"owned_by": "lemonade",
"checkpoint": "unsloth/Qwen3-0.6B-GGUF:Q4_0",
"recipe": "llamacpp",
"size": 0.38,
"downloaded": true,
"suggested": true,
"labels": ["reasoning"]
},
{
"id": "Gemma-3-4b-it-GGUF",
"created": 1744173590,
"object": "model",
"owned_by": "lemonade",
"checkpoint": "ggml-org/gemma-3-4b-it-GGUF:Q4_K_M",
"recipe": "llamacpp",
"size": 3.61,
"downloaded": true,
"suggested": true,
"labels": ["hot", "vision"]
},
{
"id": "SD-Turbo",
"created": 1744173590,
"object": "model",
"owned_by": "lemonade",
"checkpoint": "stabilityai/sd-turbo:sd_turbo.safetensors",
"recipe": "sd-cpp",
"size": 5.2,
"downloaded": true,
"suggested": true,
"labels": ["image"],
"image_defaults": {
"steps": 4,
"cfg_scale": 1.0,
"width": 512,
"height": 512
}
}
]
}
Field Descriptions:
- `object` - Type of response object, always `"list"`
- `data` - Array of model objects with the following fields:
  - `id` - Model identifier (used for loading and inference requests)
  - `created` - Unix timestamp of when the model entry was created
  - `object` - Type of object, always `"model"`
  - `owned_by` - Owner of the model, always `"lemonade"`
  - `checkpoint` - Full checkpoint identifier on Hugging Face
  - `recipe` - Backend/device recipe used to load the model (e.g., `"ryzenai-llm"`, `"llamacpp"`, `"flm"`)
  - `size` - Model size in GB (omitted for models without size information)
  - `downloaded` - Boolean indicating if the model is downloaded and available locally
  - `suggested` - Boolean indicating if the model is recommended for general use
  - `labels` - Array of tags describing the model (e.g., `"hot"`, `"reasoning"`, `"vision"`, `"embeddings"`, `"reranking"`, `"coding"`, `"tool-calling"`, `"image"`)
  - `image_defaults` - (Image models only) Default generation parameters for the model:
    - `steps` - Number of inference steps (e.g., 4 for turbo models, 20 for standard models)
    - `cfg_scale` - Classifier-free guidance scale (e.g., 1.0 for turbo models, 7.5 for standard models)
    - `width` - Default image width in pixels
    - `height` - Default image height in pixels
GET /v1/models/{model_id}
Retrieve a specific model by its ID. Returns the same model object format as the list endpoint above.
Parameters
| Parameter | Required | Description |
|---|---|---|
| model_id | Yes | The ID of the model to retrieve. Must match one of the model IDs from the models list. |
Example request
curl http://localhost:13305/v1/models/Qwen3-0.6B-GGUF
Response format
Returns a single model object with the same fields as described in the models list endpoint above.
{
  "id": "Qwen3-0.6B-GGUF",
  "created": 1744173590,
  "object": "model",
  "owned_by": "lemonade",
  "checkpoint": "unsloth/Qwen3-0.6B-GGUF:Q4_0",
  "recipe": "llamacpp",
  "size": 0.38,
  "downloaded": true,
  "suggested": true,
  "labels": ["reasoning"],
  "recipe_options": {
    "ctx_size": 8192,
    "llamacpp_args": "--no-mmap",
    "llamacpp_backend": "rocm"
  }
}
Error responses
If the model is not found, the endpoint returns a 404 error:
{
"error": {
"message": "Model Qwen3-0.6B-GGUF has not been found",
"type": "not_found"
}
}