lemonade

OpenAI-Compatible API

This spec defines Lemonade’s implementation of the OpenAI API.

| Method | Endpoint | Description | Modality |
|--------|----------|-------------|----------|
| POST | /v1/chat/completions | Chat Completions | messages -> completion |
| POST | /v1/completions | Text Completions | prompt -> completion |
| POST | /v1/embeddings | Embeddings | text -> vector representations |
| POST | /v1/responses | Responses API | prompt/messages -> event |
| POST | /v1/audio/transcriptions | Audio Transcription | audio file -> text |
| POST | /v1/audio/speech | Text to speech | text -> audio |
| WS | /realtime | Realtime Audio Transcription, OpenAI SDK compatible | streaming audio -> text |
| POST | /v1/images/generations | Image Generation | prompt -> image |
| POST | /v1/images/edits | Image Editing | image + prompt -> edited image |
| POST | /v1/images/variations | Image Variations | image -> varied image |
| POST | /v1/images/upscale | Image Upscaling | image + ESRGAN model -> upscaled image |
| GET | /v1/models | List models available locally | n/a |
| GET | /v1/models/{model_id} | Retrieve a specific model by ID | n/a |

POST /v1/chat/completions

Chat Completions API. You provide a list of messages and receive a completion. This API will also load the model if it is not already loaded.

Parameters

| Parameter | Required | Description |
|-----------|----------|-------------|
| messages | Yes | Array of messages in the conversation. Each message should have a role ("user" or "assistant") and content (the message text). |
| model | Yes | The model to use for the completion. |
| stream | No | If true, tokens will be sent as they are generated. If false, the response will be sent as a single message once complete. Defaults to false. |
| stop | No | Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence. Can be a string or an array of strings. |
| logprobs | No | Include log probabilities of the output tokens. If true, returns the log probability of each output token. Defaults to false. |
| temperature | No | What sampling temperature to use. |
| repeat_penalty | No | Number between 1.0 and 2.0. 1.0 means no penalty. Higher values discourage repetition. |
| top_k | No | Integer that controls the number of top tokens to consider during sampling. |
| top_p | No | Float between 0.0 and 1.0 that controls the cumulative probability of top tokens to consider during nucleus sampling. |
| tools | No | A list of tools the model may call. |
| max_tokens | No | An upper bound for the number of tokens that can be generated for a completion. Mutually exclusive with max_completion_tokens. OpenAI has deprecated this parameter in favor of max_completion_tokens. |
| max_completion_tokens | No | An upper bound for the number of tokens that can be generated for a completion. Mutually exclusive with max_tokens. |
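
The constraints in the parameter list (at most 4 stop sequences, and max_tokens and max_completion_tokens being mutually exclusive) can be checked on the client before a request is sent. The following is a minimal sketch; `build_chat_payload` is a hypothetical helper and not part of Lemonade or the OpenAI SDK.

```python
def build_chat_payload(model, messages, *, max_tokens=None,
                       max_completion_tokens=None, stop=None, **sampling):
    """Assemble a /v1/chat/completions request body (hypothetical helper)."""
    if max_tokens is not None and max_completion_tokens is not None:
        raise ValueError("max_tokens and max_completion_tokens are mutually exclusive")

    payload = {"model": model, "messages": messages}
    if max_completion_tokens is not None:
        payload["max_completion_tokens"] = max_completion_tokens
    elif max_tokens is not None:
        payload["max_tokens"] = max_tokens

    if stop is not None:
        # stop may be a single string or a list of up to 4 strings
        stop_list = [stop] if isinstance(stop, str) else list(stop)
        if len(stop_list) > 4:
            raise ValueError("at most 4 stop sequences are allowed")
        payload["stop"] = stop

    # Optional sampling settings such as temperature, top_k, top_p, repeat_penalty
    payload.update({key: value for key, value in sampling.items() if value is not None})
    return payload
```

The returned dictionary can be serialized with `json.dumps` and sent as the request body.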

Example request

=== "PowerShell"

```powershell
Invoke-WebRequest `
  -Uri "http://localhost:13305/v1/chat/completions" `
  -Method POST `
  -Headers @{ "Content-Type" = "application/json" } `
  -Body '{
    "model": "Qwen3-0.6B-GGUF",
    "messages": [
      {
        "role": "user",
        "content": "What is the population of Paris?"
      }
    ],
    "stream": false
  }'
```

=== "Bash"

```bash
curl -X POST http://localhost:13305/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen3-0.6B-GGUF",
        "messages": [
          {"role": "user", "content": "What is the population of Paris?"}
        ],
        "stream": false
      }'
```

Image understanding input format (OpenAI-compatible)

To send images to chat/completions, pass a messages[*].content array that mixes text and image_url items. The image can be provided as a base64 data URL (for example, from FileReader.readAsDataURL(...) in web apps).

curl -X POST http://localhost:13305/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen2.5-VL-7B-Instruct",
        "messages": [
          {
            "role": "user",
            "content": [
              {"type": "text", "text": "What is in this image?"},
              {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD..."}}
            ]
          }
        ],
        "stream": false
      }'
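
In Python, the same message can be assembled from raw image bytes. This is a small sketch using only the standard library; `image_message` is a hypothetical helper name, and the default MIME type is an assumption that should match the actual file.

```python
import base64


def image_message(text, image_bytes, mime="image/jpeg"):
    """Build a user message that mixes a text item and a base64 image_url item."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ],
    }
```

The result goes into the `messages` array of the request body, for example `image_message("What is in this image?", open("photo.jpg", "rb").read())`.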

Response format

=== "Non-streaming responses"

```json
{
  "id": "0",
  "object": "chat.completion",
  "created": 1742927481,
  "model": "Qwen3-0.6B-GGUF",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Paris has a population of approximately 2.2 million people in the city proper."
    },
    "finish_reason": "stop"
  }]
}
```

=== "Streaming responses"

For streaming responses, the API returns a stream of server-sent events (OpenAI recommends using its client libraries to parse streamed responses):

```json
{
  "id": "0",
  "object": "chat.completion.chunk",
  "created": 1742927481,
  "model": "Qwen3-0.6B-GGUF",
  "choices": [{
    "index": 0,
    "delta": {
      "role": "assistant",
      "content": "Paris"
    }
  }]
}
```
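
If a client library is not available, the chunks can be reassembled by hand. The sketch below assumes the stream follows the usual OpenAI wire format, where each event is a line starting with `data:` and the stream ends with `data: [DONE]`; that framing is an assumption and is not stated in this spec. `collect_stream` is a hypothetical helper.

```python
import json


def collect_stream(lines):
    """Join the delta.content pieces from an iterable of SSE text lines."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data event
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # assumed end-of-stream sentinel
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            parts.append(choice.get("delta", {}).get("content") or "")
    return "".join(parts)
```

Any iterable of decoded text lines works as input, for example the lines of an HTTP response body read incrementally.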

POST /v1/completions

Text Completions API. You provide a prompt and receive a completion. This API will also load the model if it is not already loaded.

Parameters

| Parameter | Required | Description |
|-----------|----------|-------------|
| prompt | Yes | The prompt to use for the completion. |
| model | Yes | The model to use for the completion. |
| stream | No | If true, tokens will be sent as they are generated. If false, the response will be sent as a single message once complete. Defaults to false. |
| stop | No | Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence. Can be a string or an array of strings. |
| echo | No | Echo back the prompt in addition to the completion. Available only in non-streaming mode. |
| logprobs | No | Include log probabilities of the output tokens. If true, returns the log probability of each output token. Defaults to false. Only available when stream is false. |
| temperature | No | What sampling temperature to use. |
| repeat_penalty | No | Number between 1.0 and 2.0. 1.0 means no penalty. Higher values discourage repetition. |
| top_k | No | Integer that controls the number of top tokens to consider during sampling. |
| top_p | No | Float between 0.0 and 1.0 that controls the cumulative probability of top tokens to consider during nucleus sampling. |
| max_tokens | No | An upper bound for the number of tokens that can be generated for a completion, including input tokens. |

Example request

=== "PowerShell"

```powershell
Invoke-WebRequest -Uri "http://localhost:13305/v1/completions" `
  -Method POST `
  -Headers @{ "Content-Type" = "application/json" } `
  -Body '{
    "model": "Qwen3-0.6B-GGUF",
    "prompt": "What is the population of Paris?",
    "stream": false
  }'
```

=== "Bash"

```bash
curl -X POST http://localhost:13305/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen3-0.6B-GGUF",
        "prompt": "What is the population of Paris?",
        "stream": false
      }'
```

Response format

The following format is used for both streaming and non-streaming responses:

{
  "id": "0",
  "object": "text_completion",
  "created": 1742927481,
  "model": "Qwen3-0.6B-GGUF",
  "choices": [{
    "index": 0,
    "text": "Paris has a population of approximately 2.2 million people in the city proper.",
    "finish_reason": "stop"
  }]
}

POST /v1/embeddings

Embeddings API. You provide input text and receive vector representations (embeddings) that can be used for semantic search, clustering, and similarity comparisons. This API will also load the model if it is not already loaded.

Note: This endpoint is only available for models using the llamacpp or flm recipes. ONNX models (OGA recipes) do not support embeddings.

Parameters

| Parameter | Required | Description |
|-----------|----------|-------------|
| input | Yes | The input text or array of texts to embed. Can be a string or an array of strings. |
| model | Yes | The model to use for generating embeddings. |
| encoding_format | No | The format to return embeddings in. Supported values: "float" (default), "base64". |

Example request

=== "PowerShell"

```powershell
Invoke-WebRequest `
  -Uri "http://localhost:13305/v1/embeddings" `
  -Method POST `
  -Headers @{ "Content-Type" = "application/json" } `
  -Body '{
    "model": "nomic-embed-text-v1-GGUF",
    "input": ["Hello, world!", "How are you?"],
    "encoding_format": "float"
  }'
```

=== "Bash"

```bash
curl -X POST http://localhost:13305/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nomic-embed-text-v1-GGUF",
        "input": ["Hello, world!", "How are you?"],
        "encoding_format": "float"
      }'
```

Response format

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.0234, -0.0567, 0.0891, ...]
    },
    {
      "object": "embedding",
      "index": 1,
      "embedding": [0.0456, -0.0678, 0.1234, ...]
    }
  ],
  "model": "nomic-embed-text-v1-GGUF",
  "usage": {
    "prompt_tokens": 12,
    "total_tokens": 12
  }
}
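
For similarity comparisons, the vectors in a "float" response can be compared directly. The following is an illustrative sketch in plain Python; the helper names are hypothetical, and a numerical library would normally be used for larger batches.

```python
import math


def vectors_by_index(response):
    """Return the embedding vectors ordered by their "index" field."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

With the two-input request above, `first, second = vectors_by_index(response)` followed by `cosine_similarity(first, second)` gives a score between -1 and 1.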

Field Descriptions:

POST /v1/responses

Responses API. You provide an input and receive a response. This API will also load the model if it is not already loaded.

Parameters

| Parameter | Required | Description |
|-----------|----------|-------------|
| input | Yes | A list of dictionaries or a string input for the model to respond to. |
| model | Yes | The model to use for the response. |
| max_output_tokens | No | The maximum number of output tokens to generate. |
| temperature | No | What sampling temperature to use. |
| repeat_penalty | No | Number between 1.0 and 2.0. 1.0 means no penalty. Higher values discourage repetition. |
| top_k | No | Integer that controls the number of top tokens to consider during sampling. |
| top_p | No | Float between 0.0 and 1.0 that controls the cumulative probability of top tokens to consider during nucleus sampling. |
| stream | No | If true, tokens will be sent as they are generated. If false, the response will be sent as a single message once complete. Defaults to false. |

Streaming Events

The Responses API uses semantic events for streaming. Each event is typed with a predefined schema, so you can listen for events you care about. Our initial implementation only offers support to:

For a full list of event types, see the API reference for streaming.

Example request

=== "PowerShell"

```powershell
Invoke-WebRequest -Uri "http://localhost:13305/v1/responses" `
  -Method POST `
  -Headers @{ "Content-Type" = "application/json" } `
  -Body '{
    "model": "Llama-3.2-1B-Instruct-Hybrid",
    "input": "What is the population of Paris?",
    "stream": false
  }'
```

=== "Bash"

```bash
curl -X POST http://localhost:13305/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Llama-3.2-1B-Instruct-Hybrid",
        "input": "What is the population of Paris?",
        "stream": false
      }'
```

Response format

=== "Non-streaming responses"

```json
{
  "id": "0",
  "created_at": 1746225832.0,
  "model": "Llama-3.2-1B-Instruct-Hybrid",
  "object": "response",
  "output": [{
    "id": "0",
    "content": [{
      "annotations": [],
      "text": "Paris has a population of approximately 2.2 million people in the city proper."
    }]
  }]
}
```

=== "Streaming responses"

For streaming responses, the API returns a series of events. Refer to the OpenAI streaming guide for details.
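
For a non-streaming response, the generated text sits inside nested `output` and `content` arrays. A minimal sketch for pulling it out is shown here; `output_text` is a hypothetical helper, and it assumes every content part that carries text uses the `text` key seen in the example.

```python
def output_text(response):
    """Concatenate the text of every content part in a Responses API result."""
    return "".join(
        part.get("text", "")
        for item in response.get("output", [])
        for part in item.get("content", [])
    )
```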

POST /v1/audio/transcriptions

Audio Transcription API. You provide an audio file and receive a text transcription. This API will also load the model if it is not already loaded.

Note: This endpoint uses whisper.cpp as the backend. Whisper models are automatically downloaded when first used.

Limitations: Only wav audio format and json response format are currently supported.

Parameters

| Parameter | Required | Description |
|-----------|----------|-------------|
| file | Yes | The audio file to transcribe. Supported formats: wav. |
| model | Yes | The Whisper model to use for transcription (e.g., Whisper-Tiny, Whisper-Base, Whisper-Small). |
| language | No | The language of the audio (ISO 639-1 code, e.g., en, es, fr). If not specified, Whisper will auto-detect the language. |
| response_format | No | The format of the response. Currently only json is supported. |

Example request

=== "Windows"

```bash
curl -X POST http://localhost:13305/v1/audio/transcriptions ^
  -F "file=@C:\path\to\audio.wav" ^
  -F "model=Whisper-Tiny"
```

=== "Linux"

```bash
curl -X POST http://localhost:13305/v1/audio/transcriptions \
  -F "file=@/path/to/audio.wav" \
  -F "model=Whisper-Tiny"
```

Response format

{
  "text": "Hello, this is a sample transcription of the audio file."
}

Field Descriptions:

WS /realtime

Realtime Audio Transcription API via WebSocket (OpenAI SDK compatible). Stream audio from a microphone and receive transcriptions in real-time with Voice Activity Detection (VAD).

Limitations: Only 16kHz mono PCM16 audio format is supported. Uses the same Whisper models as the HTTP transcription endpoint.

Connection

The WebSocket server runs on a dynamically assigned port. Discover the port via the /v1/health endpoint (websocket_port field), then connect with the model name:

ws://localhost:<websocket_port>/realtime?model=Whisper-Tiny

Upon connection, the server sends a session.created message with a session ID.
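
The discovery step can be sketched as follows. The helper takes the already-parsed JSON body of /v1/health and builds the connection URL; `realtime_url` is a hypothetical name, and the `host` default is an assumption for a local server.

```python
from urllib.parse import urlencode


def realtime_url(health, model, host="localhost"):
    """Build the /realtime WebSocket URL from a parsed /v1/health response."""
    port = health["websocket_port"]  # dynamically assigned by the server
    return f"ws://{host}:{port}/realtime?{urlencode({'model': model})}"
```

The resulting URL can be passed to any WebSocket client.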

Client → Server Messages

| Message Type | Description |
|--------------|-------------|
| session.update | Configure the session (set model, VAD settings, or disable turn detection) |
| input_audio_buffer.append | Send audio data (base64-encoded PCM16) |
| input_audio_buffer.commit | Force transcription of buffered audio |
| input_audio_buffer.clear | Clear audio buffer without transcribing |

Server → Client Messages

| Message Type | Description |
|--------------|-------------|
| session.created | Session established, contains session ID |
| session.updated | Session configuration updated |
| input_audio_buffer.speech_started | VAD detected speech start |
| input_audio_buffer.speech_stopped | VAD detected speech end, transcription triggered |
| input_audio_buffer.committed | Audio buffer committed for transcription |
| input_audio_buffer.cleared | Audio buffer cleared |
| conversation.item.input_audio_transcription.delta | Interim/partial transcription (replaceable) |
| conversation.item.input_audio_transcription.completed | Final transcription result |
| error | Error message |

Example: Configure Session

{
  "type": "session.update",
  "session": {
    "model": "Whisper-Tiny"
  }
}

Example: Send Audio

{
  "type": "input_audio_buffer.append",
  "audio": "<base64-encoded PCM16 audio>"
}

Audio should be 16 kHz mono PCM16, base64-encoded.
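
A sketch of producing such a message from floating-point samples is shown below. It assumes the samples are already 16 kHz mono and that PCM16 means little-endian signed 16-bit integers, which is the common convention but is not stated in this spec; `append_message` is a hypothetical helper.

```python
import base64
import json
import struct


def append_message(samples):
    """Encode floats in [-1.0, 1.0] as PCM16 and wrap them in an append message."""
    clipped = [max(-1.0, min(1.0, sample)) for sample in samples]
    # "<h" = little-endian signed 16-bit integer (assumed byte order)
    pcm = struct.pack(f"<{len(clipped)}h", *(int(sample * 32767) for sample in clipped))
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm).decode("ascii"),
    })
```

Each returned string is one text frame to send over the WebSocket.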

Example: Transcription Result

{
  "type": "conversation.item.input_audio_transcription.completed",
  "transcript": "Hello, this is a test transcription."
}

VAD Configuration

VAD settings can be configured via session.update:

{
  "type": "session.update",
  "session": {
    "model": "Whisper-Tiny",
    "turn_detection": {
      "threshold": 0.01,
      "silence_duration_ms": 800,
      "prefix_padding_ms": 250
    }
  }
}

| Parameter | Default | Description |
|-----------|---------|-------------|
| threshold | 0.01 | RMS energy threshold for speech detection |
| silence_duration_ms | 800 | Silence duration to trigger speech end |
| prefix_padding_ms | 250 | Minimum speech duration before triggering |
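
To build intuition for the threshold value, the sketch below computes the RMS energy of a PCM16 chunk and compares it with the default of 0.01. This illustrates the general technique only and is not the server's VAD code; scaling samples to the range [-1, 1] by dividing by 32768 and the little-endian byte order are both assumptions.

```python
import math
import struct


def rms_energy(pcm16_bytes):
    """Root-mean-square energy of little-endian PCM16 audio, scaled to 0.0..1.0."""
    count = len(pcm16_bytes) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", pcm16_bytes[:count * 2])
    return math.sqrt(sum((s / 32768.0) ** 2 for s in samples) / count)


def is_speech(pcm16_bytes, threshold=0.01):
    """True when the chunk's RMS energy exceeds the threshold."""
    return rms_energy(pcm16_bytes) > threshold
```

Under these assumptions, a chunk with a peak amplitude of about 100 out of 32768 stays below the default threshold, while normal speech levels exceed it.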

Set turn_detection to null to disable server-side VAD and use explicit commits instead:

{
  "type": "session.update",
  "session": {
    "model": "Whisper-Tiny",
    "turn_detection": null
  }
}

Code Examples

See the examples/ directory for a complete, runnable example:

# Stream from microphone
python examples/realtime_transcription.py --model Whisper-Tiny

Integration Notes

POST /v1/images/generations

Image Generation API. You provide a text prompt and receive a generated image. This API uses stable-diffusion.cpp as the backend.

Note: Image generation uses Stable Diffusion models. Available models include SD-Turbo (fast, ~4 steps), SDXL-Turbo, SD-1.5, and SDXL-Base-1.0.

Performance: CPU inference takes ~4-5 minutes per image. GPU (Vulkan) is faster but may have compatibility issues with some hardware.

Parameters

| Parameter | Required | Description |
|-----------|----------|-------------|
| prompt | Yes | The text description of the image to generate. |
| model | Yes | The Stable Diffusion model to use (e.g., SD-Turbo, SDXL-Turbo). |
| size | No | The size of the generated image. Format: WIDTHxHEIGHT (e.g., 512x512, 256x256). Default: 512x512. |
| n | No | Number of images to generate. Currently only 1 is supported. |
| response_format | No | Format of the response. Only b64_json (base64-encoded image) is supported. |
| steps | No | Number of inference steps. SD-Turbo works well with 4 steps. Default varies by model. |
| cfg_scale | No | Classifier-free guidance scale. SD-Turbo uses low values (~1.0). Default varies by model. |
| seed | No | Random seed for reproducibility. If not specified, a random seed is used. |

Example request

=== "Bash"

```bash
curl -X POST http://localhost:13305/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
        "model": "SD-Turbo",
        "prompt": "A serene mountain landscape at sunset",
        "size": "512x512",
        "steps": 4,
        "response_format": "b64_json"
      }'
```
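
Because only b64_json is supported, the client has to decode the image itself. The sketch below assumes the response carries the image under `data[*].b64_json`, as the upscale workflow later in this spec reads it. The helper names are hypothetical, and the PNG check is only a sanity test on the decoded bytes, not a statement about the format the server returns.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_images(response):
    """Decode every b64_json entry of an images response into raw bytes."""
    return [base64.b64decode(item["b64_json"]) for item in response["data"]]


def looks_like_png(data):
    """True when the bytes start with the 8-byte PNG file signature."""
    return data.startswith(PNG_SIGNATURE)
```

The decoded bytes can be written to disk with `open("image.png", "wb")`.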

POST /v1/images/edits

Image Editing API. You provide a source image and a text prompt describing the desired change, and receive an edited image. This API uses stable-diffusion.cpp as the backend.

Note: This endpoint accepts multipart/form-data requests (not JSON). Use editing-capable models such as Flux-2-Klein-4B or SD-Turbo.

Performance: CPU inference takes several minutes per image. GPU (ROCm) is significantly faster.

Parameters

| Parameter | Required | Description |
|-----------|----------|-------------|
| model | Yes | The Stable Diffusion model to use (e.g., Flux-2-Klein-4B, SD-Turbo). |
| image | Yes | The source image file to edit (PNG). Sent as a file in multipart/form-data. |
| prompt | Yes | A text description of the desired edit. |
| mask | No | An optional mask image (PNG). White areas indicate regions to edit; black areas are preserved. |
| size | No | The size of the output image. Format: WIDTHxHEIGHT (e.g., 512x512). Default: 512x512. |
| n | No | Number of images to generate. Allowed range: 1-10. Default: 1. Values outside this range are rejected with 400 Bad Request. |
| response_format | No | Format of the response. Only b64_json (base64-encoded image) is supported. |
| steps | No | Number of inference steps. Default varies by model. |
| cfg_scale | No | Classifier-free guidance scale. Default varies by model. |
| seed | No | Random seed for reproducibility. |
| user | No | OpenAI API compatibility field. Accepted but not forwarded to the backend. |
| background | No | OpenAI API compatibility field. Accepted but not forwarded to the backend. |
| quality | No | OpenAI API compatibility field. Accepted but not forwarded to the backend. |
| input_fidelity | No | OpenAI API compatibility field. Accepted but not forwarded to the backend. |
| output_compression | No | OpenAI API compatibility field. Accepted; silently ignored by the backend. |

Example request

=== "Bash"

```bash
curl -X POST http://localhost:13305/v1/images/edits \
  -F "model=Flux-2-Klein-4B" \
  -F "prompt=Add a red barn and mountains in the background, photorealistic" \
  -F "size=512x512" \
  -F "n=1" \
  -F "response_format=b64_json" \
  -F "image=@/path/to/source_image.png"
```

=== "Python (OpenAI client)"

```python
import base64

from openai import OpenAI

# Point the OpenAI client at the local server; the API key is a placeholder.
client = OpenAI(base_url="http://localhost:13305/v1", api_key="not-needed")

with open("source_image.png", "rb") as image_file:
    response = client.images.edit(
        model="Flux-2-Klein-4B",
        image=image_file,
        prompt="Add a red barn and mountains in the background, photorealistic",
        size="512x512",
    )

# The edited image is returned base64-encoded in b64_json.
image_data = base64.b64decode(response.data[0].b64_json)
with open("edited_image.png", "wb") as out_file:
    out_file.write(image_data)
```

POST /v1/images/variations

Image Variations API. You provide a source image and receive a variation of it. This API uses stable-diffusion.cpp as the backend.

Note: This endpoint accepts multipart/form-data requests (not JSON). Unlike /images/edits, a prompt parameter is not supported and will be ignored — the model generates a variation based solely on the input image.

Performance: CPU inference takes several minutes per image. GPU (ROCm) is significantly faster.

Parameters

| Parameter | Required | Description |
|-----------|----------|-------------|
| model | Yes | The Stable Diffusion model to use (e.g., Flux-2-Klein-4B, SD-Turbo). |
| image | Yes | The source image file (PNG). Sent as a file in multipart/form-data. |
| size | No | The size of the output image. Format: WIDTHxHEIGHT (e.g., 512x512). Default: 512x512. |
| n | No | Number of variations to generate. Integer between 1 and 10 inclusive. Default: 1. Values outside this range result in a 400 Bad Request error. |
| response_format | No | Format of the response. Only b64_json (base64-encoded image) is supported. |
| user | No | OpenAI API compatibility field. Accepted but not forwarded to the backend. |

Example request

=== "Bash"

```bash
curl -X POST http://localhost:13305/v1/images/variations \
  -F "model=Flux-2-Klein-4B" \
  -F "size=512x512" \
  -F "n=1" \
  -F "response_format=b64_json" \
  -F "image=@/path/to/source_image.png"
```

=== "Python (OpenAI client)"

```python
import base64

from openai import OpenAI

# Point the OpenAI client at the local server; the API key is a placeholder.
client = OpenAI(base_url="http://localhost:13305/v1", api_key="not-needed")

with open("source_image.png", "rb") as image_file:
    response = client.images.create_variation(
        model="Flux-2-Klein-4B",
        image=image_file,
        size="512x512",
        n=1,
    )

# The variation is returned base64-encoded in b64_json.
image_data = base64.b64decode(response.data[0].b64_json)
with open("variation.png", "wb") as out_file:
    out_file.write(image_data)
```

POST /v1/images/upscale

Image Upscaling API. You provide a base64-encoded image and a Real-ESRGAN model name, and receive a 4x upscaled image. This API uses the sd-cli binary from stable-diffusion.cpp to perform super-resolution.

Note: Available upscale models are RealESRGAN-x4plus (general-purpose, 64 MB) and RealESRGAN-x4plus-anime (optimized for anime-style art, 17 MB). Both produce a 4x resolution increase (e.g., 256x256 → 1024x1024).

Note: Unlike /images/edits and /images/variations, this endpoint accepts a JSON body (not multipart/form-data). The image must be provided as a base64-encoded string.

Parameters

| Parameter | Required | Description |
|-----------|----------|-------------|
| image | Yes | Base64-encoded PNG image to upscale. |
| model | Yes | The ESRGAN model to use (e.g., RealESRGAN-x4plus, RealESRGAN-x4plus-anime). |

Example request

A typical workflow is to generate an image first, then upscale it:

=== "Bash"

```bash
# Step 1: Generate an image and save the base64 response
RESPONSE=$(curl -s -X POST http://localhost:13305/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
        "model": "SD-Turbo",
        "prompt": "A serene mountain landscape at sunset",
        "size": "512x512",
        "steps": 4,
        "response_format": "b64_json"
      }')

# Step 2: Build the upscale JSON payload and pipe it to curl via stdin
# (base64 images are too large for command-line interpolation)
echo "$RESPONSE" | python3 -c "
import sys, json
b64 = json.load(sys.stdin)['data'][0]['b64_json']
print(json.dumps({'image': b64, 'model': 'RealESRGAN-x4plus'}))
" | curl -X POST http://localhost:13305/v1/images/upscale \
  -H "Content-Type: application/json" \
  -d @-
```

=== "PowerShell"

```powershell
# Step 1: Generate an image
$genResponse = Invoke-WebRequest `
  -Uri "http://localhost:13305/v1/images/generations" `
  -Method POST `
  -Headers @{ "Content-Type" = "application/json" } `
  -Body '{
    "model": "SD-Turbo",
    "prompt": "A serene mountain landscape at sunset",
    "size": "512x512",
    "steps": 4,
    "response_format": "b64_json"
  }'

# Step 2: Extract the base64 image
$imageB64 = ($genResponse.Content | ConvertFrom-Json).data[0].b64_json

# Step 3: Upscale the image with Real-ESRGAN
$body = @{ image = $imageB64; model = "RealESRGAN-x4plus" } | ConvertTo-Json
Invoke-WebRequest `
  -Uri "http://localhost:13305/v1/images/upscale" `
  -Method POST `
  -Headers @{ "Content-Type" = "application/json" } `
  -Body $body
```

=== "Python (requests)"

```python
import requests
import base64

BASE_URL = "http://localhost:13305/v1"

# Step 1: Generate an image
gen_response = requests.post(f"{BASE_URL}/images/generations", json={
    "model": "SD-Turbo",
    "prompt": "A serene mountain landscape at sunset",
    "size": "512x512",
    "steps": 4,
    "response_format": "b64_json",
})
image_b64 = gen_response.json()["data"][0]["b64_json"]

# Step 2: Upscale the image with Real-ESRGAN (512x512 -> 2048x2048)
upscale_response = requests.post(f"{BASE_URL}/images/upscale", json={
    "image": image_b64,
    "model": "RealESRGAN-x4plus",
})

# Step 3: Save the upscaled image to a file
upscaled_b64 = upscale_response.json()["data"][0]["b64_json"]
with open("upscaled.png", "wb") as f:
    f.write(base64.b64decode(upscaled_b64))
```

Response format

{
  "created": 1742927481,
  "data": [
    {
      "b64_json": "<base64-encoded upscaled PNG>"
    }
  ]
}

Field Descriptions:

Error responses

| Status Code | Condition | Example |
|-------------|-----------|---------|
| 400 | Missing image field | `{"error": {"message": "Missing 'image' field (base64 encoded)", "type": "invalid_request_error"}}` |
| 400 | Missing model field | `{"error": {"message": "Missing 'model' field", "type": "invalid_request_error"}}` |
| 404 | Unknown model name | `{"error": {"message": "Upscale model not found: bad-model", "type": "invalid_request_error"}}` |
| 500 | Upscale failed | `{"error": {"message": "ESRGAN upscale failed", "type": "server_error"}}` |
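
A client can turn these bodies into exceptions. The following is a minimal sketch that relies only on the `error.message` and `error.type` fields shown above; `LemonadeError` and `raise_for_error` are hypothetical names, and the fallback for unparseable bodies is an assumption.

```python
import json


class LemonadeError(Exception):
    """Carries the HTTP status plus the error type and message from the body."""

    def __init__(self, status, kind, message):
        super().__init__(f"{status} {kind}: {message}")
        self.status, self.kind, self.message = status, kind, message


def raise_for_error(status, body):
    """Raise LemonadeError for 4xx/5xx responses; return None otherwise."""
    if status < 400:
        return
    try:
        parsed = json.loads(body)
    except ValueError:
        parsed = None  # body was not JSON, fall back to the raw text
    err = parsed.get("error") if isinstance(parsed, dict) else None
    if not isinstance(err, dict):
        err = {}
    raise LemonadeError(status, err.get("type", "unknown"), err.get("message", body))
```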

POST /v1/audio/speech

Speech Generation API. You provide a text input and receive an audio file. This API uses Kokoros as the backend.

Note: The model to use is called kokoro-v1. No other model is supported at the moment.

Limitations: Only the mp3, wav, opus, and pcm output formats are supported. Streaming is supported only with stream_format set to audio, which produces pcm output.

Parameters

| Parameter | Required | Description |
|-----------|----------|-------------|
| input | Yes | The text to speak. |
| model | Yes | The model to use (e.g., kokoro-v1). |
| speed | No | Speaking speed. Default: 1.0. |
| voice | No | The voice to use. All OpenAI-defined voices can be used (alloy, ash, …), as well as those defined by the kokoro model (af_sky, am_echo, …). Default: shimmer. |
| response_format | No | Format of the response. mp3, wav, opus, and pcm are supported. Default: mp3. |
| stream_format | No | If set, the response will be streamed. Only audio is supported, which will output pcm audio. Default: not set. |

Example request

=== "Bash"

```bash
curl -X POST http://localhost:13305/v1/audio/speech \
  -H "Content-Type: application/json" \
  --output speech.mp3 \
  -d '{
        "model": "kokoro-v1",
        "input": "Lemonade can speak!",
        "speed": 1.0,
        "response_format": "mp3"
      }'
```

Response format

The generated audio file is returned as-is.

GET /v1/models

Returns a list of models available on the server in an OpenAI-compatible format. Each model object includes extended fields like checkpoint, recipe, size, downloaded, and labels.

By default, only models available locally (downloaded) are shown, matching OpenAI API behavior.

Parameters

| Parameter | Required | Description |
|-----------|----------|-------------|
| show_all | No | If set to true, returns all models from the catalog including those not yet downloaded. Defaults to false. |

Example request

# Show only downloaded models (OpenAI-compatible)
curl http://localhost:13305/v1/models

# Show all models including not-yet-downloaded (extended usage)
curl "http://localhost:13305/v1/models?show_all=true"

Response format

{
  "object": "list",
  "data": [
    {
      "id": "Qwen3-0.6B-GGUF",
      "created": 1744173590,
      "object": "model",
      "owned_by": "lemonade",
      "checkpoint": "unsloth/Qwen3-0.6B-GGUF:Q4_0",
      "recipe": "llamacpp",
      "size": 0.38,
      "downloaded": true,
      "suggested": true,
      "labels": ["reasoning"]
    },
    {
      "id": "Gemma-3-4b-it-GGUF",
      "created": 1744173590,
      "object": "model",
      "owned_by": "lemonade",
      "checkpoint": "ggml-org/gemma-3-4b-it-GGUF:Q4_K_M",
      "recipe": "llamacpp",
      "size": 3.61,
      "downloaded": true,
      "suggested": true,
      "labels": ["hot", "vision"]
    },
    {
      "id": "SD-Turbo",
      "created": 1744173590,
      "object": "model",
      "owned_by": "lemonade",
      "checkpoint": "stabilityai/sd-turbo:sd_turbo.safetensors",
      "recipe": "sd-cpp",
      "size": 5.2,
      "downloaded": true,
      "suggested": true,
      "labels": ["image"],
      "image_defaults": {
        "steps": 4,
        "cfg_scale": 1.0,
        "width": 512,
        "height": 512
      }
    }
  ]
}
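
The extended fields make it easy to pick a model by capability. The sketch below filters a parsed list response by label; `models_with_label` is a hypothetical helper, and the label names are taken from the example above.

```python
def models_with_label(listing, label, downloaded_only=True):
    """Return the IDs of models carrying a label, optionally only downloaded ones."""
    return [
        model["id"]
        for model in listing.get("data", [])
        if label in model.get("labels", [])
        and (model.get("downloaded", False) or not downloaded_only)
    ]
```

For the example response, `models_with_label(listing, "vision")` selects Gemma-3-4b-it-GGUF.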

Field Descriptions:

GET /v1/models/{model_id}

Retrieve a specific model by its ID. Returns the same model object format as the list endpoint above.

Parameters

| Parameter | Required | Description |
|-----------|----------|-------------|
| model_id | Yes | The ID of the model to retrieve. Must match one of the model IDs from the models list. |

Example request

curl http://localhost:13305/v1/models/Qwen3-0.6B-GGUF

Response format

Returns a single model object with the same fields as described in the models list endpoint above.

{
  "id": "Qwen3-0.6B-GGUF",
  "created": 1744173590,
  "object": "model",
  "owned_by": "lemonade",
  "checkpoint": "unsloth/Qwen3-0.6B-GGUF:Q4_0",
  "recipe": "llamacpp",
  "size": 0.38,
  "downloaded": true,
  "suggested": true,
  "labels": ["reasoning"],
  "recipe_options": {
    "ctx_size": 8192,
    "llamacpp_args": "--no-mmap",
    "llamacpp_backend": "rocm"
  }
}

Error responses

If the model is not found, the endpoint returns a 404 error:

{
  "error": {
    "message": "Model Qwen3-0.6B-GGUF has not been found",
    "type": "not_found"
  }
}