This page documents Lemonade’s llama.cpp-specific compatibility surface.

| Method | Endpoint | Description | Modality |
|---|---|---|---|
| POST | `/v1/reranking` | Reranking | query + documents -> relevance-scored documents |
| GET | `/v1/slots` | Returns the current slots processing state | slots state |
| POST | `/v1/slots/{id}?action=save` | Save the prompt cache of the specified slot to a file | prompt cache |
| POST | `/v1/slots/{id}?action=restore` | Restore the prompt cache of the specified slot from a file | prompt cache |
| POST | `/v1/slots/{id}?action=erase` | Erase the prompt cache of the specified slot | prompt cache |
| POST | `/v1/tokenize` | Tokenize a given text | tokenization |

POST /v1/reranking

Reranking API for llama.cpp-compatible reranker models. You provide a query and a list of documents, and receive a relevance score for each document. Lemonade loads the requested model automatically if it is not already loaded.

Note: This endpoint is part of Lemonade’s llama.cpp compatibility layer. Internally, Lemonade forwards the request to llama.cpp’s `/v1/rerank` endpoint.

Note: This endpoint is only available for reranker-specific models using the `llamacpp` recipe, such as `bge-reranker-v2-m3-GGUF`.

| Parameter | Required | Description | Status |
|---|---|---|---|
| `query` | Yes | The search query text. | |
| `documents` | Yes | Array of document strings to score against the query. | |
| `model` | Yes | The reranking model to use. If not already loaded, Lemonade loads it before forwarding the request. | |

=== "PowerShell"

    ```powershell
    Invoke-WebRequest `
      -Uri "http://localhost:13305/v1/reranking" `
      -Method POST `
      -Headers @{ "Content-Type" = "application/json" } `
      -Body '{
        "model": "bge-reranker-v2-m3-GGUF",
        "query": "What is the capital of France?",
        "documents": [
          "Paris is the capital of France.",
          "Berlin is the capital of Germany.",
          "Madrid is the capital of Spain."
        ]
      }' -UseBasicParsing
    ```

=== "Bash"

    ```bash
    curl -X POST http://localhost:13305/v1/reranking \
      -H "Content-Type: application/json" \
      -d '{
        "model": "bge-reranker-v2-m3-GGUF",
        "query": "What is the capital of France?",
        "documents": [
          "Paris is the capital of France.",
          "Berlin is the capital of Germany.",
          "Madrid is the capital of Spain."
        ]
      }'
    ```

```json
{
  "model": "bge-reranker-v2-m3-GGUF",
  "object": "list",
  "results": [
    {
      "index": 0,
      "relevance_score": 8.60673713684082
    },
    {
      "index": 1,
      "relevance_score": -5.3886260986328125
    },
    {
      "index": 2,
      "relevance_score": -3.555561065673828
    }
  ],
  "usage": {
    "prompt_tokens": 51,
    "total_tokens": 51
  }
}
```

Field Descriptions:

- `model` - Model identifier used for reranking
- `object` - Type of response object, always `"list"`
- `results` - Array of all input documents with relevance scores
    - `index` - Original index of the document in the input array
    - `relevance_score` - Relevance score assigned by the model; higher means more relevant
- `usage` - Token usage statistics
    - `prompt_tokens` - Number of tokens in the input
    - `total_tokens` - Total tokens processed

Note: Results are returned in input order. To rank documents by relevance, sort `results` by `relevance_score` in descending order on the client side.
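
As a minimal sketch of the client-side sorting described in the note above, the following Python snippet pairs each score with its source document and orders the pairs by descending `relevance_score`. The function name `rank_documents` is hypothetical, and the `response` dictionary is a trimmed copy of the sample response on this page rather than output from a live server.

```python
from typing import Any


def rank_documents(documents: list[str], response: dict[str, Any]) -> list[tuple[float, str]]:
    """Pair each document with its score and sort by descending relevance."""
    scored = [
        # "index" points back into the documents list that was sent in the request.
        (result["relevance_score"], documents[result["index"]])
        for result in response["results"]
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Same documents as in the request examples on this page.
documents = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "Madrid is the capital of Spain.",
]

# Trimmed copy of the sample response (illustrative, not live output).
response = {
    "results": [
        {"index": 0, "relevance_score": 8.60673713684082},
        {"index": 1, "relevance_score": -5.3886260986328125},
        {"index": 2, "relevance_score": -3.555561065673828},
    ]
}

for score, text in rank_documents(documents, response):
    print(f"{score:8.3f}  {text}")
```

Looking documents up through `index` rather than by list position keeps the pairing correct even if a server were to return results in a different order.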

GET /v1/slots

Returns the current state of all processing slots in the llama.cpp server. Slots are parallel processing contexts that let the server handle multiple requests concurrently.

Note: This endpoint is part of Lemonade’s llama.cpp compatibility layer. Internally, Lemonade forwards the request to llama.cpp’s `/slots` endpoint.

Note: This endpoint is only available when a llama.cpp model is loaded.

Note: This endpoint supports all four path prefixes: `/api/v0/slots`, `/api/v1/slots`, `/v0/slots`, and `/v1/slots`.

This endpoint accepts no parameters.

=== "PowerShell"

    ```powershell
    Invoke-WebRequest `
      -Uri "http://localhost:13305/v1/slots" `
      -Method GET -UseBasicParsing
    ```

=== "Bash"

    ```bash
    curl http://localhost:13305/v1/slots
    ```

```json
[
  {
    "id": 0,
    "state": "idle",
    "next_token": {
      "has_next_token": false,
      "n_remain": 0,
      "n_decoded": 0
    },
    "task_id": -1,
    "cache_tokens": 1024
  },
  {
    "id": 1,
    "state": "processing",
    "next_token": {
      "has_next_token": true,
      "n_remain": 42,
      "n_decoded": 15
    },
    "task_id": 123,
    "cache_tokens": 512
  }
]
```
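
One way to use this response is to find slots that are free before saving or restoring a cache. The sketch below assumes the field names from the sample response above (`id`, `state`, `cache_tokens`); the helper `summarize_slots` is hypothetical, and the exact fields may differ between llama.cpp versions.

```python
from typing import Any


def summarize_slots(slots: list[dict[str, Any]]) -> dict[str, Any]:
    """Summarize a /v1/slots response: idle slot IDs, busy slot IDs, cached tokens."""
    idle = [slot["id"] for slot in slots if slot["state"] == "idle"]
    busy = [slot["id"] for slot in slots if slot["state"] != "idle"]
    # Treat a missing cache_tokens field as zero (an assumption of this sketch).
    cached = sum(slot.get("cache_tokens", 0) for slot in slots)
    return {"idle": idle, "busy": busy, "cache_tokens": cached}


# Trimmed copy of the sample response (illustrative, not live output).
sample_slots = [
    {"id": 0, "state": "idle", "task_id": -1, "cache_tokens": 1024},
    {"id": 1, "state": "processing", "task_id": 123, "cache_tokens": 512},
]

print(summarize_slots(sample_slots))
```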

Field Descriptions:

- `id` - Unique identifier for the slot
- `state` - Current processing state (`"idle"`, `"processing"`, etc.)
- `next_token` - Information about token generation state
    - `has_next_token` - Whether more tokens are expected
    - `n_remain` - Number of tokens remaining to generate
    - `n_decoded` - Number of tokens already decoded
- `task_id` - Identifier of the current task being processed (-1 if idle)
- `cache_tokens` - Number of cached tokens in the slot’s prompt cache

POST /v1/slots/{id}?action=save

Save the prompt cache of a specific slot to a file. This lets you persist the current context state for later restoration.

Note: This endpoint is part of Lemonade’s llama.cpp compatibility layer. Internally, Lemonade forwards the request to llama.cpp’s `/slots/{id}?action=save` endpoint.

Note: The llama.cpp server must be started with the `--slot-save-path` argument for save operations to work. See Server Configuration for details on configuring backend arguments.

Example configuration:

`lemonade config set llamacpp.args="--slot-save-path /path/to/slot/saves"`

Note: This endpoint supports all four path prefixes: `/api/v0/slots/{id}`, `/api/v1/slots/{id}`, `/v0/slots/{id}`, and `/v1/slots/{id}`.

| Parameter | Required | Description | Status |
|---|---|---|---|
| `id` | Yes | The slot ID to save (path parameter). | |
| `filename` | Yes | The filename where the slot cache should be saved (JSON body). | |

=== "PowerShell"

    ```powershell
    Invoke-WebRequest `
      -Uri "http://localhost:13305/v1/slots/0?action=save" `
      -Method POST `
      -Headers @{ "Content-Type" = "application/json" } `
      -Body '{"filename": "my_conversation_cache.bin"}' -UseBasicParsing
    ```

=== "PowerShell (/api/v1)"

    ```powershell
    Invoke-WebRequest `
      -Uri "http://localhost:13305/api/v1/slots/0?action=save" `
      -Method POST `
      -Headers @{ "Content-Type" = "application/json" } `
      -Body '{"filename": "my_conversation_cache.bin"}' -UseBasicParsing
    ```

=== "Bash"

    ```bash
    curl -X POST "http://localhost:13305/v1/slots/0?action=save" \
      -H "Content-Type: application/json" \
      -d '{"filename": "my_conversation_cache.bin"}'
    ```

```json
{
  "id_slot": 0,
  "filename": "my_conversation_cache.bin",
  "n_saved": 1024
}
```

Field Descriptions:

- `id_slot` - The slot ID that was saved
- `filename` - The filename where the cache was saved
- `n_saved` - Number of tokens saved to the cache file

POST /v1/slots/{id}?action=restore

Restore the prompt cache of a specific slot from a previously saved file. This lets you resume a conversation or context from where you left off.

Note: This endpoint is part of Lemonade’s llama.cpp compatibility layer. Internally, Lemonade forwards the request to llama.cpp’s `/slots/{id}?action=restore` endpoint.

Note: The llama.cpp server must be started with the `--slot-save-path` argument for restore operations to work.

Note: This endpoint supports all four path prefixes: `/api/v0/slots/{id}`, `/api/v1/slots/{id}`, `/v0/slots/{id}`, and `/v1/slots/{id}`.

| Parameter | Required | Description | Status |
|---|---|---|---|
| `id` | Yes | The slot ID to restore to (path parameter). | |
| `filename` | Yes | The filename from which to restore the slot cache (JSON body). | |

=== "PowerShell"

    ```powershell
    Invoke-WebRequest `
      -Uri "http://localhost:13305/v1/slots/0?action=restore" `
      -Method POST `
      -Headers @{ "Content-Type" = "application/json" } `
      -Body '{"filename": "my_conversation_cache.bin"}' -UseBasicParsing
    ```

=== "PowerShell (/api/v1)"

    ```powershell
    Invoke-WebRequest `
      -Uri "http://localhost:13305/api/v1/slots/0?action=restore" `
      -Method POST `
      -Headers @{ "Content-Type" = "application/json" } `
      -Body '{"filename": "my_conversation_cache.bin"}' -UseBasicParsing
    ```

=== "Bash"

    ```bash
    curl -X POST "http://localhost:13305/v1/slots/0?action=restore" \
      -H "Content-Type: application/json" \
      -d '{"filename": "my_conversation_cache.bin"}'
    ```

```json
{
  "id_slot": 0,
  "filename": "my_conversation_cache.bin",
  "n_restored": 1024
}
```
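
A client that saves a cache and restores it later can compare the two responses as a sanity check. The sketch below assumes that restoring an unmodified file reports the same token count that was saved; that expectation is an assumption of this example, not a documented guarantee, and `round_trip_matches` is a hypothetical helper.

```python
from typing import Any


def round_trip_matches(save_response: dict[str, Any], restore_response: dict[str, Any]) -> bool:
    """Return True when a restore response mirrors the earlier save response."""
    return (
        save_response["filename"] == restore_response["filename"]
        # Assumption: an unmodified cache file restores as many tokens as were saved.
        and save_response["n_saved"] == restore_response["n_restored"]
    )


# Copies of the sample save and restore responses on this page (not live output).
save_response = {"id_slot": 0, "filename": "my_conversation_cache.bin", "n_saved": 1024}
restore_response = {"id_slot": 0, "filename": "my_conversation_cache.bin", "n_restored": 1024}

print(round_trip_matches(save_response, restore_response))
```

The slot IDs are deliberately not compared, since a cache saved from one slot may be restored into another.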

Field Descriptions:

- `id_slot` - The slot ID that was restored
- `filename` - The filename from which the cache was restored
- `n_restored` - Number of tokens restored from the cache file

POST /v1/slots/{id}?action=erase

Erase (clear) the prompt cache of a specific slot. This removes all cached context from the slot, resetting it to an empty state.

Note: This endpoint is part of Lemonade’s llama.cpp compatibility layer. Internally, Lemonade forwards the request to llama.cpp’s `/slots/{id}?action=erase` endpoint.

Note: This endpoint supports all four path prefixes: `/api/v0/slots/{id}`, `/api/v1/slots/{id}`, `/v0/slots/{id}`, and `/v1/slots/{id}`.

| Parameter | Required | Description | Status |
|---|---|---|---|
| `id` | Yes | The slot ID to erase (path parameter). | |

=== "PowerShell"

    ```powershell
    Invoke-WebRequest `
      -Uri "http://localhost:13305/v1/slots/0?action=erase" `
      -Method POST -UseBasicParsing
    ```

=== "PowerShell (/api/v1)"

    ```powershell
    Invoke-WebRequest `
      -Uri "http://localhost:13305/api/v1/slots/0?action=erase" `
      -Method POST -UseBasicParsing
    ```

=== "Bash"

    ```bash
    curl -X POST "http://localhost:13305/v1/slots/0?action=erase"
    ```

```json
{
  "id_slot": 0
}
```

Field Descriptions:

- `id_slot` - The slot ID that was erased

Note: If the server returns an error, it may indicate that the slot was not found or that the operation failed.
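
The save, restore, and erase actions share one URL shape and differ only in the `action` query parameter and in whether a JSON body is sent. As a sketch of that pattern using only the Python standard library, the hypothetical helper `build_slot_request` below constructs the request without sending it. The base URL matches the examples on this page and may differ in your setup.

```python
import json
import urllib.request
from typing import Optional


def build_slot_request(
    base_url: str, slot_id: int, action: str, filename: Optional[str] = None
) -> urllib.request.Request:
    """Build (but do not send) a POST request for a slot action."""
    if action not in {"save", "restore", "erase"}:
        raise ValueError(f"unsupported action: {action}")
    if action in {"save", "restore"} and not filename:
        raise ValueError(f"action '{action}' requires a filename")

    url = f"{base_url.rstrip('/')}/v1/slots/{slot_id}?action={action}"
    if action == "erase":
        # Erase takes no JSON body, matching the curl example above.
        return urllib.request.Request(url, method="POST")

    body = json.dumps({"filename": filename}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


save_request = build_slot_request(
    "http://localhost:13305", 0, "save", "my_conversation_cache.bin"
)
erase_request = build_slot_request("http://localhost:13305", 0, "erase")

print(save_request.get_method(), save_request.full_url)
print(erase_request.get_method(), erase_request.full_url)
```

Passing either request to `urllib.request.urlopen` would send it to a running server; that step is left out here so the sketch runs without one.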

POST /v1/tokenize

Tokenizes the given text. The request does not count towards the current model’s context window.

Note: This endpoint is part of Lemonade’s llama.cpp compatibility layer. Internally, Lemonade forwards the request to llama.cpp’s `/tokenize` endpoint.

Note: This endpoint supports all four path prefixes: `/api/v0/tokenize`, `/api/v1/tokenize`, `/v0/tokenize`, and `/v1/tokenize`.

Note: Response values for the same string may differ between models that do not share a tokenizer.

| Parameter | Required | Description | Status |
|---|---|---|---|
| `content` | Yes | The text to tokenize. | |
| `add_special` | No | Boolean indicating whether special tokens, such as BOS, should be inserted. Default: `false` | |
| `parse_special` | No | Boolean indicating whether special tokens should be tokenized. When `false`, special tokens are treated as plaintext. Default: `true` | |
| `with_pieces` | No | Boolean indicating whether to return token pieces along with IDs. Default: `false` | |

=== "PowerShell"

    ```powershell
    Invoke-WebRequest `
      -Uri "http://localhost:13305/v1/tokenize" `
      -Method POST `
      -Headers @{ "Content-Type" = "application/json" } `
      -Body '{"content": "This is a string to tokenize"}' -UseBasicParsing
    ```

=== "PowerShell (/api/v1)"

    ```powershell
    Invoke-WebRequest `
      -Uri "http://localhost:13305/api/v1/tokenize" `
      -Method POST `
      -Headers @{ "Content-Type" = "application/json" } `
      -Body '{"content": "This is a string to tokenize"}' -UseBasicParsing
    ```

=== "Bash"

    ```bash
    curl -X POST "http://localhost:13305/v1/tokenize" \
      -H "Content-Type: application/json" \
      -d '{"content": "This is a string to tokenize"}'
    ```

```json
{
  "tokens": [1919, 369, 264, 886, 310, 74995]
}
```

If `with_pieces` is `true`:

```json
{
  "tokens": [
    {"id": 123, "piece": "Hello"},
    {"id": 456, "piece": " world"},
    {"id": 789, "piece": "!"}
  ]
}
```

Field Descriptions:

- `tokens` - Array of token IDs, or an array of objects with `id` and `piece` fields when `with_pieces` is `true`
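
Because the `tokens` array takes two shapes depending on `with_pieces`, a client that only needs IDs or a token count has to handle both. The following is a minimal sketch with hypothetical helper names (`token_ids`, `token_pieces`); the inputs are copies of the two sample responses on this page, not live output.

```python
from typing import Any


def token_ids(response: dict[str, Any]) -> list[int]:
    """Extract token IDs from either response shape."""
    return [
        # With with_pieces=true each entry is an object; otherwise it is a bare ID.
        token["id"] if isinstance(token, dict) else token
        for token in response["tokens"]
    ]


def token_pieces(response: dict[str, Any]) -> list[str]:
    """Return token pieces, or an empty list when the response carries only IDs."""
    return [token["piece"] for token in response["tokens"] if isinstance(token, dict)]


plain = {"tokens": [1919, 369, 264, 886, 310, 74995]}
with_pieces = {
    "tokens": [
        {"id": 123, "piece": "Hello"},
        {"id": 456, "piece": " world"},
        {"id": 789, "piece": "!"},
    ]
}

print(len(token_ids(plain)))                 # token count for the plain response
print(token_ids(with_pieces))                # IDs only
print("".join(token_pieces(with_pieces)))    # pieces joined back into text
```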