# Multi-Model Support
Lemonade supports loading multiple models simultaneously, allowing you to keep frequently-used models in memory for faster switching. The server uses a Least Recently Used (LRU) cache policy to automatically manage model eviction when limits are reached.
## Configuration
Configure via `lemonade config set max_loaded_models=N`. See Server Configuration.

Default: `1` (one model of each type). Use `-1` for unlimited.
## Model Types
Models are categorized into these types:
- **LLM** - Chat and completion models (default type)
- **Embedding** - Models for generating text embeddings (identified by the `embeddings` label)
- **Reranking** - Models for document reranking (identified by the `reranking` label)
- **Audio** - Models for audio transcription using Whisper (identified by the `audio` label)
- **Image** - Models for image generation (identified by the `image` label)
Each type has its own independent LRU cache, all sharing the same slot limit set by `max_loaded_models`.
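The per-type caches described above can be sketched as follows. This is an illustrative model, not Lemonade's internals; the type names, the limit value, and the cache structure are assumptions for the example:

```python
from collections import OrderedDict

MAX_LOADED_MODELS = 2  # illustrative value of max_loaded_models

# One independent LRU cache per model type; each cache can hold up to
# MAX_LOADED_MODELS entries (hypothetical type keys, for illustration).
caches = {t: OrderedDict() for t in ("llm", "embedding", "reranking", "audio", "image")}

def load_model(model_type, name):
    """Load a model of the given type, evicting that type's LRU entry if full."""
    cache = caches[model_type]
    if name in cache:
        cache.move_to_end(name)       # already loaded: mark as most recently used
        return
    if len(cache) >= MAX_LOADED_MODELS:
        cache.popitem(last=False)     # evict the least recently used model
    cache[name] = object()            # stand-in for the loaded model handle
```

Note that filling the LLM cache has no effect on the embedding cache: each type's slots are counted separately, even though they share the same numeric limit.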
## Device Constraints
- **NPU Exclusivity:** `flm`, `ryzenai-llm`, and `whispercpp` are mutually exclusive on the NPU. Loading a model from one of these backends will automatically evict all NPU models from the other backends.
  - `flm` supports loading 1 ASR model, 1 LLM, and 1 embedding model on the NPU at the same time.
  - `ryzenai-llm` supports loading exactly 1 LLM, which uses the entire NPU.
  - `whispercpp` supports loading exactly 1 ASR model at a time, which uses the entire NPU.
- **CPU/GPU:** No inherent limits beyond available RAM. Multiple models can coexist on CPU or GPU.
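The NPU exclusivity rule can be sketched as a small model. The data structure and function below are assumptions for illustration only; only the backend names come from the documentation:

```python
NPU_EXCLUSIVE_BACKENDS = {"flm", "ryzenai-llm", "whispercpp"}

# Maps backend name -> list of models it currently holds on the NPU
# (hypothetical bookkeeping structure, for illustration).
npu_models = {}

def load_on_npu(backend, model):
    """Loading via one NPU backend evicts all NPU models of the other backends."""
    if backend in NPU_EXCLUSIVE_BACKENDS:
        for other in NPU_EXCLUSIVE_BACKENDS - {backend}:
            npu_models.pop(other, None)  # evict the other backends' NPU models
    npu_models.setdefault(backend, []).append(model)
```

For example, after loading an LLM and an ASR model via `flm`, loading an LLM via `ryzenai-llm` would leave only the `ryzenai-llm` model on the NPU.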
## Eviction Policy
When a model slot is full:

1. The least recently used model of that type is evicted.
2. The new model is loaded.
3. If loading fails (except for file-not-found errors), all models are evicted and the load is retried.
Models currently processing inference requests cannot be evicted until they finish.
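The interaction between LRU order and the busy-model rule can be sketched like this. The cache structure and `busy` flag are illustrative assumptions, not Lemonade's actual bookkeeping:

```python
from collections import OrderedDict

# Single-type cache sketch: name -> {"busy": whether the model is currently
# serving an inference request} (hypothetical structure, for illustration).
cache = OrderedDict()

def evict_lru(cache):
    """Evict the least recently used model that is not mid-inference."""
    for name, state in list(cache.items()):  # oldest (least recently used) first
        if not state["busy"]:
            del cache[name]
            return name
    return None  # every loaded model is busy; eviction must wait
```

A model that is busy serving a request is skipped even if it is the oldest entry; if every model is busy, nothing can be evicted until one finishes.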
## Per-Model Settings
Each model can be loaded with custom settings (context size, llamacpp backend, llamacpp args) via the `/api/v1/load` endpoint. These per-model settings override the default values set via CLI arguments or environment variables. See the `/api/v1/load` endpoint documentation for details.
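A request to the load endpoint might be constructed as below. The base URL, port, and payload field names (`model_name`, `ctx_size`) are assumptions for illustration; check the `/api/v1/load` endpoint documentation for the authoritative schema:

```python
import json
import urllib.request

def build_load_request(base_url, model, **overrides):
    """Build a POST request for /api/v1/load with per-model overrides.
    Field names here are illustrative, not the authoritative schema."""
    payload = {"model_name": model, **overrides}
    return urllib.request.Request(
        f"{base_url}/api/v1/load",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Example (assumed local address): override the context size for one model.
req = build_load_request("http://localhost:8000", "my-model", ctx_size=4096)
```

Sending the request with `urllib.request.urlopen(req)` would then load the model with the supplied overrides taking priority over server-wide defaults.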
Setting Priority Order:
1. Values passed explicitly in the `/api/v1/load` request (highest priority)
2. Values from environment variables or server startup arguments (see Server Configuration)
3. Hardcoded defaults in `lemond` (lowest priority)
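The three-level priority order above can be sketched as a simple resolver. The environment variable name and default value here are hypothetical, used only to show the precedence logic:

```python
import os

HARDCODED_DEFAULTS = {"ctx_size": 2048}  # illustrative default, not lemond's actual value

def resolve_setting(name, request_value=None, env_var=None):
    """Resolve a per-model setting: the explicit request value wins,
    then the environment variable, then the hardcoded default."""
    if request_value is not None:
        return request_value                 # 1. explicit /api/v1/load value
    if env_var and os.environ.get(env_var) is not None:
        return int(os.environ[env_var])      # 2. environment / startup value
    return HARDCODED_DEFAULTS[name]          # 3. hardcoded default
```

With this logic, a context size passed in the load request always wins, even when an environment variable is also set.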