# 🍋 Lemonade API: Model Compatibility and Recipes
The Lemonade API (`lemonade.api`) provides a simple, high-level interface for loading and running LLMs locally. This guide explains which models work with which recipes, what to expect in terms of compatibility, and how to choose the right setup for your hardware.
## 🧠 What Is a Recipe?
A recipe defines how a model is run, including the backend (e.g., PyTorch, ONNX Runtime), quantization strategy, and device support. The `from_pretrained()` function in `lemonade.api` uses the recipe to configure everything automatically. For the list of recipes, see the Recipe and Checkpoint Compatibility table below. The following is an example of using the Lemonade API `from_pretrained()` function:

```python
from lemonade.api import from_pretrained

model, tokenizer = from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", recipe="hf-cpu")
```

Function arguments:

- `checkpoint`: The Hugging Face or OGA checkpoint that defines the LLM.
- `recipe`: Defines the implementation and hardware used for the LLM. Default is `"hf-cpu"`.
## 📜 Supported Model Formats
Lemonade API currently supports:
- Hugging Face hosted safetensors checkpoints
- AMD OGA (ONNXRuntime-GenAI) ONNX checkpoints
## 🍴 Recipe and Checkpoint Compatibility
The following table explains what checkpoints work with each recipe, the hardware and OS requirements, and additional notes:
| Recipe | Checkpoint Format | Hardware Needed | Operating System | Notes |
|---|---|---|---|---|
| hf-cpu | safetensors (Hugging Face) | Any x86 CPU | Windows, Linux | Compatible with x86 CPUs, offering broad accessibility. |
| hf-dgpu | safetensors (Hugging Face) | Compatible discrete GPU | Windows, Linux | Requires PyTorch and a compatible GPU.[1] |
| oga-cpu | safetensors (Hugging Face) | Any x86 CPU | Windows | Converted from safetensors via `model_builder`. Accuracy loss due to RTN quantization. |
| oga-cpu | OGA ONNX | Any x86 CPU | Windows | Use models from the CPU Collection. |
| oga-igpu | OGA ONNX | AMD Ryzen AI PC | Windows | Use models from the GPU Collection. |
| oga-hybrid | Pre-quantized OGA ONNX | AMD Ryzen AI 300 series PC | Windows | Use models from the Hybrid Collection. Optimized with AWQ to INT4. |
| oga-npu | Pre-quantized OGA ONNX | AMD Ryzen AI 300 series PC | Windows | Use models from the NPU Collection. Optimized with AWQ to INT4. |
[1] Compatible GPUs are those that support PyTorch's `.to("cuda")` function. Ensure you have the appropriate version of PyTorch installed (e.g., CUDA or ROCm) for your specific GPU. Note: Lemonade does not install PyTorch with CUDA or ROCm for you. For installation instructions, see PyTorch's Get Started Guide.
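Since `hf-dgpu` only works when PyTorch can reach a GPU, one practical pattern is to pick the recipe at runtime. The sketch below is an illustration only; `pick_hf_recipe` is a hypothetical helper, not part of `lemonade.api`:

```python
def pick_hf_recipe(cuda_available: bool) -> str:
    # hf-dgpu requires a GPU that PyTorch can target with .to("cuda");
    # otherwise fall back to the universally compatible hf-cpu recipe.
    return "hf-dgpu" if cuda_available else "hf-cpu"

# In practice, query PyTorch directly (assumes torch is installed):
#   import torch
#   recipe = pick_hf_recipe(torch.cuda.is_available())
```

The selected string can then be passed as the `recipe` argument to `from_pretrained()`.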
## 🔄 Converting Models to OGA
The Lemonade API will do the conversion for you using OGA's `model_builder` if you pass a safetensors checkpoint.
- Takes ~1–5 minutes per model.
- Uses RTN quantization (int4).
- For better quality, use pre-quantized models (see below).
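The conversion rule above can be sketched as follows (`needs_oga_conversion` is a hypothetical helper for illustration, not part of `lemonade.api`):

```python
def needs_oga_conversion(checkpoint_format: str, recipe: str) -> bool:
    # oga-* recipes execute ONNX models, so a safetensors checkpoint must
    # first be converted by OGA's model_builder (with RTN int4 quantization);
    # checkpoints that are already OGA ONNX load directly.
    return recipe.startswith("oga-") and checkpoint_format == "safetensors"
```

For example, `needs_oga_conversion("safetensors", "oga-cpu")` is `True`, while a pre-quantized OGA ONNX checkpoint loads without any conversion step.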
## 📦 Pre-Converted OGA Models
You can skip the conversion step by using pre-quantized models from AMD’s Hugging Face collections. These models are optimized using Activation-aware Weight Quantization (AWQ), which provides higher-accuracy int4 quantization than RTN.
| Recipe | Collection |
|---|---|
| oga-hybrid | Hybrid Collection |
| oga-npu | NPU Collection |
| oga-cpu | CPU Collection |