🍋 Lemonade API: Model Compatibility and Recipes

Lemonade API (lemonade.api) provides a simple, high-level interface for loading and running LLMs locally. This guide explains which models work with which recipes, what to expect in terms of compatibility, and how to choose the right setup for your hardware.

🧠 What Is a Recipe?

A recipe defines how a model is run — including backend (e.g., PyTorch, ONNX Runtime), quantization strategy, and device support. The from_pretrained() function in lemonade.api uses the recipe to configure everything automatically. For the list of recipes, see Recipe Compatibility Table. The following is an example of using the Lemonade API from_pretrained() function:

from lemonade.api import from_pretrained

model, tokenizer = from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", recipe="hf-cpu")

Function Arguments:

  • checkpoint: The Hugging Face or OGA checkpoint that defines the LLM.
  • recipe: Defines the implementation and hardware used for the LLM. Default is "hf-cpu".
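Once loaded, the returned model and tokenizer can be used for text generation. A minimal sketch, assuming the returned objects follow the usual Hugging Face interface (tokenizer with `return_tensors="pt"`, model with `generate()`); exact output depends on the model downloaded:

```python
from lemonade.api import from_pretrained

model, tokenizer = from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", recipe="hf-cpu")

# Tokenize a prompt, generate a short completion, and decode it
input_ids = tokenizer("This is my prompt", return_tensors="pt").input_ids
response = model.generate(input_ids, max_new_tokens=30)
print(tokenizer.decode(response[0]))
```

Note that the first call downloads the checkpoint from Hugging Face, so it requires network access and some disk space.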

📜 Supported Model Formats

Lemonade API currently supports:

  • Hugging Face hosted safetensors checkpoints
  • AMD OGA (ONNX Runtime GenAI) ONNX checkpoints

🍴 Recipe and Checkpoint Compatibility

The following table explains what checkpoints work with each recipe, the hardware and OS requirements, and additional notes:

| Recipe | Checkpoint Format | Hardware Needed | Operating System | Notes |
|--------|-------------------|-----------------|------------------|-------|
| hf-cpu | safetensors (Hugging Face) | Any x86 CPU | Windows, Linux | Compatible with x86 CPUs, offering broad accessibility. |
| hf-dgpu | safetensors (Hugging Face) | Compatible discrete GPU | Windows, Linux | Requires PyTorch and a compatible GPU.[1] |
| oga-cpu | safetensors (Hugging Face) | Any x86 CPU | Windows | Converted from safetensors via `model_builder`. Some accuracy loss due to RTN quantization. |
| oga-cpu | OGA ONNX | Any x86 CPU | Windows | Use models from the CPU Collection. |
| oga-igpu | OGA ONNX | AMD Ryzen AI PC | Windows | Use models from the GPU Collection. |
| oga-hybrid | Pre-quantized OGA ONNX | AMD Ryzen AI 300-series PC | Windows | Use models from the Hybrid Collection. Optimized with AWQ to INT4. |
| oga-npu | Pre-quantized OGA ONNX | AMD Ryzen AI 300-series PC | Windows | Use models from the NPU Collection. Optimized with AWQ to INT4. |

[1] Compatible GPUs are those that support PyTorch's .to("cuda") function. Ensure you have the appropriate version of PyTorch installed (e.g., CUDA or ROCm) for your specific GPU. Note: Lemonade does not install PyTorch with CUDA or ROCm for you. For installation instructions, see PyTorch's Get Started Guide.
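The decision logic in the table above can be summarized as a small helper. This is a hypothetical sketch for illustration only; `choose_recipe` is not part of lemonade.api, but the recipe strings it returns are the documented ones:

```python
def choose_recipe(os_name: str, has_ryzen_ai_300: bool = False,
                  has_dgpu: bool = False) -> str:
    """Suggest a Lemonade recipe string based on the compatibility table.

    Hypothetical helper for illustration; only the returned recipe
    strings ("hf-cpu", "hf-dgpu", "oga-cpu", "oga-hybrid") come from
    the documentation above.
    """
    if has_ryzen_ai_300 and os_name == "windows":
        return "oga-hybrid"   # Ryzen AI 300 series; oga-npu also applies
    if has_dgpu:
        return "hf-dgpu"      # needs a PyTorch-compatible discrete GPU
    if os_name == "windows":
        return "oga-cpu"      # ONNX Runtime GenAI on any x86 CPU
    return "hf-cpu"           # portable fallback (Windows, Linux)

print(choose_recipe("linux"))                           # → hf-cpu
print(choose_recipe("windows", has_ryzen_ai_300=True))  # → oga-hybrid
```

On Ryzen AI 300-series hardware both oga-hybrid and oga-npu are valid; the helper picks oga-hybrid as a default, but check the collection tables for the model you want.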

🔄 Converting Models to OGA

If you pass a safetensors checkpoint to an OGA recipe, Lemonade API performs the conversion for you using OGA's `model_builder`.

  • Takes ~1–5 minutes per model.
  • Uses RTN quantization (int4).
  • For better quality, use pre-quantized models (see below).
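To build intuition for why RTN costs accuracy, here is a toy sketch of round-to-nearest int4 quantization on a single weight group. This is illustration only, not `model_builder`'s actual implementation (which uses per-group scales, zero points, and bit packing):

```python
def rtn_quantize_int4(weights: list[float]) -> tuple[list[int], float]:
    """Toy round-to-nearest (RTN) int4 quantization of one weight group."""
    # One scale per group, chosen so the largest weight maps to +/-7
    scale = max(abs(w) for w in weights) / 7.0 or 1.0
    # Round each weight to the nearest int4 step and clip to [-8, 7]
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]

w = [0.12, -0.53, 0.98, -0.07]
q, s = rtn_quantize_int4(w)
w_hat = dequantize(q, s)
# w_hat only approximates w: every weight snaps to one of 16 levels,
# which is the accuracy loss that AWQ-based pre-quantized models reduce.
```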

📦 Pre-Converted OGA Models

You can skip the conversion step by using pre-quantized models from AMD's Hugging Face collections. These models are optimized using Activation-aware Weight Quantization (AWQ), which provides higher-accuracy int4 quantization than RTN.

| Recipe | Collection |
|--------|------------|
| oga-hybrid | Hybrid Collection |
| oga-npu | NPU Collection |
| oga-cpu | CPU Collection |
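Loading a pre-converted model is the same `from_pretrained()` call with the matching recipe. The checkpoint name below is a placeholder, not a real model ID; substitute one from the relevant collection:

```python
from lemonade.api import from_pretrained

# Placeholder checkpoint -- pick a real model ID from AMD's
# Hybrid Collection on Hugging Face before running this.
model, tokenizer = from_pretrained(
    "amd/<model-from-hybrid-collection>",
    recipe="oga-hybrid",
)
```

Because the checkpoint is already AWQ-quantized ONNX, no `model_builder` conversion runs at load time.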

📚 Additional Resources