Lemonade is an open-source SDK that provides high-level APIs, CLI tools, and a server interface to deploy and benchmark LLMs using ONNX Runtime GenAI (OGA), Hugging Face Transformers, and llama.cpp backends.
Lemonade Server is a component of the SDK that enables local LLM deployment via an OpenAI-compatible API. It allows integration with apps like chatbots and coding assistants without requiring code changes. It’s available as a standalone Windows GUI installer or via command line for Linux.
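Because the server speaks the OpenAI protocol, any OpenAI-compatible client can talk to it once it is running. The sketch below uses the openai Python package; the port (8000) matches the endpoint shown elsewhere in this FAQ, while the base path and model name are assumptions to adjust for your installation.

```python
# Minimal sketch: chat with a locally running Lemonade Server through its
# OpenAI-compatible API. The base URL and model name are assumptions --
# substitute whatever your installation uses.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/api/v1",  # local Lemonade Server endpoint
    api_key="lemonade",  # any non-empty string; no real key is needed locally
)

response = client.chat.completions.create(
    model="Llama-3.2-1B-Instruct-Hybrid",  # placeholder: use a model installed on your system
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```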
Visit https://lemonade-server.ai/install_options.html and click the options that apply to you.
For more information on Hybrid/NPU Support, see the section Hybrid/NPU.
Yes! To install Lemonade on Linux, visit https://lemonade-server.ai/ and check the “Developer Setup” section for installation instructions. Visit the Supported Configurations section to see the support matrix for CPU, GPU, and NPU.
To completely uninstall Lemonade Server from your system, follow these steps:

Step 1: Remove cached files
- Navigate to %USERPROFILE%\.cache
- Delete the lemonade folder if it exists
- Delete the huggingface folder

Step 2: Remove from PATH environment variable
- Press Win + I to open Windows Settings and edit the PATH environment variable
- Find the entry containing lemonade_server\bin and select it, then remove it

Step 3: Delete installation folder
- Navigate to %LOCALAPPDATA%
- Delete the lemonade_server folder (%LOCALAPPDATA%\lemonade_server)
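If you prefer to script the file cleanup, a small Python sketch along these lines covers Step 1 and Step 3 (the paths are the ones listed above; the PATH edit in Step 2 remains manual, and deleting the huggingface folder removes all downloaded Hugging Face models):

```python
# Sketch: remove Lemonade Server's cached files and installation folder
# (Steps 1 and 3 above). Step 2 (PATH cleanup) is still done by hand.
import os
import shutil
from pathlib import Path

def remove_dir(path: Path) -> None:
    """Delete a directory tree if it exists, reporting what happened."""
    if path.exists():
        shutil.rmtree(path)
        print(f"Removed {path}")
    else:
        print(f"Not found (skipped): {path}")

cache = Path(os.environ["USERPROFILE"]) / ".cache"
remove_dir(cache / "lemonade")      # Step 1: Lemonade cache
remove_dir(cache / "huggingface")   # Step 1: Hugging Face cache (downloaded models)
remove_dir(Path(os.environ["LOCALAPPDATA"]) / "lemonade_server")  # Step 3: install folder
```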
Lemonade supports a wide range of LLMs including LLaMA, DeepSeek, Qwen, Gemma, Phi, and gpt-oss. Most GGUF models can also be added to Lemonade Server through the Model Manager interface.
Model compatibility depends on your system’s RAM, VRAM, and NPU availability. The actual file size varies significantly between models due to different quantization techniques and architectures.
To check if a model will work, see whether it appears in the list of models offered for your system (e.g., amd/Qwen2.5-7B-Chat-awq-g128-int4-asym-fp16-onnx-hybrid). If a model isn't listed, it may not yet be validated or compatible with your selected backend (for example, Hybrid models will not show if Ryzen AI Hybrid software is not installed). You can add most GGUF models yourself through the Model Manager, or open a feature request on GitHub.
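A quick programmatic way to see what your server currently offers is to query its model list; this sketch assumes the standard OpenAI-style model-listing route under the same host, port, and base path used elsewhere in this FAQ:

```python
# Sketch: list the models the local Lemonade Server currently exposes.
# Assumes the OpenAI-style /models route under http://localhost:8000/api/v1.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8000/api/v1/models") as resp:
    models = json.load(resp)

for entry in models.get("data", []):
    print(entry.get("id"))
```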
Yes, there’s a guide on preparing your models for Ryzen AI NPU:
You can measure:
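For example, a rough client-side measurement of time to first token and decode throughput can be taken by streaming a chat completion; in the sketch below the port, base path, and model name are assumptions to adjust for your setup, and streamed chunks are treated as an approximation of tokens.

```python
# Sketch: measure time-to-first-token and decode rate from the client side
# by streaming a chat completion through the OpenAI-compatible API.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/api/v1", api_key="lemonade")

start = time.perf_counter()
first_token_at = None
num_chunks = 0

stream = client.chat.completions.create(
    model="Llama-3.2-1B-Instruct-Hybrid",  # placeholder: use a model installed on your system
    messages=[{"role": "user", "content": "Write a haiku about lemons."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        num_chunks += 1
end = time.perf_counter()

if first_token_at is not None:
    print(f"Time to first token: {first_token_at - start:.2f} s")
    decode_time = end - first_token_at
    if num_chunks > 1 and decode_time > 0:
        # Each streamed chunk is roughly one token, so this approximates tokens/s.
        print(f"Decode rate: ~{(num_chunks - 1) / decode_time:.1f} chunks/s")
```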
Yes! Lemonade Server exposes a /stats endpoint that returns performance metrics from the most recent completion request:
curl http://localhost:8000/api/v1/stats
Or, you can launch lemonade-server with the option --log-level debug, and that will also print out stats.
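From code, the same stats can be fetched as JSON; since the exact field names can vary by server version, the sketch below simply prints whatever is returned:

```python
# Sketch: fetch the latest performance stats from Lemonade Server.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8000/api/v1/stats") as resp:
    stats = json.load(resp)

# Field names depend on the server version, so print everything returned.
print(json.dumps(stats, indent=2))
```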
Lemonade supports llama.cpp as a backend, so performance is comparable to running llama.cpp directly when using the same model and quantization.
Yes, hybrid inference is currently supported only on Windows. NPU-only inference is coming to Linux soon, followed by hybrid (NPU+iGPU) support via ROCm.
Yes. In hybrid mode:
Yes! Lemonade supports multiple execution modes across CPU, GPU, and NPU (see the Supported Configurations section).
While you won’t get NPU acceleration on non-Ryzen AI 300 systems, you can still benefit from GPU acceleration and the OpenAI-compatible API.
AMD publishes pre-quantized and optimized models in their Hugging Face collections:
To find the architecture of a specific model, click on any model in these collections and look for the “Base model” field, which will show you the underlying architecture (e.g., Llama, Qwen, Phi).
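If you prefer to check programmatically, the architecture is recorded in the model's config.json when the repo ships one in standard Hugging Face format (ONNX-only repos may store this differently); a sketch using the huggingface_hub package:

```python
# Sketch: look up a model's underlying architecture from its config.json.
# Assumes the repo includes a standard Hugging Face config.json; ONNX-only
# repos may keep this information in a different file.
import json
from huggingface_hub import hf_hub_download

repo_id = "amd/Qwen2.5-7B-Chat-awq-g128-int4-asym-fp16-onnx-hybrid"  # example from above
config_path = hf_hub_download(repo_id=repo_id, filename="config.json")
with open(config_path) as f:
    config = json.load(f)

print("model_type:", config.get("model_type"))
print("architectures:", config.get("architectures"))
```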
Make sure that you’ve put the NPU in “Turbo” mode to get the best results. This is done by opening a terminal window and running the following commands:
cd C:\Windows\System32\AMD
.\xrt-smi configure --pmode turbo
Check the Lemonade Server logs via the tray icon. Common issues include model compatibility or outdated versions.
Open a feature request on GitHub. We’re actively shaping the roadmap based on user feedback.
Yes! We tag roadmap items on GitHub with the “on roadmap” label.