Lemonade integrates vLLM as an experimental backend for AMD ROCm GPUs on Linux.
Status: experimental. The backend has been validated on gfx1151 (Strix Halo) and gfx1150 (Strix Point). Prebuilt wheels also exist for gfx110X (RDNA3) and gfx120X (RDNA4), but those targets have not been exercised end-to-end yet.
vLLM on AMD ROCm requires a kernel that exports the CWSR sysfs properties and an amdgpu setup that does not shadow the built-in driver. The Kernel Update Required page is the canonical reference: it covers both prerequisites with verification commands and fixes, and the same requirements apply to llamacpp:rocm and sd-cpp:rocm-*. Lemonade blocks installation of vllm:rocm on systems missing the kernel fix and points users at that page.
Install the backend with:

lemonade backends install vllm:rocm
Or via HTTP:
curl -X POST http://localhost:13305/api/v1/install \
-H 'Content-Type: application/json' \
-d '{"recipe": "vllm", "backend": "rocm"}'
The install fetches a per-GPU-target release (e.g. …-gfx1151, …-gfx1150) from lemonade-sdk/vllm-rocm. The base version is pinned in backend_versions.json; the -{gfx_target} suffix is appended at runtime from SystemInfo::get_rocm_arch(), so a single pin covers all supported architectures.
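The pin-plus-suffix scheme can be sketched as follows. Note that the names below (BASE_VERSION, get_rocm_arch, release_tag) are illustrative stand-ins, not Lemonade's actual API, and the version string is hypothetical:

```python
# Sketch: a single pinned base version expands into a per-GPU release tag.
BASE_VERSION = "v0.9.1"  # pinned in backend_versions.json (value is hypothetical)

def get_rocm_arch() -> str:
    """Stand-in for SystemInfo::get_rocm_arch(); returns the GPU's gfx target."""
    return "gfx1151"  # e.g. Strix Halo

def release_tag(base: str, arch: str) -> str:
    # The -{gfx_target} suffix is appended at runtime, so one pin
    # covers every supported architecture.
    return f"{base}-{arch}"

print(release_tag(BASE_VERSION, get_rocm_arch()))  # -> v0.9.1-gfx1151
```

The same tag format covers gfx1150, gfx110X, and gfx120X wheels without touching the pin.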
Models registered with the vllm recipe in server_models.json load automatically on first request. To register your own:
lemonade pull user.MyModel \
--checkpoint Qwen/Qwen3-4B \
--recipe vllm
Standard OpenAI-compatible endpoints (/v1/chat/completions, /v1/completions) work as usual; Lemonade forwards these requests to the vLLM child process. The engine also exposes its own endpoints (e.g. /metrics, /version) on a backend-only port, which is reported in the backend_url field of the GET /v1/health response. These are useful for observability but are not proxied through Lemonade.
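A minimal sketch of reaching those backend-only endpoints: read backend_url from the health response, then query vLLM's /metrics directly. The exact health path prefix and JSON shape are assumptions based on the description above; fetch_metrics requires a running server, so it is not called at import time.

```python
import json
import urllib.request

LEMONADE = "http://localhost:13305"  # base URL from the install example above

def metrics_url(health: dict) -> str:
    """Build the URL of vLLM's own /metrics endpoint from the health payload."""
    # backend_url points at the backend-only port, not proxied by Lemonade.
    return health["backend_url"].rstrip("/") + "/metrics"

def fetch_metrics() -> str:
    # Path assumed to follow the /api/v1/* prefix used by the install endpoint.
    with urllib.request.urlopen(f"{LEMONADE}/api/v1/health") as resp:
        health = json.load(resp)
    with urllib.request.urlopen(metrics_url(health)) as resp:
        return resp.read().decode()
```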
Free-form CLI args can be appended to vllm-server via vllm_args:
# Allow more concurrent sequences and turn on prefix caching
lemonade config set vllm_args="--max-num-seqs 128 --enable-prefix-caching"
Slow FP8 loads. Qwen/Qwen3-4B-FP8 took ~12 minutes in our test, exceeding Lemonade's default wait_for_ready timeout. The engine selects TritonFp8BlockScaledMMKernel and emits "Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal." warnings, meaning no AMD-tuned kernel configs are shipped for this GPU's exact shapes, so vLLM autotunes from defaults. FP16 is the most polished path today; FP8 should improve once AMD ships tuned configs.

huggingface-hub shadowing. Lemonade launches vllm-server with PYTHONNOUSERSITE=1 so the bundled huggingface_hub is used. If a module-not-found error still appears, ensure ~/.local/lib/python3.12/site-packages/huggingface_hub isn't being injected via PYTHONPATH.

Orphaned engine processes. EngineCore may continue running in the background and hold VRAM until killed. Look for a VLLM::EngineCor process and kill -9 it before retrying.
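A small sketch of the cleanup step, assuming a POSIX system with procps installed (the kernel truncates process names to 15 characters, hence the "VLLM::EngineCor" spelling):

```shell
# List any stale vLLM engine process still holding VRAM.
pgrep -af 'VLLM::EngineCor' || echo "no stale EngineCore processes"

# If one is listed, kill it by PID before retrying, e.g.:
# kill -9 <pid>
```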