vLLM Backend Options
Lemonade integrates vLLM as an experimental backend for AMD ROCm GPUs on Linux. vLLM brings two core benefits:
- Day-0 model support. vLLM typically supports new transformer architectures within hours of their release on Hugging Face — checkpoints load directly, with no per-architecture porting.
- Concurrency and multi-GPU. Paged-attention KV cache, continuous batching, and chunked prefill scale aggregate throughput with in-flight request count; tensor and pipeline parallelism are supported across multiple GPUs.
Status: experimental. The backend has been validated on gfx1151 (Strix Halo) and gfx1150 (Strix Point). Prebuilt wheels also exist for gfx110X (RDNA3) and gfx120X (RDNA4), but those targets have not been exercised end-to-end yet.
Available Backend
ROCm
- Platform: Linux only
- Hardware: validated on gfx1151 (Strix Halo) and gfx1150 (Strix Point); prebuilt wheels also exist for gfx110X (RDNA3) and gfx120X (RDNA4)
- Bundle: a self-contained tarball from lemonade-sdk/vllm-rocm with a relocatable Python interpreter, PyTorch (ROCm), the ROCm user-space libs, Triton, and vLLM. No system Python / PyTorch / ROCm install is required on the host.
Prerequisites
vLLM on AMD ROCm requires a kernel that exports the CWSR sysfs properties and an amdgpu setup that doesn't shadow the built-in driver. Both are covered with verification commands and fixes on the Kernel Update Required page — that's the canonical reference; the same prerequisites apply to llamacpp:rocm and sd-cpp:rocm-*. Lemonade blocks install of vllm:rocm on systems missing the kernel fix and points users at that page.
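As a quick sanity check, both conditions can be probed from a shell. This is a minimal sketch; exact paths may vary by distribution, and the Kernel Update Required page remains the authoritative reference:
# CWSR must be enabled in the amdgpu kernel module (expect "1")
cat /sys/module/amdgpu/parameters/cwsr_enable
# The loaded amdgpu module should come from the running kernel's tree;
# a path under updates/dkms would suggest an out-of-tree build shadowing it
modinfo -F filename amdgpu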
Install
lemonade backends install vllm:rocm
Or via HTTP:
curl -X POST http://localhost:13305/api/v1/install \
-H 'Content-Type: application/json' \
-d '{"recipe": "vllm", "backend": "rocm"}'
The install fetches a per-GPU-target release (e.g. …-gfx1151, …-gfx1150) from lemonade-sdk/vllm-rocm. The base version is pinned in backend_versions.json; the -{gfx_target} suffix is appended at runtime from SystemInfo::get_rocm_arch(), so a single pin covers all supported architectures.
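Conceptually, the resolution looks like this (a sketch with placeholder values; the real pin lives in backend_versions.json):
# backend_versions.json pins the base release, e.g. "vX.Y.Z" (placeholder)
# SystemInfo::get_rocm_arch() returns the detected target, e.g. "gfx1151"
# Lemonade then fetches the matching asset from lemonade-sdk/vllm-rocm:
#   release tag: vX.Y.Z-gfx1151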
Use
Models registered with the vllm recipe in server_models.json load automatically on first request. To register your own:
lemonade pull user.MyModel \
  --checkpoint Qwen/Qwen3-4B \
  --recipe vllm
Standard OpenAI-compatible endpoints (/v1/chat/completions, /v1/completions) work as usual; Lemonade forwards the requests to the vLLM child process. The engine also exposes its own private endpoints (e.g. /metrics, /version) on a backend-only port, reported in the backend_url field of GET /v1/health — useful for observability, but not proxied through Lemonade.
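For example, a sketch assuming the port and /api/v1 prefix from the install example above, the user.MyModel registration, and jq for JSON parsing:
# Chat through Lemonade's OpenAI-compatible endpoint
curl http://localhost:13305/api/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "user.MyModel", "messages": [{"role": "user", "content": "Hello"}]}'
# Discover the backend-only port, then query vLLM's own endpoints directly
BACKEND_URL=$(curl -s http://localhost:13305/api/v1/health | jq -r '.backend_url')
curl "$BACKEND_URL/metrics"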
Tuning
Free-form CLI args can be appended to vllm-server via vllm_args:
# Allow more concurrent sequences and turn on prefix caching
lemonade config set vllm_args="--max-num-seqs 128 --enable-prefix-caching"
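The same mechanism can carry the multi-GPU flags mentioned at the top of this page; a sketch, assuming two visible GPUs and vLLM's standard --tensor-parallel-size flag:
# Shard the model across two GPUs via tensor parallelism
lemonade config set vllm_args="--tensor-parallel-size 2"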
Known gotchas
- Cold first-load JIT. Loading a model with new shapes triggers Triton kernel compilation. Expect 20 s to several minutes the first time you hit a given model+shape; subsequent loads of the same shape are faster because compiled kernels are cached to disk.
- FP8 first-load is slow on gfx1151. Cold-loading Qwen/Qwen3-4B-FP8 took ~12 minutes in our test, exceeding Lemonade's default wait_for_ready timeout. The engine selects TritonFp8BlockScaledMMKernel and emits "Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal." warnings — i.e. no AMD-tuned kernel configs are shipped for this GPU's exact shapes, so vLLM autotunes from defaults. FP16 is the most polished path today; FP8 should improve once AMD ships tuned configs.
- huggingface-hub shadowing. Lemonade launches vllm-server with PYTHONNOUSERSITE=1 so the bundled huggingface_hub is used. If a module-not-found error still appears, ensure ~/.local/lib/python3.12/site-packages/huggingface_hub isn't being injected via PYTHONPATH.
- Long load times can leave orphaned processes if interrupted. If a load times out at the Lemonade level, vLLM's child EngineCore may continue running in the background and hold VRAM until killed. Look for a VLLM::EngineCore process and kill -9 it before retrying (see the commands below).
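A minimal cleanup sketch, assuming the process title matches VLLM::EngineCore as described above:
# List any leftover vLLM engine processes still holding VRAM
pgrep -af 'VLLM::EngineCore'
# Force-kill them before retrying the load
pkill -9 -f 'VLLM::EngineCore'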