vLLM Backend Options

Lemonade integrates vLLM as an experimental backend for AMD ROCm GPUs on Linux. vLLM brings two core benefits:

  1. Day-0 model support. vLLM typically supports new transformer architectures within hours of their release on Hugging Face — checkpoints load directly, with no per-architecture porting.
  2. Concurrency and multi-GPU. Paged-attention KV cache, continuous batching, and chunked prefill scale aggregate throughput with in-flight request count; tensor and pipeline parallelism are supported across multiple GPUs.
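The throughput claim in point 2 can be illustrated with a toy model (this is a back-of-the-envelope sketch, not vLLM internals): because decode steps are largely memory-bound, advancing a whole batch costs roughly the same as advancing one sequence, so serving N requests together takes about max(output lengths) steps instead of their sum.

```python
# Toy illustration of why continuous batching raises aggregate
# throughput. Assumption (simplified): one decode step costs the same
# whether it advances 1 sequence or a full batch.

def sequential_steps(lengths):
    # One request at a time: total steps = sum of output lengths.
    return sum(lengths)

def batched_steps(lengths):
    # Continuous batching: sequences decode in lockstep; finished
    # sequences free their paged KV-cache slots for new arrivals.
    return max(lengths)

requests = [128, 256, 512, 64]   # output lengths in tokens
print(sequential_steps(requests))  # 960 decode steps
print(batched_steps(requests))     # 512 decode steps
```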

Status: experimental. The backend has been validated on gfx1151 (Strix Halo) and gfx1150 (Strix Point). Prebuilt wheels also exist for gfx110X (RDNA3) and gfx120X (RDNA4) but those targets have not been exercised end-to-end yet.

Available Backend

ROCm

Prerequisites

vLLM on AMD ROCm requires a kernel that exports the CWSR sysfs properties and an amdgpu setup that does not shadow the built-in driver. The Kernel Update Required page is the canonical reference: it covers both requirements with verification commands and fixes, and the same prerequisites apply to llamacpp:rocm and sd-cpp:rocm-*. Lemonade blocks installation of vllm:rocm on systems missing the kernel fix and directs users to that page.

Install

lemonade backends install vllm:rocm

Or via HTTP:

curl -X POST http://localhost:13305/api/v1/install \
  -H 'Content-Type: application/json' \
  -d '{"recipe": "vllm", "backend": "rocm"}'

The install fetches a per-GPU-target release (e.g. …-gfx1151, …-gfx1150) from lemonade-sdk/vllm-rocm. The base version is pinned in backend_versions.json; the -{gfx_target} suffix is appended at runtime from SystemInfo::get_rocm_arch(), so a single pin covers all supported architectures.
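The version resolution described above can be sketched as follows; the function name and the sample pinned version are illustrative, not Lemonade's actual code:

```python
def resolve_release_tag(pinned_version: str, gfx_target: str) -> str:
    """Compose the per-GPU-target release tag: base pin + '-{gfx_target}'.

    pinned_version stands in for the value from backend_versions.json;
    gfx_target stands in for SystemInfo::get_rocm_arch() at runtime.
    """
    return f"{pinned_version}-{gfx_target}"

# A single pin covers every supported architecture:
for arch in ("gfx1151", "gfx1150", "gfx110X", "gfx120X"):
    print(resolve_release_tag("v0.1.0", arch))  # "v0.1.0" is a placeholder
```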

Use

Models registered with the vllm recipe in server_models.json load automatically on first request. To register your own:

lemonade pull user.MyModel \
  --checkpoint Qwen/Qwen3-4B \
  --recipe vllm
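A registered model can then be queried through the standard chat endpoint. A minimal standard-library sketch (the model name matches the registration above; a Lemonade server must be running locally for the request to succeed):

```python
import json
import urllib.request

def build_chat_body(model: str, prompt: str) -> bytes:
    """Build an OpenAI-style chat completion request body."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()

def chat(prompt: str, model: str = "user.MyModel",
         base: str = "http://localhost:13305") -> str:
    req = urllib.request.Request(
        f"{base}/v1/chat/completions",
        data=build_chat_body(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:  # requires a running server
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```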

Standard OpenAI-compatible endpoints (/v1/chat/completions, /v1/completions) work as usual: Lemonade forwards requests to the vLLM child process. The engine also exposes its own private endpoints (e.g. /metrics, /version) on a backend-only port, surfaced via the backend_url field of GET /v1/health. These are useful for observability but are not proxied through Lemonade.
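Reaching those private endpoints means reading backend_url from the health response and appending the engine path. A sketch, assuming only the backend_url field (the rest of the health payload's shape, and the port in the example, are illustrative):

```python
import json
import urllib.request

def metrics_url(health: dict) -> str:
    """Derive the engine-private /metrics URL from a /v1/health payload.

    Only the backend_url field is relied on here; everything else in
    the payload is assumed for illustration.
    """
    return health["backend_url"].rstrip("/") + "/metrics"

def fetch_backend_metrics_url(base: str = "http://localhost:13305") -> str:
    # Requires a running Lemonade server with the vLLM backend loaded.
    with urllib.request.urlopen(f"{base}/v1/health") as resp:
        return metrics_url(json.load(resp))
```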

Tuning

Free-form CLI arguments can be appended to the vllm-server command line via the vllm_args setting:

# Allow more concurrent sequences and turn on prefix caching
lemonade config set vllm_args="--max-num-seqs 128 --enable-prefix-caching"
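The vllm_args value is a single string of extra flags. How Lemonade tokenizes it internally is not documented here, but shell-style splitting illustrates the argv the example above is intended to produce:

```python
import shlex

# The vllm_args string from the example above, split shell-style
# (illustrative; not necessarily Lemonade's exact parsing code).
vllm_args = "--max-num-seqs 128 --enable-prefix-caching"
argv = shlex.split(vllm_args)
print(argv)  # ['--max-num-seqs', '128', '--enable-prefix-caching']
```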

Known gotchas