llama.cpp Backend Options
Lemonade uses llama.cpp as its primary LLM inference backend, supporting multiple hardware acceleration options. This document explains the available backends and how to choose between them.
Available Backends
CPU
- Platform: Windows, Linux, macOS
- Hardware: x86_64 processors (Windows and Linux); ARM64/aarch64 processors (Linux)
- Use Case: Universal fallback, no GPU required
- Performance: Slowest option, suitable for small models or testing
- Installation: Automatically available via upstream llama.cpp releases
Vulkan
- Platform: Windows, Linux
- Hardware: AMD GPUs (iGPU and dGPU), NVIDIA GPUs, Intel GPUs, Qualcomm Adreno and other Vulkan-capable GPUs on ARM64 Linux
- Use Case: Cross-vendor GPU acceleration
- Performance: Good performance across all GPU vendors
- Installation: Automatically available via upstream llama.cpp releases
- Notes: Recommended for most GPU users; on ARM64 Linux (e.g., Qualcomm X Elite), Vulkan is the default backend
ROCm
- Platform: Windows, Linux
- Hardware: AMD Radeon RX 6000/7000 series (RDNA2/RDNA3/RDNA4), AMD Ryzen AI iGPUs (Strix Point/Halo)
- Use Case: AMD GPU-optimized inference
- Performance: Optimized for AMD hardware, may outperform Vulkan on supported GPUs
- Channel Options:
- Stable (default): Custom builds with latest optimizations from lemonade-sdk
- Nightly: Bleeding-edge builds from lemonade-sdk/llamacpp-rocm (experimental)
- Installation: Varies by channel (see below)
CUDA
- Platform: Windows, Linux
- Hardware: NVIDIA GPUs with Compute Capability 7.5+ (Turing, Ampere, Ada, Hopper, Blackwell)
- Use Case: NVIDIA GPU-optimized inference
- Performance: Optimized for NVIDIA hardware, typically outperforms Vulkan on supported GPUs
- Source: Per-architecture builds from lemonade-sdk/llama.cpp
- Binaries: Compute-capability-specific builds (sm_75, sm_80, sm_86, sm_89, sm_90, sm_100, sm_120)
- Runtime: Bundled CUDA runtime libraries (no system-wide CUDA toolkit installation required)
- Notes: On Windows, .7z extraction requires the bsdtar bundled with Windows 11 22H2+. On Linux, the build is shipped as .tar.xz and extracts with the system
tar.
Metal
- Platform: macOS only
- Hardware: Apple Silicon (M1/M2/M3/M4) and Intel Macs with Metal support
- Use Case: macOS GPU acceleration
- Performance: Optimized for Apple Silicon
- Installation: Automatically available via upstream llama.cpp releases
System
- Platform: Linux only
- Hardware: Depends on system-installed llama-server binary
- Use Case: Advanced users with custom llama.cpp builds
- Performance: Depends on build configuration
- Installation: Requires manual installation of
llama-serverin system PATH - Enable: Set
"llamacpp": { "backend": "system" }inconfig.jsonto enable the system backend, or uselemonade config set llamacpp.backend=system - HIP plugin on non-standard paths: When an AMD GPU is present, the system backend needs the GGML HIP plugin (
libggml-hip.so). Lemonade looks for it in the standard system library paths. If your distribution or package manager installs it elsewhere (e.g. NixOS, a custom prefix, or a manual build), setLEMONADE_GGML_HIP_PATHto the full path of the plugin so the backend is reported as available:
export LEMONADE_GGML_HIP_PATH=/opt/rocm/lib/libggml-hip.so
The filename must look like libggml-hip*.so* (versioned sonames such as libggml-hip.so.0 are accepted). This Linux-only variable is used solely to detect plugin availability; it is not forwarded to the GGML loader, so it does not change where llama.cpp actually loads the plugin from.
ROCm Channel Configuration
The ROCm backend supports three channels to balance stability, performance, and access to latest features:
Stable Channel (Default)
{
"rocm_channel": "stable"
}
Nightly Channel
{
"rocm_channel": "nightly"
}
Changing Channels
To switch between channels, update your config.json:
{
"rocm_channel": "stable"
}
Or use the Lemonade CLI:
# Switch to stable channel
lemonade config set rocm_channel=stable
# Switch to stable channel (default)
lemonade config set rocm_channel=stable
# Switch to nightly channel (experimental)
lemonade config set rocm_channel=nightly
After changing channels, you'll need to reinstall the ROCm backend:
lemonade backends install llamacpp:rocm
Pinning to a Specific Version Tag
You can pin llamacpp.rocm_bin to a specific release tag instead of using "builtin" or "latest". Each channel downloads from a different GitHub repository, so you must set the correct channel before setting a specific tag.
| Channel | Repository | Tag format |
|---|---|---|
stable (default) |
lemonade-sdk/llama.cpp | Lemonade-specific build tags |
nightly |
lemonade-sdk/llamacpp-rocm | Nightly tags, e.g. b1260 |
Always set
rocm_channelto the correct channel before settingrocm_binto a specific tag. If the tag does not exist in the current channel's repository, the download will fail with HTTP 404.
Example — pin to a specific nightly build:
# 1. Switch to the nightly channel first
lemonade config set rocm_channel=nightly
# 2. Then pin to the desired nightly tag
lemonade config set llamacpp.rocm_bin=b1260
Choosing the Right Backend
Decision Tree
- Do you have an NVIDIA GPU (Turing or newer)?
- Try CUDA first for best performance
-
Fall back to Vulkan if you encounter issues
-
Do you have an AMD GPU?
- For Radeon RX 6000/7000 or Ryzen AI iGPU:
- Try ROCm first for best performance
- Fall back to Vulkan if you encounter issues
-
For older AMD GPUs (RX 5000 and earlier):
- Use Vulkan (ROCm not supported)
-
Do you have an Intel GPU or older NVIDIA GPU?
-
Use Vulkan
-
Do you have Apple Silicon?
-
Use Metal
-
No GPU or unsupported GPU?
- Use CPU
ROCm Channel Selection
- Use Stable if you:
- Prefer stability over latest features
- Want upstream llama.cpp compatibility
-
Are deploying in production
-
Use Nightly if you:
- Want bleeding-edge experimental features
- Are testing unreleased llama.cpp functionality
- Are comfortable with potential bugs and instability
- Are a developer contributing to lemonade or llama.cpp
Platform Specifics
Linux
- All backends supported (CPU, Vulkan, ROCm, CUDA, System)
- CPU and Vulkan backends support both x86_64 and ARM64 (aarch64) systems; on ARM64, Vulkan is the default
- ROCm requires compatible AMD GPU (see above)
- CUDA requires compatible NVIDIA GPU (see above)
- System backend requires manual llama-server installation
Windows
- Supported: CPU, Vulkan, ROCm, CUDA
- ROCm requires compatible AMD GPU
- CUDA requires compatible NVIDIA GPU and Windows 11 22H2+ (for bundled bsdtar that extracts .7z assets)
- No system backend support
macOS
- Supported: CPU, Metal
- Metal recommended for all Macs with Metal support