This page documents how to set up and maintain self-hosted runners for lemonade-sdk.
A “runner” is a computer that has GitHub’s runner software installed; the software runs a service that makes the machine available to run GitHub Actions. In turn, Actions are defined by Workflows, which specify when the Action should run (manual trigger, CI, CD, etc.) and what the Action does (run tests, build packages, run an experiment, etc.).
You can read about all this here: GitHub: About self-hosted runners.
Workflows target self-hosted runners by the labels the runner carries. We use two kinds of labels:
These describe what a runner can do. A workflow should request only the capability labels it actually needs.
| Label | Meaning | Typical workflows that request it |
|---|---|---|
| `vulkan` | Runner can execute Vulkan GPU workloads | llama.cpp Vulkan backend, whisper.cpp Vulkan backend |
| `rocm` | Runner can execute ROCm GPU workloads | llama.cpp ROCm backend, stable-diffusion.cpp ROCm backend |
| `cuda` | Runner can execute CUDA GPU workloads | TBD |
| `xdna2` | Runner has a Ryzen AI 300/400 series NPU | ryzenai backend, flm (FastFlowLM) backend |
A job that exercises more than one backend should request all the labels it needs (e.g., [Windows, vulkan, rocm] for a test that runs both Vulkan and ROCm cases). GitHub Actions requires the runner to carry every label in the runs-on list.
CPU-only jobs should target GitHub-hosted runners when possible.
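As an illustration, a hypothetical job that exercises both GPU backends might declare its labels like this (the workflow name, job name, and trigger here are placeholders, not existing workflows):

```yaml
# Sketch of a job that needs both GPU capabilities on a Windows runner.
# The runner must carry every label listed in runs-on for the job to match.
name: GPU backend tests     # placeholder name
on: workflow_dispatch       # manual trigger while the workflow is under review
jobs:
  gpu-backends:
    runs-on: [Windows, vulkan, rocm]
    steps:
      - uses: actions/checkout@v4
```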
These pin a job to a specific hardware class when a capability label alone isn’t enough to distinguish it. Combine them with a capability label.
| Label | Hardware | When to use |
|---|---|---|
| `stx-halo` | Strix Halo (AMD Ryzen AI Max 300 series) | ROCm workloads that specifically need Strix Halo’s iGPU, e.g., `[Windows, rocm, stx-halo]` |
Add new hardware labels here as the pool grows.
Capability and hardware labels must be present on each runner for the workflow to match. Add or remove them from the runners page: click the runner, click the gear icon in the Labels section, and check/uncheck as needed. Apply only the labels that reflect the runner’s real capabilities — never add rocm to a runner that can’t actually run ROCm, for example, because that will cause workflows to be scheduled on a machine that can’t complete them.
| Hardware | Labels to apply |
|---|---|
| Ryzen AI 300-series laptop (NPU + Vulkan iGPU + ROCm iGPU) | xdna2, vulkan, rocm |
| Strix Halo | xdna2, rocm, stx-halo |
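For instance, a hypothetical job that must land on Strix Halo combines the capability label with the hardware label (the job name is a placeholder):

```yaml
# Sketch: pin a ROCm job to Strix Halo hardware.
jobs:
  rocm-on-strix-halo:
    runs-on: [Windows, rocm, stx-halo]  # capability label + hardware label
    steps:
      - uses: actions/checkout@v4
```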
This guide will help you set up a computer as a GitHub self-hosted runner.
Prerequisites:

- If the machine should expose the `rocm` capability instead of `cuda`, you must disable the NVIDIA GPU in Device Manager.
- Open a PowerShell terminal and run `Set-ExecutionPolicy -ExecutionPolicy RemoteSigned`.

These steps will place your machine into the production pool:

1. Follow the instructions here for Windows or Ubuntu, minding the prerequisites above: https://github.com/organizations/lemonade-sdk/settings/actions/runners/new
2. When you reach the `./config.cmd` step of those instructions, make the following choices:
   - Name the runner `NAME-TYPE-NUMBER`, where `NAME` is your alias and `NUMBER` tells you this is the Nth machine of `TYPE` you’ve added. `TYPE` examples include `stx`, `stx-halo`, `phx`, etc.
   - Apply the capability labels the machine supports (`xdna2`, `vulkan`, `rocm`, etc.) and any hardware labels (like `stx-halo`).
   - Run the service as `NT AUTHORITY\SYSTEM` (not the default of `NT AUTHORITY\NETWORK SERVICE`).
3. Add the runner to the `stx` runner group in the lemonade-sdk org.

This is a production system and things will go wrong. Here is some advice on what to do.
You can run `Get-EventLog -LogName Application -Source ActionsRunnerService` in a PowerShell terminal on your runner to get more information about what it’s been up to.
If there have been any problems recently, they will show up in that log output.
Actions fail all the time, often because they are testing buggy code. However, sometimes an Action will fail because something is wrong with the specific runner that ran the Action.
If this happens to you, here are some steps you can take (in order):
1. Identify which machine ran the failing Action by opening the `Set up job` section of the Action’s log and checking the `Runner name:` field. The machine name in that field will correspond to a machine on the runners page.

If you need to do some maintenance on your laptop, use it for dev/demo work, etc., you can remove it from the runners pool.
Also, if someone else’s laptop is misbehaving and causing Actions to fail unexpectedly, you can remove that laptop from the runners pool to make sure that only healthy laptops are selected for work.
There are three options:
Option 1, which is available to anyone in the lemonade-sdk org: remove the runner’s capability labels, such as `xdna2`, `vulkan`, and `rocm` (see Runner Labels). Removing every capability label from a runner will drain it completely: no workflow will match.

Option 2, which requires physical/remote access to the laptop: stop the runner service by running `Stop-Service "actions.runner.*"` in an administrator PowerShell. To return the laptop to the pool, run `Start-Service "actions.runner.*"`.

Option 3 is to just turn the laptop off :)
GitHub Workflows define the Actions that run on self-hosted laptops to perform testing and experimentation tasks. This section will help you learn about what capabilities are available and show some examples of well-formed workflows.
Because we use self-hosted systems, we have to be careful about what we put into these workflows, for example to avoid leaving behind software or files that persist across jobs.
Here are some general guidelines to observe when creating or modifying workflows. If you aren’t confident that you are properly following these guidelines, please contact someone to review your code before opening your PR.
- Give your workflow a descriptive name, e.g., `name: Test Lemonade on NPU and Hybrid with OGA environment 🌩️`.
- Do not put `on: pull_request:` in your workflow until after a reviewer has signed off.
- Request only the labels your job needs: `runs-on: [Windows, xdna2]` for NPU work, `runs-on: [Linux, vulkan, rocm]` for a job that exercises both GPU backends. Do not ask for `xdna2` or a GPU capability if your job is CPU-only; use `[self-hosted, Windows]` / `[self-hosted, Linux]`, or move the step to a GitHub-hosted runner like `runs-on: windows-latest` when possible.
- Writing files into the working directory (`.\`) is always ok, because that will end up in `C:\actions-runner\_work\REPO`, which is always wiped between tests.
- Installing software into `AppData`, `Program Files`, etc. is not advisable because that software will persist across tests. See the setup section to see which software is already expected on the system.
- Create a virtual environment with `python -m venv .venv`. The venv ends up in `C:\actions-runner\_work\REPO`, which is wiped between tests.
- Activate the venv before any `pip install` commands. Otherwise your workflow will modify the system Python installation!
- Add `if ($LASTEXITCODE -ne 0) { exit $LASTEXITCODE }` after any line of script where it is particularly important to fail the workflow if the program in the preceding line raised an error.
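A minimal sketch of a workflow step that follows the venv and exit-code guidelines above (the install and test commands are placeholders):

```yaml
# Hypothetical workflow step: isolated venv + explicit exit-code checks.
steps:
  - name: Install and test in a job-local venv
    shell: powershell
    run: |
      python -m venv .venv              # lives under _work, wiped between jobs
      .\.venv\Scripts\Activate.ps1      # activate BEFORE any pip install
      pip install -e .                  # placeholder install command
      if ($LASTEXITCODE -ne 0) { exit $LASTEXITCODE }
      python -m pytest tests            # placeholder test command
      if ($LASTEXITCODE -ne 0) { exit $LASTEXITCODE }
```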
- Keep caches under the `_work` directory so that they will be wiped after each job. For example:
  - Hugging Face: `$Env:HF_HOME=".\hf-cache"` places the Hugging Face cache in the `_work` directory so that it will be wiped after each job.
  - lemond: `lemond .\ci-cache`
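As a sketch, caches can be redirected into the working directory from inside a step; only the `HF_HOME` setting is taken from this page, and the step name and final command are placeholders:

```yaml
# Hypothetical step: keep caches inside _work so they are wiped after the job.
steps:
  - name: Run with job-local caches
    shell: powershell
    run: |
      $Env:HF_HOME = ".\hf-cache"   # Hugging Face downloads land under _work
      python run_tests.py           # placeholder command that uses the cache
```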