lemonade

🌩️ Self Hosted Runners 🌩️ Documentation

This page documents how to set up and maintain self-hosted runners for lemonade-sdk.

What are Self-Hosted Runners?

A “runner” is a computer with GitHub’s runner software installed; the software runs a service that makes the machine available to run GitHub Actions. Actions are defined by Workflows, which specify when the Action should run (manual trigger, CI, CD, etc.) and what it does (run tests, build packages, run an experiment, etc.).

For more background, see GitHub’s “About self-hosted runners” documentation.

Runner Labels

Workflows target self-hosted runners by the labels the runner carries. We use two kinds of labels:

Capability labels

These describe what a runner can do. A workflow should request only the capability labels it actually needs.

| Label | Meaning | Typical workflows that request it |
| --- | --- | --- |
| vulkan | Runner can execute Vulkan GPU workloads | llama.cpp Vulkan backend, whisper.cpp Vulkan backend |
| rocm | Runner can execute ROCm GPU workloads | llama.cpp ROCm backend, stable-diffusion.cpp ROCm backend |
| cuda | Runner can execute CUDA GPU workloads | TBD |
| xdna2 | Runner has a Ryzen AI 300/400 series NPU | ryzenai backend, flm (FastFlowLM) backend |

A job that exercises more than one backend should request all the labels it needs (e.g., [Windows, vulkan, rocm] for a test that runs both Vulkan and ROCm cases). GitHub Actions requires the runner to carry every label in the runs-on list.
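
The matching rule above amounts to a subset check: a runner is eligible only if its label set contains every label in the runs-on list. The following is a minimal Python sketch of that rule, not GitHub's implementation. The `eligible_runners` helper, the runner names, and the label sets are hypothetical, and the sketch assumes labels compare case-insensitively.

```python
def eligible_runners(runs_on, runners):
    """Return names of runners whose labels include every label in runs_on."""
    # Assumption: label comparison is case-insensitive.
    wanted = {label.lower() for label in runs_on}
    return sorted(
        name
        for name, labels in runners.items()
        if wanted <= {label.lower() for label in labels}
    )


# Hypothetical pool for illustration only.
pool = {
    "alice-stx-1": ["Windows", "xdna2", "vulkan", "rocm"],
    "bob-stx-halo-1": ["Windows", "xdna2", "rocm", "stx-halo"],
}

print(eligible_runners(["Windows", "vulkan", "rocm"], pool))    # only the first runner carries vulkan
print(eligible_runners(["Windows", "rocm", "stx-halo"], pool))  # only the second carries stx-halo
print(eligible_runners(["Windows", "cuda"], pool))              # no runner carries cuda
```

A request that no runner can satisfy, like the last one, would leave the job waiting in the queue, which is why a workflow should request only the labels it needs.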

CPU-only jobs should target GitHub-hosted runners when possible.

Hardware labels

These pin a job to a specific hardware class when a capability label alone isn’t enough to distinguish it. Combine them with a capability label.

| Label | Hardware | When to use |
| --- | --- | --- |
| stx-halo | Strix Halo (AMD Ryzen AI Max 300 series) | ROCm workloads that specifically need Strix Halo’s iGPU, e.g., [Windows, rocm, stx-halo] |

Add new hardware labels here as the pool grows.

Applying labels to a runner

Capability and hardware labels must be present on each runner for the workflow to match. Add or remove them from the runners page: click the runner, click the gear icon in the Labels section, and check/uncheck as needed. Apply only the labels that reflect the runner’s real capabilities — never add rocm to a runner that can’t actually run ROCm, for example, because that will cause workflows to be scheduled on a machine that can’t complete them.

Typical label sets by hardware

| Hardware | Labels to apply |
| --- | --- |
| Ryzen AI 300-series laptop (NPU + Vulkan iGPU + ROCm iGPU) | xdna2, vulkan, rocm |
| Strix Halo | xdna2, rocm, stx-halo |

New Runner Setup

This guide will help you set up a computer as a GitHub self-hosted runner.

New Machine Setup

Runner Configuration

These steps will place your machine into the production pool.

  1. IMPORTANT: before doing step 2, read this:
    • Use a PowerShell terminal running as administrator
    • Enable permissions by running Set-ExecutionPolicy RemoteSigned
    • When running ./config.cmd in step 2, make the following choices:
      • Name of the runner group = stx
      • For the runner name, call it NAME-TYPE-NUMBER, where NAME is your alias and NUMBER would tell you this is the Nth machine of TYPE you’ve added. TYPE examples include stx, stx-halo, phx, etc.
      • Apply capability labels (xdna2, vulkan, rocm, etc.) and any hardware labels (such as stx-halo).
      • Accept the default for the work folder
      • You want the runner to function as a service (respond Y)
      • User account to use for the service = NT AUTHORITY\SYSTEM (not the default of NT AUTHORITY\NETWORK SERVICE)
  2. Follow the instructions here for Windows or Ubuntu, minding what we said in step 1: https://github.com/organizations/lemonade-sdk/settings/actions/runners/new
  3. You should see your runner show up in the stx runner group in the lemonade-sdk org
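
The NAME-TYPE-NUMBER naming convention from step 1 can be checked mechanically, which helps when looking up a runner's maintainer later. This is a minimal Python sketch under two assumptions not stated above: the alias contains no hyphen, and TYPE may itself contain hyphens (as in stx-halo). The `parse_runner_name` helper and the example names are hypothetical.

```python
import re

# NAME: alias without hyphens (assumption); TYPE: may contain hyphens; NUMBER: digits.
RUNNER_NAME = re.compile(r"^(?P<name>[^-]+)-(?P<type>.+)-(?P<number>\d+)$")


def parse_runner_name(runner_name):
    """Split a runner name into (alias, type, number), or return None if it does not fit."""
    match = RUNNER_NAME.match(runner_name)
    if match is None:
        return None
    return match["name"], match["type"], int(match["number"])


print(parse_runner_name("alice-stx-halo-2"))  # alias alice, type stx-halo, machine 2
print(parse_runner_name("alice-stx-1"))       # alias alice, type stx, machine 1
print(parse_runner_name("my_laptop"))         # does not follow the convention
```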

Maintenance and Troubleshooting

This is a production system and things will go wrong. Here is some advice on what to do.

Check your runner’s status

You can run Get-EventLog -LogName Application -Source ActionsRunnerService in a PowerShell terminal on your runner to see what the runner has been doing.

If there have been any problems recently, they may show up like:

Actions are failing unexpectedly

Actions fail all the time, often because they are testing buggy code. However, sometimes an Action will fail because something is wrong with the specific runner that ran the Action.

If this happens to you, here are some steps you can take (in order):

  1. Take note of which runner executed your Action. You can check this by going to the Set up job section of the Action’s log and checking the Runner name: field. The machine name in that field will correspond to a machine on the runners page.
  2. Re-queue your job. It is possible that the failure is a one-off and the job will succeed the next time on the same runner. Re-queuing also gives you a chance of getting a runner that is in a healthier state.
  3. If the same runner is consistently failing, it is probably in an unhealthy state (or you have a bug in your code and you’re just blaming the runner). If a runner is in an unhealthy state:
    1. Take the laptop offline so that it stops being allocated Actions.
    2. Open an Issue. Assign it to the maintainer of the laptop (their name should be in the runner’s name). Link the multiple failed workflows that have convinced you that this runner is unhealthy.
    3. Re-queue your job. You’ll definitely get a different runner now since you took the unhealthy runner offline.
  4. If all runners are consistently failing your workflow, seriously think about whether your code is the problem.
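
To judge whether one runner is consistently failing (step 3) or every runner is (step 4), it can help to tally failed jobs per runner. The sketch below is one possible way to do that in Python. It assumes job records shaped like the job objects in GitHub's REST API, which carry `runner_name` and `conclusion` fields. The `failures_by_runner` helper and the sample records are hypothetical.

```python
from collections import Counter


def failures_by_runner(jobs):
    """Count failed jobs per runner, most failures first."""
    counts = Counter(
        job["runner_name"]
        for job in jobs
        # Skip jobs that succeeded or were never assigned to a runner.
        if job.get("conclusion") == "failure" and job.get("runner_name")
    )
    return counts.most_common()


# Made-up job records for illustration only.
sample_jobs = [
    {"runner_name": "alice-stx-1", "conclusion": "failure"},
    {"runner_name": "bob-stx-halo-1", "conclusion": "success"},
    {"runner_name": "alice-stx-1", "conclusion": "failure"},
    {"runner_name": "bob-stx-halo-1", "conclusion": "failure"},
    {"runner_name": None, "conclusion": "cancelled"},
]

for runner, failed in failures_by_runner(sample_jobs):
    print(f"{runner}: {failed} failed job(s)")
```

Failures concentrated on one runner point to an unhealthy machine; failures spread evenly across runners point back to the code under test.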

Take a laptop offline

If you need to do maintenance on your laptop, use it for dev/demo work, or otherwise take it out of service, you can remove it from the runners pool.

Also, if someone else’s laptop is misbehaving and causing Actions to fail unexpectedly, you can remove that laptop from the runners pool to make sure that only healthy laptops are selected for work.

There are three options:

Option 1, which is available to anyone in the lemonade-sdk org: remove the runner’s capability labels.

Option 2, which requires physical/remote access to the laptop:

Option 3 is to just turn the laptop off :)

Creating Workflows

GitHub Workflows define the Actions that run on self-hosted laptops to perform testing and experimentation tasks. This section will help you learn about what capabilities are available and show some examples of well-formed workflows.

Capabilities and Limitations

Because we use self-hosted systems, we have to be careful about what we put into these workflows so that we avoid:

Here are some general guidelines to observe when creating or modifying workflows. If you aren’t confident that you are properly following these guidelines, please contact someone to review your code before opening your PR.