Lead: This working group is led by Michele Balistreri, whose handle is @bitgamma on GitHub and @mikkoph on Discord.
Background: Running a model well requires choosing the right backend and tuning performance parameters such as batch size, GPU layer offload, thread count, and context size. Today, users either guess, look up Reddit threads, or ask an LLM; these approaches are error-prone, and the answers are often outdated (for an LLM, because of its knowledge cutoff). Lemonade already has a bench command that measures time to first token (TTFT), tokens per second (TPS), and VRAM usage across backends and parameter combinations, and a recipe system that defines per-model configuration. What is missing is a mechanism to turn benchmark data into actionable, hardware-aware defaults that apply automatically.
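As a rough illustration of turning benchmark data into a default, the sketch below picks the highest-throughput configuration that fits within a VRAM budget. The record fields (`backend`, `batch_size`, `ttft_ms`, `tps`, `vram_gb`), the selection rule, and the sample numbers are all hypothetical; they do not reflect the actual output format of the bench command.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BenchResult:
    # Hypothetical record layout; the real bench output may differ.
    backend: str
    batch_size: int
    ttft_ms: float  # time to first token, in milliseconds
    tps: float      # tokens per second
    vram_gb: float  # peak VRAM usage, in GB


def pick_default(results: list[BenchResult], vram_budget_gb: float) -> Optional[BenchResult]:
    """Return the fastest configuration that fits the VRAM budget, or None."""
    fitting = [r for r in results if r.vram_gb <= vram_budget_gb]
    if not fitting:
        return None
    # Prefer higher TPS; break ties with lower TTFT.
    return max(fitting, key=lambda r: (r.tps, -r.ttft_ms))


# Made-up numbers, for illustration only.
results = [
    BenchResult("vulkan", 2048, 310.0, 41.5, 18.2),
    BenchResult("vulkan", 512, 420.0, 38.0, 17.1),
    BenchResult("rocm", 2048, 290.0, 44.0, 26.0),
]
best = pick_default(results, vram_budget_gb=24.0)
print(best.backend, best.batch_size)
```

A real selection rule would likely weigh TTFT, TPS, and memory headroom differently depending on the workload; ranking by TPS alone is just the simplest choice for a sketch.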
Why: A user should install Lemonade and get good performance without needing to understand backend flags or run manual benchmarks. Hardware-aware defaults lower the barrier to entry and make Lemonade competitive with cloud-based alternatives where performance is abstracted away.
Goal: Enable Lemonade instances to self-optimize models and backends by detecting the machine’s hardware profile and applying community-validated performance parameters. The end state is that a user pulls a model, loads it, and Lemonade selects the best backend and tuning flags for their hardware — with the option to override or fine-tune manually.
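One way the manual override could work is a simple precedence rule: explicit user settings win over hardware-aware defaults, which in turn win over built-in fallbacks. The sketch below assumes that rule; the function name and parameter keys are illustrative and are not Lemonade's actual configuration keys.

```python
from collections import ChainMap

# Hypothetical built-in fallbacks, used when nothing better is known.
BUILTIN_FALLBACKS = {"backend": "cpu", "batch_size": 512, "ctx_size": 4096}


def resolve_params(profile_defaults: dict, user_overrides: dict) -> dict:
    """Merge settings; earlier maps in the ChainMap take precedence."""
    return dict(ChainMap(user_overrides, profile_defaults, BUILTIN_FALLBACKS))


params = resolve_params(
    profile_defaults={"backend": "vulkan", "batch_size": 2048},
    user_overrides={"batch_size": 1024},
)
print(params)
```

Here the user's batch size replaces the profile's value, the profile's backend replaces the fallback, and the context size falls through to the built-in default.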
Please see the general contribution guidelines, then contact @mikkoph on Discord before getting started to discuss the roadmap.
This working group focuses on performance parameters (batch size, GPU layers, threads, context size, backend selection) that are determined by hardware characteristics. Quality parameters such as temperature, top_p, and chat template are model- or use-case-dependent and are explicitly out of scope.
Roadmap items are high-level objectives that may span multiple issues and PRs. Details may be redefined as the work progresses.
strix-halo-128gb → { "llamacpp_backend": "vulkan", "llamacpp_vulkan_args": { "-b 2048 -ub 1024" }, "vllm_args": { ... } }). The schema is backend-agnostic so it works for llama.cpp, FastFlowLM, vLLM, RyzenAI, and future backends.
lemond, using data already collected by the system-info endpoint (GPU name, VRAM, bandwidth, unified vs. discrete memory).
lemond detects the machine archetype and caches it.
lemonade status displays the detected archetype and any active auto-tune overrides.
lemonade bench --submit generates a structured benchmark contribution from local benchmark runs. The submission workflow (upload endpoint, PR-based contribution, or other) will be designed based on infrastructure study.
lemond can suggest re-benchmarking or flag the profile for review.
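The detection, caching, and profile lookup described in the roadmap might fit together roughly as follows. This is only a sketch. The `SystemInfo` fields, the matching rule, the fallback archetype naming, and the shape of the profile table are assumptions made for illustration; only the `strix-halo-128gb` name and the Vulkan arguments come from the roadmap text above. The arguments are stored as a plain string here for simplicity.

```python
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class SystemInfo:
    # Hypothetical subset of what a system-info endpoint might report.
    gpu_name: str
    memory_gb: int
    unified_memory: bool


# Hypothetical profile table keyed by archetype name.
PROFILES = {
    "strix-halo-128gb": {
        "llamacpp_backend": "vulkan",
        "llamacpp_vulkan_args": "-b 2048 -ub 1024",
    },
}


@lru_cache(maxsize=None)
def detect_archetype(info: SystemInfo) -> str:
    """Map hardware facts to an archetype name; results are cached per input."""
    # Toy matching rule; real GPU name strings will differ.
    if info.unified_memory and "strix halo" in info.gpu_name.lower():
        return f"strix-halo-{info.memory_gb}gb"
    kind = "unified" if info.unified_memory else "discrete"
    return f"{kind}-{info.memory_gb}gb"


def lookup_profile(info: SystemInfo) -> dict:
    """Return tuned parameters for the detected archetype, or {} if unknown."""
    return PROFILES.get(detect_archetype(info), {})


info = SystemInfo(gpu_name="Example Strix Halo GPU", memory_gb=128, unified_memory=True)
print(detect_archetype(info), lookup_profile(info))
```

Returning an empty profile for an unknown archetype leaves room for built-in defaults to apply, which seems a safer failure mode than guessing at parameters for unrecognized hardware.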