This guide demonstrates how to use Lemonade with LM-Evaluation-Harness (lm-eval) to evaluate language model performance across a variety of standardized benchmarks. Whether you’re comparing different model implementations or validating model capabilities, lm-eval provides a comprehensive framework for model assessment. Refer to Lemonade Server to learn more about the server interface used by lm-eval for evaluations.
LM-Evaluation-Harness (often called lm-eval) is an open-source framework for evaluating language models across a wide variety of tasks and benchmarks. Developed by EleutherAI, it has become a standard tool in the AI research community for consistent evaluation of language model capabilities.
The framework supports evaluating models on more than 200 tasks and benchmarks, including popular ones such as MMLU, WikiText, and GSM8K (the three demonstrated in this guide).
Lemonade supports integration with lm-eval through its local LLM server. The basic workflow involves four steps: set up your environment, start the Lemonade server, load a model into the server, and point lm-eval at the server’s completions endpoint.
Please refer to the installation guide, using the PyPI or From Source methods, for environment setup.
In a terminal with your environment activated, run the following command:
lemonade-server-dev serve
This starts a local LLM server on port 8000 by default.
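Before loading a model, you can optionally confirm that the server is reachable. The short Python sketch below is an assumption-laden convenience, not part of the official workflow: it assumes the server exposes an OpenAI-compatible /api/v1/models endpoint on the default port and that the requests package is installed. Adjust the URL if you changed the port.

```python
# Optional sanity check: confirm the Lemonade server is responding.
# Assumes an OpenAI-compatible /api/v1/models endpoint on the default port.
import requests

resp = requests.get("http://localhost:8000/api/v1/models", timeout=10)
resp.raise_for_status()
print(resp.json())  # should list the models the server knows about
```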
Use the following PowerShell command to load a model into the server:
Invoke-RestMethod -Uri "http://localhost:8000/api/v1/load" -Method Post -Headers @{ "Content-Type" = "application/json" } -Body '{ "checkpoint": "meta-llama/Llama-3.2-1B-Instruct", "recipe": "hf-cpu" }'
Where:

- checkpoint can be changed to load other models from Hugging Face (e.g., “meta-llama/Llama-3.2-3B-Instruct”)
- recipe can be changed to use different backends (e.g., “oga-cpu” for CPU inference on OnnxRuntime GenAI, “oga-hybrid” for AMD Ryzen™ AI acceleration). For more information on Lemonade recipes, see the Lemonade API ReadMe.

Now that the model is loaded, open a new PowerShell terminal, activate your environment, and run lm-eval tests using the following command:
lm_eval --model local-completions --tasks mmlu_abstract_algebra --model_args model=meta-llama/Llama-3.2-1B-Instruct,base_url=http://localhost:8000/api/v1/completions,num_concurrent=1,max_retries=0,tokenized_requests=False --limit 5
Where:

- --tasks can be changed as needed to run other tests (e.g., --tasks gsm8k, --tasks wikitext, etc.). For the full list of available tasks, see the lm-eval documentation.
- The checkpoint name passed as model= should match the model name loaded in the previous step.

The framework implements three primary evaluation methodologies, each exercising a different capability of language models: log-likelihood scoring, perplexity measurement, and free-form generation.
Log-likelihood tests evaluate a model’s ability to assign probabilities to different possible answers. The model predicts which answer is most likely based on conditional probabilities.
Example: In MMLU (Massive Multitask Language Understanding), the model is given a multiple-choice question and must assign probabilities to each answer choice. The model’s performance is measured by how often it assigns the highest probability to the correct answer.
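To make the scoring concrete, here is a minimal, purely illustrative sketch of how a single multiple-choice question is graded under this scheme. The log-probabilities are invented numbers standing in for the values the harness would obtain from the model; they are not real output.

```python
# Illustrative only: the log-probabilities below are made up.
question = "Which of the following is a group under addition?"
choice_logprobs = {
    "A": -4.2,   # log P(choice A | question), as scored by the model
    "B": -1.3,
    "C": -3.8,
    "D": -5.0,
}
correct = "B"

# The model "answers" by whichever choice it assigns the highest probability.
prediction = max(choice_logprobs, key=choice_logprobs.get)
print(f"predicted {prediction}, correct {correct}, hit={prediction == correct}")
```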
Step 1: Environment setup and installation - Please refer to the installation guide, using the PyPI or From Source methods, for environment setup.
Step 2: Start the Lemonade Server.
In a terminal with your environment activated, run the following command:
lemonade-server-dev serve
Step 3: Load a Model
Invoke-RestMethod -Uri "http://localhost:8000/api/v1/load" -Method Post -Headers @{ "Content-Type" = "application/json" } -Body '{ "checkpoint": "meta-llama/Llama-3.2-1B-Instruct", "recipe": "hf-cpu" }'
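If you are not on Windows or prefer Python over PowerShell, the same load request can be issued with a short script. This is a sketch that assumes the requests package is installed; the endpoint and payload simply mirror the Invoke-RestMethod call above.

```python
import requests

# Mirrors the PowerShell call above: POST /api/v1/load with a checkpoint
# and recipe. Adjust checkpoint/recipe as described earlier in this guide.
payload = {
    "checkpoint": "meta-llama/Llama-3.2-1B-Instruct",
    "recipe": "hf-cpu",
}
# Model loading can take a while, so allow a generous timeout.
resp = requests.post("http://localhost:8000/api/v1/load", json=payload, timeout=600)
resp.raise_for_status()
print(resp.json())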
Step 4: Run MMLU Tests
lm_eval --model local-completions --tasks mmlu_abstract_algebra --model_args model=meta-llama/Llama-3.2-1B-Instruct,base_url=http://localhost:8000/api/v1/completions,num_concurrent=1,max_retries=0,tokenized_requests=False --limit 5
Perplexity tests evaluate a model’s ability to predict text by measuring perplexity on held-out data. The model assigns a probability to each token in a sequence, and performance is measured by how well it predicts the actual next tokens.
Example: In perplexity benchmarks like WikiText, the model is evaluated on how well it can predict each token in a document, using a rolling window approach for longer contexts.
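As a reminder of what the reported number means, perplexity is the exponential of the average negative log-likelihood per token. The sketch below computes it from a hand-written list of token log-probabilities (illustrative values, not real model output); lm-eval’s WikiText task reports closely related word- and byte-level variants of the same idea.

```python
import math

# Illustrative token log-probabilities (natural log) for one short sequence.
token_logprobs = [-2.1, -0.4, -1.7, -0.9, -3.2]

# Perplexity = exp(average negative log-likelihood per token); lower is better.
avg_nll = -sum(token_logprobs) / len(token_logprobs)
perplexity = math.exp(avg_nll)
print(f"perplexity = {perplexity:.2f}")
```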
Step 1: Environment setup and installation - Please refer to the installation guide, using the PyPI or From Source methods, for environment setup.
Step 2: Start the Lemonade Server.
In a terminal with your environment activated, run the following command:
lemonade-server-dev serve
Step 3: Load a Model
Invoke-RestMethod -Uri "http://localhost:8000/api/v1/load" -Method Post -Headers @{ "Content-Type" = "application/json" } -Body '{ "checkpoint": "meta-llama/Llama-3.2-1B-Instruct", "recipe": "hf-cpu" }'
Step 4: Run WikiText Tests
lm_eval --model local-completions --tasks wikitext --model_args model=meta-llama/Llama-3.2-1B-Instruct,base_url=http://localhost:8000/api/v1/completions,num_concurrent=1,max_retries=0,tokenized_requests=False --limit 5
Generation tests evaluate a model’s ability to generate full responses to prompts. The generated text is then scored against reference answers or with task-specific metrics.
Example: In GSM8K (Grade School Math), the model is given a math problem and must generate a step-by-step solution. Performance is measured by whether the final answer is correct.
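The scoring step can be pictured as extracting the final number from the generated solution and comparing it with the reference answer. The sketch below is a simplified stand-in for the harness’s actual answer-extraction logic, shown only to illustrate the idea.

```python
import re

# Simplified illustration of GSM8K-style scoring: take the last number in
# the generated solution and compare it with the reference answer.
generated = "Each box holds 12 eggs, so 3 boxes hold 3 * 12 = 36 eggs. The answer is 36."
reference = "36"

numbers = re.findall(r"-?\d+(?:\.\d+)?", generated.replace(",", ""))
predicted = numbers[-1] if numbers else None
print("correct" if predicted == reference else "incorrect")
```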
Step 1: Environment setup and installation - Please refer to the installation guide, using the PyPI or From Source methods, for environment setup.
Step 2: Start the Lemonade Server.
In a terminal with your environment activated, run the following command:
lemonade-server-dev serve
Step 3: Load a Model
Invoke-RestMethod -Uri "http://localhost:8000/api/v1/load" -Method Post -Headers @{ "Content-Type" = "application/json" } -Body '{ "checkpoint": "meta-llama/Llama-3.2-1B-Instruct", "recipe": "hf-cpu" }'
Step 4: Run GSM8K Tests
lm_eval --model local-completions --tasks gsm8k --model_args model=meta-llama/Llama-3.2-1B-Instruct,base_url=http://localhost:8000/api/v1/completions,num_concurrent=1,max_retries=0,tokenized_requests=False --limit 5
lm-eval provides detailed results for each benchmark, typically including the metric value for each task (for example, accuracy for MMLU and GSM8K, or perplexity for WikiText), the standard error of that value, and configuration details such as the number of few-shot examples used.
Results are provided in a structured format at the end of evaluation, with both detailed and summary statistics.
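If you would rather drive the evaluation from Python than from the CLI, lm-eval also exposes a programmatic entry point. The sketch below assumes a recent lm-eval release that exposes simple_evaluate at the package level, and reuses the same local-completions arguments shown in the commands above.

```python
import lm_eval

# Same configuration as the CLI examples above, expressed through the
# Python API. model_args mirrors the --model_args string.
results = lm_eval.simple_evaluate(
    model="local-completions",
    model_args=(
        "model=meta-llama/Llama-3.2-1B-Instruct,"
        "base_url=http://localhost:8000/api/v1/completions,"
        "num_concurrent=1,max_retries=0,tokenized_requests=False"
    ),
    tasks=["gsm8k"],
    limit=5,
)
print(results["results"])  # per-task metrics, mirroring the CLI summary table
```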