lm-evaluation-harness - LLM Benchmarking

MLOps

Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.
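A minimal sketch of invoking the harness programmatically, assuming the `lm_eval` Python package is installed (`pip install lm-eval`). The checkpoint name and task list below are placeholders, not recommendations; the CLI (`lm_eval --model hf ...`) exposes the same options.

```python
# Minimal sketch: evaluate a HuggingFace model on a few benchmarks
# via the harness's Python API. The model name and task list are
# placeholders; substitute your own checkpoint and tasks.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # HuggingFace backend
    model_args="pretrained=EleutherAI/pythia-160m",  # placeholder checkpoint
    tasks=["hellaswag", "gsm8k"],                    # benchmarks from the list above
    num_fewshot=0,                                   # zero-shot evaluation
    batch_size=8,
)

# Per-task metrics (e.g. accuracy) live under results["results"]
for task, metrics in results["results"].items():
    print(task, metrics)
```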

Practical Example

Quick Start

ML systems need an engineered approach to evaluating LLMs across 60+ academic benchmarks (MMLU, HumanEval, and others), covering the full workflow from experiment to production.

Example Prompt

Acting as lm-evaluation-harness - LLM Benchmarking, help me with the following task: I need to build an ML model training and deployment pipeline covering the full workflow from experiment to production.
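
In a pipeline like the one the prompt describes, the harness is typically re-run on each training checkpoint with a fixed benchmark suite so scores stay comparable across steps. A sketch of that pattern follows; the checkpoint directory layout and task choices are hypothetical, not part of the harness itself.

```python
# Sketch: track training progress by re-running the same benchmark
# suite on each saved checkpoint. Paths and task choices are
# hypothetical placeholders.
from pathlib import Path

import lm_eval

CHECKPOINT_DIR = Path("checkpoints")   # hypothetical layout: checkpoints/step-1000, ...
TASKS = ["mmlu", "gsm8k"]              # fixed suite so runs are comparable

for ckpt in sorted(CHECKPOINT_DIR.glob("step-*")):
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={ckpt}",  # load the local checkpoint directory
        tasks=TASKS,
        batch_size="auto",                # let the harness pick a batch size
    )
    for task, metrics in results["results"].items():
        print(ckpt.name, task, metrics)
```

Logging these per-checkpoint metrics to your experiment tracker gives the experiment-to-production trace the task asks for.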
