🔬 GenAI Evaluation, Benchmarking & Safety
- LM Evaluation Harness (EleutherAI/lm-evaluation-harness): A widely used open-source framework for benchmarking generative language models, adopted across the AI community (including Hugging Face). It runs models against a large collection of standard academic benchmarks, such as MMLU, GSM8K, and HumanEval.
👉 EleutherAI/lm-evaluation-harness GitHub Repository
- DeepEval (confident-ai/deepeval): A unit-testing framework for evaluating LLM application outputs. It provides automated metrics that test for hallucination, toxicity, answer relevancy, and bias.
👉 confident-ai/deepeval GitHub Repository [1, 2]
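
To make the two evaluation styles above concrete, here is a minimal, standard-library-only sketch. It does not use the API of either lm-evaluation-harness or DeepEval: every name in it (`Example`, `exact_match_accuracy`, `keyword_relevancy`, `assert_relevant`, `toy_model`) is hypothetical and exists only for illustration. The first half mimics benchmark-style scoring (run a model over a labeled dataset and report accuracy); the second half mimics unit-test-style checking (assert that a single output clears a metric threshold). The keyword-overlap metric is a deliberately crude stand-in for the model-based metrics real tools provide.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    """One benchmark item: a prompt and its reference answer (hypothetical type)."""
    prompt: str
    expected: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def exact_match_accuracy(model: Callable[[str], str], dataset: List[Example]) -> float:
    """Benchmark-style scoring: fraction of examples where the output equals the reference."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    hits = sum(normalize(model(ex.prompt)) == normalize(ex.expected) for ex in dataset)
    return hits / len(dataset)


def keyword_relevancy(question: str, answer: str) -> float:
    """Toy relevancy metric: share of the question's content words that reappear in the answer."""
    stop_words = {"the", "a", "an", "is", "of", "what", "in", "to"}
    question_words = {w.strip("?.,!") for w in question.lower().split()} - stop_words
    question_words.discard("")
    answer_words = {w.strip("?.,!") for w in answer.lower().split()}
    if not question_words:
        return 0.0
    return len(question_words & answer_words) / len(question_words)


def assert_relevant(question: str, answer: str, threshold: float = 0.5) -> None:
    """Unit-test-style check: fail loudly when the metric falls below the threshold."""
    score = keyword_relevancy(question, answer)
    assert score >= threshold, f"relevancy {score:.2f} is below threshold {threshold:.2f}"


def toy_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption: a lookup table instead of an LLM)."""
    answers = {"2 + 2 =": "4", "Capital of France?": "Paris"}
    return answers.get(prompt, "unknown")


DATASET = [
    Example("2 + 2 =", "4"),
    Example("Capital of France?", "paris"),
    Example("Largest planet?", "Jupiter"),
]

# Benchmark-style: aggregate score over the whole dataset.
accuracy = exact_match_accuracy(toy_model, DATASET)
print(f"exact-match accuracy: {accuracy:.2f}")

# Unit-test-style: a single output must clear the metric threshold.
assert_relevant("What is the capital of France?", "The capital of France is Paris.")
```

The split mirrors the difference in purpose: a benchmark harness reports an aggregate number for comparing models, while a unit-testing framework gates individual application outputs, so a failing case raises an error rather than lowering an average.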