🔬 GenAI Evaluation, Benchmarking & Safety

  • LM Evaluation Harness (EleutherAI/lm-evaluation-harness): A widely used open-source framework for benchmarking generative language models on a large collection of standard academic tasks (such as MMLU, GSM8K, and HumanEval). It also serves as the backend for Hugging Face's Open LLM Leaderboard.
    👉 EleutherAI/lm-evaluation-harness GitHub Repository
  • DeepEval (confident-ai/deepeval): A unit-testing framework for evaluating LLM application outputs. It provides automated metrics that test for hallucination, toxicity, answer relevancy, and bias.
    👉 confident-ai/deepeval GitHub Repository [12]
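
To make the two ideas above concrete, here is a minimal sketch in plain Python of what a benchmark run and a unit-test-style check boil down to. It does not use the API of either library: the dataset, `stub_model`, `exact_match_accuracy`, and `assert_min_accuracy` are all hypothetical names invented for illustration, and the "model" is a canned lookup rather than a real LLM call.

```python
from typing import Callable, Dict, List

# Hypothetical toy benchmark: each example pairs a prompt with a reference answer.
TOY_BENCHMARK: List[Dict[str, str]] = [
    {"prompt": "2 + 2 =", "answer": "4"},
    {"prompt": "Capital of France?", "answer": "Paris"},
    {"prompt": "3 * 3 =", "answer": "9"},
]


def stub_model(prompt: str) -> str:
    """Stand-in for a real LLM call (assumption); one canned answer is wrong."""
    canned = {
        "2 + 2 =": "4",
        "Capital of France?": " paris ",
        "3 * 3 =": "6",
    }
    return canned.get(prompt, "")


def normalize(text: str) -> str:
    """Trim whitespace and lowercase so trivial formatting differences do not count as errors."""
    return text.strip().lower()


def exact_match_accuracy(
    model: Callable[[str], str], dataset: List[Dict[str, str]]
) -> float:
    """Benchmark-style scoring: fraction of examples whose normalized output equals the reference."""
    if not dataset:
        return 0.0
    correct = sum(
        normalize(model(example["prompt"])) == normalize(example["answer"])
        for example in dataset
    )
    return correct / len(dataset)


def assert_min_accuracy(
    model: Callable[[str], str], dataset: List[Dict[str, str]], threshold: float
) -> float:
    """Unit-test-style check: fail loudly if the score falls below a threshold."""
    score = exact_match_accuracy(model, dataset)
    assert score >= threshold, f"accuracy {score:.2f} is below threshold {threshold:.2f}"
    return score


if __name__ == "__main__":
    print(f"exact-match accuracy: {exact_match_accuracy(stub_model, TOY_BENCHMARK):.2f}")
```

Real harnesses add much more on top of this loop (prompt templating, few-shot examples, task-specific metrics, and LLM-based judges for qualities like relevancy or toxicity), so treat the sketch as a mental model rather than a substitute for either tool.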
