Organizations for LLM Evaluation

1. Why is LLM evaluation necessary?

The rapid development and increasingly complex capabilities of Large Language Models (LLMs) pose an urgent need for their comprehensive and systematic evaluation. Evaluating LLMs is not merely about measuring academic performance but is also a crucial process to thoroughly understand their true capabilities, potential limitations, level of trustworthiness, and their possible impacts on individuals and society. Without reliable evaluation methods, the selection, deployment, and management of LLMs would become haphazard and fraught with risks.

The main aspects to consider when evaluating LLMs include:

  • Performance: This is the most fundamental aspect, assessing the LLM's ability to complete specific language tasks. Commonly used metrics include accuracy, precision, recall, F1-score, BLEU (for machine translation), ROUGE (for text summarization), etc.
  • Trustworthiness: This is a multifaceted concept, encompassing many important factors to ensure LLMs operate responsibly. These factors include:
    • Truthfulness: The ability to provide accurate information, avoiding the generation of "hallucinations" or misinformation.
    • Safety: The ability to resist generating harmful, dangerous content, or being misused for malicious purposes.
    • Fairness: Avoiding discriminatory behavior or bias based on sensitive characteristics such as gender, race, or religion.
    • Privacy: Protecting personal information and not disclosing sensitive data that may have been present during training or interaction.
    • Robustness: The ability to maintain stable performance when faced with noise, anomalous input data, or deliberate attacks.
    • Transparency: The degree to which the model's decision-making process can be understood or audited.
    • Accountability: The ability to identify the source of information generated by the model or assign responsibility when errors occur.
  • Utility & Relevancy: Assessing whether the answers or content generated by the LLM are genuinely useful, relevant to the user's query, concise, and easy to understand. Metrics such as task completion rate and answer relevancy are considered here.
  • Efficiency: This includes processing speed (latency), throughput, and the computational cost required to train and operate the model.
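As a concrete illustration of the performance metrics listed above, precision, recall, and F1-score can all be derived from confusion-matrix counts. The following is a minimal sketch (the function name is ours):

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1-score from confusion-matrix counts.

    tp: true positives, fp: false positives, fn: false negatives.
    Guards against division by zero by returning 0.0 for undefined metrics.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: 8 correct detections, 2 false alarms, 2 misses
p, r, f1 = precision_recall_f1(8, 2, 2)
```

F1 is the harmonic mean of precision and recall, so it penalizes a model that trades one heavily for the other.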

Initially, the focus of LLM evaluation was often primarily on the ability to perform traditional NLP tasks. However, as LLMs have become more powerful and deeply integrated into many aspects of life, factors related to trustworthiness, safety, and ethics are increasingly emphasized by the research community and policymakers.

This reflects the maturation of the AI field and a growing awareness of the potential societal impacts, both positive and negative, of this technology. A model that performs very well in an experimental context or on a specific benchmark dataset may still fail or perform poorly in a real-world application or a different knowledge domain. Therefore, LLM evaluation needs to consider the specific application context in which the model will be deployed, rather than relying solely on general performance figures.

2. Reputable Evaluation Organizations and Frameworks

In response to the growing demand for objective and comprehensive LLM evaluation, many research organizations and government agencies have developed specialized evaluation frameworks and platforms.

2.1. Stanford CRFM - HELM (Holistic Evaluation of Language Models)

The Center for Research on Foundation Models (CRFM) at Stanford University developed HELM (Holistic Evaluation of Language Models), an open-source Python framework aimed at evaluating foundation models, including LLMs and multimodal models, in a comprehensive, reproducible, and transparent manner. HELM not only measures accuracy but also considers other aspects such as efficiency, bias, and toxicity. Notable versions and applications of HELM include:

  • HELM Classic and HELM Lite: Provide general leaderboards for core capabilities.
  • MedHELM: A branch of HELM specifically customized to evaluate LLMs in medical applications. MedHELM focuses on tasks with practical value in the healthcare industry, using data from electronic health records and realistic clinical scenarios to assess the readiness of LLMs for the medical environment.
  • HELM Capabilities: Focuses on evaluating core LLM capabilities such as general knowledge, reasoning ability, instruction following, dialogue, and mathematical reasoning, through carefully selected scenarios from existing benchmarks.
  • HELM Safety (using AIR-Bench) and HELM Instruct: Specialized leaderboards for critical aspects like safety and instruction-following capabilities.

2.2. TrustLLM

This is a comprehensive evaluation framework developed through the collaboration of multiple universities and research organizations. TrustLLM focuses on assessing the trustworthiness of LLMs through eight key dimensions: fairness, machine ethics, privacy, robustness, safety, truthfulness, accountability, and transparency. The framework utilizes 30 public datasets as benchmarks to test these aspects across a range of tasks from simple to complex.

2.3. US AI Safety Institute (AISI)

The U.S. AI Safety Institute, part of the U.S. Department of Commerce, plays a crucial role in establishing standards and conducting safety evaluations for advanced AI models. AISI has collaborated with companies like Scale AI to jointly develop testing criteria and expand evaluation capabilities for model developers of all sizes. AISI's focus is on ensuring that AI models, especially powerful LLMs, are thoroughly tested for potential risks before widespread deployment.

2.4. Credo AI - Model Trust Scores

Credo AI offers an AI governance platform that includes "Model Trust Scores." This scoring framework helps governance teams define appropriate requirements and guides deployers in conducting additional evaluations based on business needs, risk thresholds, legal obligations, and corporate policies. Credo AI focuses on four main aspects: model capability (raw performance and ability to perform specific tasks), safety measures (from toxicity control to bias mitigation), operational cost/affordability, and system speed. They utilize both ecosystem-wide benchmarks (MMLU, GPQA, LiveBench, etc.) and context-specific evaluations.

2.5. MLCommons - AILuminate suite

MLCommons, an industry consortium focused on creating benchmarks and datasets for AI, has developed the AILuminate suite. This suite is designed to evaluate 12 hazard categories related to LLMs, including content safety (such as child sexual exploitation, sexual content, hate speech), criminal activities, harmful advice (such as incorrect professional advice, suicide and self-harm), and other risks like defamation, privacy, intellectual property, and indiscriminate weapons.

The emergence and development of numerous specialized evaluation organizations and frameworks demonstrate a strong effort by the global AI community to standardize processes and enhance the quality of LLM evaluation. A noteworthy point is the increasing collaboration between the public and private sectors, as seen in the case of AISI and Scale AI.

This reflects the understanding that ensuring safe and trustworthy AI is a shared responsibility, requiring the joint efforts of policymakers, researchers, and technology developers. Furthermore, the prevailing trend in modern evaluation frameworks is comprehensiveness (holistic). Approaches like HELM and TrustLLM do not merely measure a few isolated performance metrics but strive to cover a broad spectrum of aspects, from basic capabilities to trustworthiness, safety, and ethics. This reflects the inherent complexity of LLMs and the multifaceted impacts they can have in the real world.

3. Prominent LLM Leaderboards

LLM leaderboards serve as quick reference points, helping the community track progress and compare the relative performance of different models. Below are some prominent leaderboards:

3.1. LMSYS Chatbot Arena

This is a unique platform that uses crowdsourcing to evaluate LLMs. Users anonymously interact with two chatbots and vote for the response they deem better. The ranking of models is determined by an Elo rating system, similar to that used in chess. Chatbot Arena also integrates results from academic benchmarks like MT-Bench and MMLU to provide a more multifaceted view.
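The Elo mechanism behind such a ranking is straightforward to sketch: after each pairwise vote, the winner gains rating points in proportion to how unexpected the win was. The following is a minimal illustration, not the Arena's exact implementation (which also handles ties and tunes its own K-factor):

```python
def elo_update(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """One Elo update after a head-to-head comparison.

    score_a is 1.0 if model A's response wins, 0.0 if it loses, 0.5 for a tie.
    k controls how far a single vote can move the ratings.
    """
    # Expected score of A given the current rating gap (logistic curve, base 10)
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    # Zero-sum: whatever A gains, B loses
    return rating_a + delta, rating_b - delta
```

With equal ratings a win is worth exactly k/2 points; an underdog beating a higher-rated model gains more, which is what lets the ranking converge from noisy crowd votes.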

3.2. Hugging Face Open LLM Leaderboard

As one of the most popular leaderboards for open-source LLMs, the Hugging Face Open LLM Leaderboard uses the EleutherAI Language Model Evaluation Harness for automated evaluation. Models are tested on a standard set of benchmarks including ARC (AI2 Reasoning Challenge), HellaSwag, MMLU (Massive Multitask Language Understanding), TruthfulQA, Winogrande, and GSM8K (Grade School Math 8K). Hugging Face also maintains a separate leaderboard for models quantized to low-bit precision (Low-bit Quantized Open LLM Leaderboard), focusing on the efficiency of compressed models. Recently, Hugging Face updated its ranking methodology by using normalized scores to balance the weight of each benchmark.
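One common way to normalize benchmark scores so they carry comparable weight is to rescale each score between the random-guessing baseline and a perfect score. This is an illustrative sketch of that idea, not necessarily Hugging Face's exact formula:

```python
def normalize_score(raw: float, random_baseline: float) -> float:
    """Rescale a raw benchmark score (0-100) so that random guessing
    maps to 0 and a perfect score maps to 100.

    random_baseline: expected score of random guessing, e.g. 25.0 for a
    4-way multiple-choice benchmark. Scores below the baseline clip to 0.
    """
    return max(0.0, (raw - random_baseline) / (100.0 - random_baseline) * 100.0)
```

Without this rescaling, a benchmark where random guessing already yields 25% inflates weak models relative to a benchmark where the random baseline is near zero.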

3.3. Orq.ai LLM Leaderboard

This platform provides the ability to compare LLMs based on standard benchmarks, while also categorizing performance by specific tasks such as Multilingual Q&A (using MMLU), Multi-task Reasoning (using GPQA Diamond), and Math Problem-Solving (using MATH 500). Additionally, Orq.ai compares models based on processing speed and operational costs.

3.4. BigCodeBench

This leaderboard specializes in evaluating the programming capabilities of LLMs. It uses realistic and challenging programming tasks, including code completion based on detailed descriptions and code generation from concise natural language instructions.

3.5. Other Specialized Leaderboards

In addition to the general leaderboards mentioned above, there are many leaderboards focusing on specific aspects or types of LLMs:

  • Trustbit LLM Benchmark: Evaluates LLMs monthly based on real-world benchmark data from software products, focusing on enterprise applications such as document processing, CRM integration, marketing support, and source code generation.
  • Oobabooga benchmark: Assesses academic knowledge and logical reasoning abilities using self-created multiple-choice questions.
  • OpenCompass: CompassRank: Provides objective evaluations for advanced language and vision models.
  • EQ-Bench: Evaluates the emotional intelligence of LLMs through their ability to understand complex emotional dynamics in conversations.
  • Berkeley Function-Calling Leaderboard: Assesses the ability of LLMs to accurately perform function calling or tool use.
  • The CanAiCode Leaderboard: Specifically for testing small language models (SLMs) in text-to-code generation tasks.
  • Open Multilingual LLM Evaluation Leaderboard: Ranks LLMs across various languages, especially non-English languages, using translated benchmarks such as AI2 Reasoning Challenge, HellaSwag, MMLU, and TruthfulQA.
  • MTEB (Massive Text Embedding Benchmark) Leaderboard: Specializes in evaluating text embedding models across multiple tasks and datasets.
  • AlpacaEval Leaderboard: Evaluates instruction-following and language understanding capabilities using an automated evaluation system based on the AlpacaFarm dataset.
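Embedding benchmarks such as MTEB ultimately score how well vector similarity reflects semantic similarity. As a minimal illustration (not MTEB's actual pipeline), the core comparison between two embedding vectors is typically cosine similarity:

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors.

    Returns 1.0 for identical directions, 0.0 for orthogonal vectors.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

A retrieval task on such a leaderboard then checks whether the most cosine-similar documents to a query embedding are in fact the relevant ones.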

The diversity of leaderboards reflects the increasing diversity and specialization of LLMs themselves. There is no single "universal yardstick" for all models. Each leaderboard has its own priorities and methodologies, focusing on different aspects such as open-source performance, specialized capabilities (programming, emotional intelligence, multilingualism), or suitability for enterprise applications.

However, using and interpreting results from leaderboards also requires caution. One major challenge is the issue of "data contamination," where a model may have been trained (intentionally or unintentionally) on the very data used in benchmark tests. This can skew evaluation results and create an inaccurate impression of the model's true capabilities. Leaderboard administrators like Hugging Face are actively working to detect and mitigate this issue.

In addition, many automated benchmarks may not fully capture the subtle aspects of language such as naturalness, creativity, or true usefulness in a conversation. This is why human-based evaluation methods, as used in the LMSYS Chatbot Arena, still play an important role. Human preference provides a valuable complementary perspective, especially for free-form generation tasks where there is no single "correct answer." Therefore, users should consider results from multiple sources and clearly understand the methodology behind each leaderboard to obtain the most comprehensive assessment.
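A common first-pass check for the data contamination problem mentioned above is word n-gram overlap: if long runs of a benchmark item appear verbatim in a training document, the item was likely seen during training. This is a simplified sketch (production systems use more robust matching, e.g. hashing and fuzzy deduplication):

```python
def ngram_set(text: str, n: int) -> set[tuple[str, ...]]:
    """All word n-grams of a text, lowercased."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_overlap(benchmark_item: str, training_doc: str, n: int = 8) -> float:
    """Fraction of the benchmark item's word n-grams found verbatim
    in a training document. 1.0 means full overlap, 0.0 means none."""
    bench = ngram_set(benchmark_item, n)
    if not bench:
        return 0.0  # item shorter than n words: nothing to compare
    return len(bench & ngram_set(training_doc, n)) / len(bench)
```

A high overlap score flags a benchmark item for exclusion or for reporting the model's score on that item separately.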

4. Conclusion

In summary, evaluating Large Language Models is a multifaceted and continually evolving task, playing a crucial role in harnessing the immense potential while mitigating the risks of this technology. From measuring basic performance to considering more complex aspects such as trustworthiness, safety, and fairness, the research community and specialized organizations continue to strive to build increasingly comprehensive evaluation methods and toolkits. The emergence of many reputable frameworks like HELM and TrustLLM, along with the guiding role of agencies like the US AI Safety Institute, demonstrates a strong commitment towards developing responsible AI.

The diverse range of leaderboards, from general platforms like LMSYS's Chatbot Arena and Hugging Face's Open LLM Leaderboard to specialized leaderboards for programming or emotional intelligence, provides valuable insights into the capabilities of each model. However, interpreting the results requires caution, an awareness of challenges such as data contamination, and the combined use of both automated and human evaluation. As LLMs continue to advance in scale and capability, evaluation methods must also be continuously improved and adapted, ensuring that we can guide the development of this technology safely, effectively, and for the practical benefit of society.
