Evalverse: Revolutionizing Large Language Model Evaluation with a Unified, User-Friendly Framework

2024/04/16 | Written By: YoungHoon Jeon, Jihoo Kim, Wonho Song, Dahyun Kim, Yoonsoo Kim, Yungi Kim, Chanjun Park
 

In the rapidly advancing field of artificial intelligence, evaluating Large Language Models (LLMs) is often a complex and disjointed task. Acknowledging the necessity for a more integrated method, Upstage proudly presents Evalverse, an innovative library designed to simplify and unify the evaluation process. This tool not only facilitates a more systematic assessment of LLMs but also makes cutting-edge evaluation techniques accessible to a broader audience, ensuring that advancements in AI are both inclusive and comprehensive.

Overview of Evalverse

What is Evalverse?

Evalverse is a centralized platform designed to streamline the evaluation of LLMs by integrating a variety of evaluation methodologies. It incorporates well-known frameworks such as lm-evaluation-harness and FastChat as submodules. This architecture makes Evalverse both a unified and an expandable library, and it simplifies pulling in upstream updates, keeping the tool at the forefront of technological advancement.

Key Features

  • Unified evaluation with Submodules: Evalverse leverages Git submodules to integrate and manage external evaluation frameworks, such as lm-evaluation-harness and FastChat. This approach allows for the straightforward addition of new submodules, facilitating the support of a broader range of evaluation frameworks. Moreover, it enables the seamless incorporation of upstream changes, keeping Evalverse up-to-date in the dynamic landscape of LLM technology.

  • No-code evaluation request: Evalverse introduces a no-code evaluation feature, accessible through Slack interactions. Users simply initiate a request by typing Request! in a direct message or a designated Slack channel with an active Evalverse Slack bot. The bot then guides the user through selecting a model from the Hugging Face Hub or specifying a local model directory, culminating in the execution of the evaluation process without requiring any direct code interaction.
    (*Currently, we only support Slack, but we plan to expand to other platforms in the future.)

  • LLM evaluation report: Evalverse enhances user convenience by providing detailed evaluation reports in the same no-code format. By entering Report!, users can prompt the system to generate comprehensive evaluation reports. Once the user selects specific models and evaluation criteria, Evalverse computes average scores and rankings based on stored data. These results are then presented in an insightful report, complete with performance tables and graphical visualizations, facilitating an in-depth understanding of model performance. For users who prefer working in code, the same flows are also available through Evalverse's Python interface, sketched just after this list.
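
For the code-based path, the sketch below follows the usage pattern shown in the Evalverse repository. Treat it as a minimal sketch rather than a definitive reference: the exact argument names (model, benchmark, db_path, output_path, model_list, benchmark_list) are assumptions here, so please check the documentation for the current API.

# A minimal sketch of code-based Evalverse usage, following the pattern
# in the project repository; exact argument names are assumptions.
import evalverse as ev

# Run an evaluation on a Hugging Face Hub model id (or a local model
# directory) against a benchmark suite.
evaluator = ev.Evaluator()
evaluator.run(
    model="upstage/SOLAR-10.7B-Instruct-v1.0",  # Hub id or local path
    benchmark="h6_en",                          # benchmark suite to run
)

# Generate a report: the Reporter reads stored results from the database
# and produces average scores, rankings, tables, and figures.
reporter = ev.Reporter(db_path="./db", output_path="./results")
reporter.update_db(save=True)
reporter.run(
    model_list=["SOLAR-10.7B-Instruct-v1.0"],
    benchmark_list=["h6_en"],
)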

Architecture of Evalverse

The architecture of Evalverse is built around several key components: Submodules, Connectors, Evaluators, a Reporter, a Compute Cluster, and a Database. These elements work together to ensure that evaluations are conducted smoothly and efficiently. The system supports two modes of interaction, accommodating both no-code evaluations through Slack and conventional code-based evaluations. This dual-mode functionality underscores Evalverse's commitment to flexibility and to catering to the varied preferences of its users. Each component is described below, followed by an illustrative sketch of how they fit together.

  • Submodule. The Submodule serves as the evaluation engine that is responsible for the heavy lifting involved in evaluating LLMs. Publicly available LLM evaluation libraries can be integrated into Evalverse as submodules. This component makes Evalverse expandable, thereby ensuring that the library remains up-to-date.

  • Connector. The Connector plays a role in linking the Submodules with the Evaluator. It contains evaluation scripts, along with the necessary arguments, from various external libraries.

  • Evaluator. The Evaluator performs the requested evaluations on the Compute Cluster by utilizing the evaluation scripts from the Connector. The Evaluator can receive evaluation requests either from the Reporter, which facilitates a no-code evaluation approach, or directly from the end-user for code-based evaluation.

  • Compute Cluster. The Compute Cluster is the collection of hardware accelerators needed to execute the LLM evaluation processes. When the Evaluator schedules an evaluation job to be run, the Compute Cluster fetches the required model and data files from the Database. The results of the evaluation jobs are sent to the Database for storage.

  • Database. The Database stores the model files and data needed in the evaluation processes, along with evaluation results. The stored evaluation results are used by the Reporter to create evaluation reports for the user.

  • Reporter. The Reporter handles the evaluation and report requests sent by users, allowing for a no-code approach to LLM evaluation. The Reporter sends the requested evaluation jobs to the Evaluator and fetches the evaluation results from the Database, which are sent to the user via an external communication platform such as Slack. Through this, users can receive tables and figures that summarize evaluation results.
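
To make the division of labor concrete, here is a purely illustrative sketch of how a request might flow from the Evaluator through a Connector to a submodule's script, with results stored in the Database for the Reporter to summarize. These classes are hypothetical stand-ins written for this post, not Evalverse's actual internals.

# Purely illustrative sketch of the component flow; the classes below
# are hypothetical stand-ins, not Evalverse's real internals.

class Connector:
    """Maps a benchmark name to the evaluation script of a submodule."""
    SCRIPTS = {"h6_en": "lm-evaluation-harness", "mt_bench": "FastChat"}

    def script_for(self, benchmark: str) -> str:
        return self.SCRIPTS[benchmark]

class Database:
    """Stores model files, data, and evaluation results."""
    def __init__(self) -> None:
        self.results: dict[tuple[str, str], float] = {}

    def save(self, model: str, benchmark: str, score: float) -> None:
        self.results[(model, benchmark)] = score

class Evaluator:
    """Schedules jobs on the compute cluster using Connector scripts."""
    def __init__(self, connector: Connector, db: Database) -> None:
        self.connector = connector
        self.db = db

    def run(self, model: str, benchmark: str) -> None:
        script = self.connector.script_for(benchmark)
        # In Evalverse proper, this step dispatches a job to the Compute
        # Cluster, which fetches the model and data from the Database.
        print(f"Running {script} on {model} for {benchmark}")
        self.db.save(model, benchmark, score=0.0)  # placeholder score

evaluator = Evaluator(Connector(), Database())
evaluator.run("upstage/SOLAR-10.7B-Instruct-v1.0", "h6_en")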

License

Evalverse is fully open source and licensed under the Apache License 2.0.
For more detailed information, please visit our Evalverse documentation page:
Evalverse Documentation

Practical Application and Demonstration

The practicality of Evalverse is effectively illustrated in a demonstration video, which highlights the user-friendly interface that allows users to engage with the system via Slack. This feature enables users to effortlessly request evaluations and receive comprehensive reports. The seamless integration and ease of use make Evalverse an invaluable resource for both researchers and practitioners, simplifying complex processes and fostering efficiency in LLM evaluation. This demonstration underscores Evalverse's commitment to enhancing accessibility and utility in the field of artificial intelligence.

Evalverse marks a substantial leap forward in the realm of LLM evaluation. Offering a unified, accessible, and readily expandable framework, it adeptly confronts the prevalent issues of fragmentation in evaluation tools and high technical barriers to entry. The potential of Evalverse to revolutionize LLM assessment practices is immense, poised to significantly boost the development and deployment of these robust models across a variety of industries. This advancement underscores Evalverse's commitment to driving innovation and broadening the accessibility of cutting-edge AI technologies. Feel free to embark on your own LLM experiments with Evalverse!

Citation

If you want to cite our Evalverse project, feel free to use the following BibTeX entry!

@article{kim2024evalverse,
  title={Evalverse: Unified and Accessible Library for Large Language Model Evaluation},
  author={Kim, Jihoo and Song, Wonho and Kim, Dahyun and Kim, Yunsu and Kim, Yungi and Park, Chanjun},
  journal={arXiv preprint arXiv:2404.00943},
  year={2024}
}