Hugging Face has launched Community Evals, a feature that enables benchmark datasets on the Hub to host their own leaderboards and automatically collect evaluation results from model repositories. The system decentralizes the reporting and tracking of benchmark scores by relying on the Hub’s Git-based infrastructure, making submissions transparent, versioned, and reproducible.
Under the new system, dataset repositories can register as benchmarks. Once registered, they automatically collect and display evaluation results submitted across the Hub. Benchmarks define their evaluation specifications in an eval.yaml file based on the Inspect AI format, which describes the task and evaluation procedure so that results can be reproduced. Initial benchmarks available through this system include MMLU-Pro, GPQA, and HLE (Humanity's Last Exam), with plans to expand to additional tasks over time.
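Hugging Face has not published a canonical schema in the announcement, so the sketch below is only an illustration of what registering a benchmark might involve: an Inspect AI-style eval.yaml (the field names are assumptions) uploaded to a hypothetical dataset repository using the existing huggingface_hub client.

```python
# Illustrative sketch only: the eval.yaml schema shown here is an assumption
# loosely modeled on Inspect AI task definitions, not a confirmed format.
from huggingface_hub import HfApi

EVAL_SPEC = """\
# Hypothetical benchmark specification (field names are assumptions)
task: mmlu_pro                 # Inspect AI task to run
dataset: example-org/mmlu-pro  # hypothetical dataset repo id
split: test
metrics:
  - accuracy
"""

api = HfApi()  # assumes the user is already authenticated with the Hub
api.upload_file(
    path_or_fileobj=EVAL_SPEC.encode(),
    path_in_repo="eval.yaml",
    repo_id="example-org/mmlu-pro",  # hypothetical dataset repository
    repo_type="dataset",
    commit_message="Register benchmark evaluation spec",
)
```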
Model repositories can now store evaluation scores in structured YAML files located in a .eval_results/ directory. These results appear on the model card and are automatically linked to corresponding benchmark datasets. Both results submitted by model authors and those proposed through open pull requests are aggregated. Model authors retain the ability to close pull requests or hide results associated with their models.
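The result-file format is not spelled out beyond "structured YAML in .eval_results/", so the following is a minimal sketch, assuming a simple record with benchmark, metric, and value fields (all hypothetical names), uploaded by a model author to their own repository.

```python
# Minimal sketch of a self-reported result; the YAML field names below are
# assumptions, only the .eval_results/ location comes from the announcement.
import yaml  # PyYAML
from huggingface_hub import HfApi

result = {
    "benchmark": "example-org/mmlu-pro",  # hypothetical benchmark dataset id
    "metric": "accuracy",
    "value": 0.712,
    "source": "self-reported",
}

api = HfApi()
api.upload_file(
    path_or_fileobj=yaml.safe_dump(result).encode(),
    path_in_repo=".eval_results/mmlu-pro.yaml",  # hypothetical file name
    repo_id="example-org/example-model",         # the author's own model repo
    repo_type="model",
    commit_message="Add MMLU-Pro evaluation result",
)
```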
The system also allows any Hub user to submit evaluation results for a model via pull request. Community-submitted scores are labeled accordingly and can reference external sources such as research papers, model cards, third-party evaluation platforms, or evaluation logs. Because the Hub operates on Git, all changes to evaluation files are versioned, providing a record of when results were added or modified and by whom. Discussions about reported scores can take place directly within pull request threads.
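Because submissions are ordinary Git commits, a community member can propose a score for someone else's model by opening a pull request rather than pushing directly. Below is a hedged sketch using huggingface_hub's create_pr flag, reusing the assumed YAML layout from above; the target repository and file name are hypothetical.

```python
# Sketch of a community-submitted result: the same kind of YAML file, but
# proposed as a pull request against another user's model repository.
from huggingface_hub import CommitOperationAdd, HfApi

RESULT_YAML = """\
benchmark: example-org/mmlu-pro        # hypothetical field names
metric: accuracy
value: 0.698
source: https://example.org/eval-logs  # e.g. paper, eval logs, third-party run
"""

api = HfApi()
commit = api.create_commit(
    repo_id="someone-else/their-model",  # hypothetical target model repo
    repo_type="model",
    operations=[
        CommitOperationAdd(
            path_in_repo=".eval_results/mmlu-pro-community.yaml",
            path_or_fileobj=RESULT_YAML.encode(),
        )
    ],
    commit_message="Propose community MMLU-Pro result",
    create_pr=True,  # opens a pull request instead of pushing directly
)
print(commit.pr_url)  # discussion happens in the PR thread on the Hub
```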
Hugging Face said the feature aims to address inconsistencies in reported benchmark results across papers, model cards, and evaluation platforms. While traditional benchmarks remain widely used, many are heavily saturated, and reported scores for the same model can vary depending on the evaluation setup. By linking model repositories and benchmark datasets through reproducible specifications and visible submission histories, the new system seeks to make evaluation reporting more consistent and traceable.
Early reactions on X and Reddit were limited but largely positive. Users welcomed the move toward decentralized, transparent evaluation reporting, with some highlighting the value of community-submitted scores over single benchmark metrics.
AI and tech educator Himanshu Kumar commented:
Model evaluations need better standardization, and Hugging Face’s Community Evals could help with that.
Meanwhile, user @rm-rf-rm shared:
The likes of LMArena have ruined model development and incentivized the wrong thing. I think this will go a long way in addressing that bad dynamic.
The company emphasized that Community Evals does not replace existing benchmarks or closed evaluation processes. Instead, it provides a mechanism to expose evaluation results already produced by the community and to make them accessible through Hub APIs. This could allow external tools to build dashboards, curated leaderboards, or comparative analyses using standardized data.
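The announcement does not describe the aggregation endpoints in detail, but because the result files live in plain repositories, an external tool could already read them directly. The sketch below lists and downloads .eval_results/ files from a hypothetical model repository and prints the scores, again assuming the illustrative YAML schema used earlier.

```python
# Sketch: pull all .eval_results/ files from a model repo and print the scores.
# The repo id and YAML fields are assumptions used for illustration.
import yaml
from huggingface_hub import HfApi, hf_hub_download

api = HfApi()
repo_id = "example-org/example-model"  # hypothetical model repository

for path in api.list_repo_files(repo_id, repo_type="model"):
    if path.startswith(".eval_results/") and path.endswith(".yaml"):
        local = hf_hub_download(repo_id, filename=path, repo_type="model")
        with open(local) as f:
            record = yaml.safe_load(f)
        print(f"{record['benchmark']}: {record['metric']} = {record['value']}")
```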
The feature is currently in beta. Developers can participate by adding YAML evaluation files to their model repositories or by registering dataset repositories as benchmarks with a defined evaluation specification. Hugging Face said it plans to expand the number of supported benchmarks and continue developing the system based on community feedback.
