SciArena: An Open Evaluation Platform for Non-Verifiable Scientific Literature-Grounded Tasks
TLDR: SciArena, an open and collaborative platform for evaluating foundation models on scientific literature-grounded tasks, is presented, along with SciArena-Eval, a meta-evaluation benchmark built from the collected preference data that measures how accurately models judge answer quality by comparing their pairwise assessments with human votes.
How do people cite this paper?
SciArena has been referenced as evidence of the gap between automated judges and human expert preferences in scientific reasoning, as an example of pairwise ranking expanding into specialized domains with unexpected human feedback patterns, as a community-driven voting-based evaluation platform in surveys of LLM evaluation approaches, and as motivation for the challenges LLMs face in processing and reasoning over lengthy scientific documents.