SciArena: An Open Evaluation Platform for Non-Verifiable Scientific Literature-Grounded Tasks
TLDR: SciArena, an open and collaborative platform for evaluating foundation models on scientific literature-grounded tasks, is presented, along with SciArena-Eval, a meta-evaluation benchmark built from the collected preference data that measures how accurately models judge answer quality by comparing their pairwise assessments with human votes.
How do people cite this paper?
SciArena has been referenced as evidence of the gap between automated judges and human expert preferences in scientific reasoning, as an example of pairwise ranking expanding into specialized domains with unexpected human feedback patterns, as a community-driven voting-based evaluation platform in surveys of LLM evaluation approaches, and as motivation for the challenges LLMs face in processing and reasoning over lengthy scientific documents.