Researchers at Shanghai Jiao Tong University have developed an innovative benchmark, named SciEval, specifically designed to address the existing limitations in evaluating the scientific capabilities of Large Language Models (LLMs). The work was carried out by a team of researchers including Liangtai Sun, Yang Han, Zihan Zhao, Da Ma, Zhennan Shen, Baocai Chen, Lu Chen, and Kai Yu, who detail their findings in a paper titled “SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research.”
In the paper, the authors highlighted the limitations of existing benchmarks: they are restricted to specific scientific disciplines, lack evaluation systems dedicated to assessing scientific capabilities, rely exclusively on objective questions, and carry a risk of data leakage. To overcome these limitations, the authors developed SciEval, a benchmark designed to provide a comprehensive, multi-disciplinary evaluation of LLMs. SciEval covers four key dimensions: basic knowledge, knowledge application, scientific calculation, and research ability. It comprises approximately 18,000 challenging scientific questions spanning chemistry, physics, and biology, with each field further divided into multiple sub-topics. Notably, SciEval includes both objective and subjective questions, a feature that sets it apart from other benchmarks. In addition, the authors implemented dynamic data generation to prevent potential data leakage and ensure the fairness and credibility of the evaluation results.
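To make the benchmark's organization more concrete, the sketch below shows one way a multi-dimensional question set like this could be represented and scored in Python. The field names (`discipline`, `ability`, `question_type`) and the exact-match scoring are illustrative assumptions for this article, not the authors' released data format or evaluation code.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SciEvalItem:
    """Hypothetical record for one SciEval question (field names are illustrative)."""
    discipline: str        # "chemistry", "physics", or "biology"
    sub_topic: str         # finer-grained topic within the discipline
    ability: str           # "basic knowledge", "knowledge application",
                           # "scientific calculation", or "research ability"
    question_type: str     # "objective" (e.g., multiple choice) or "subjective"
    question: str
    answer: Optional[str]  # gold answer; subjective items may need separate grading

def objective_accuracy(items: List[SciEvalItem], predictions: List[str]) -> float:
    """Score only the objective questions by exact match against the gold answer."""
    scored = [
        pred.strip().lower() == item.answer.strip().lower()
        for item, pred in zip(items, predictions)
        if item.question_type == "objective" and item.answer is not None
    ]
    return sum(scored) / len(scored) if scored else 0.0
```

A real evaluation of subjective questions would require a rubric or model-based grading rather than exact match; the function above is only meant to illustrate how the objective portion of such a benchmark could be aggregated.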
The experiments conducted by the authors revealed insightful findings. While GPT-4 emerged as the strongest model among those evaluated, the results indicate that there is substantial room for improvement across all models, particularly in the physics domain and in the analysis of experimental results. These findings underscore the need for continued research and development in this area. The authors expressed their hope that SciEval will serve as an effective and widely adopted benchmark for assessing the scientific capabilities of LLMs, ultimately promoting their wide application in the scientific community. Those interested can access the resources via the following link: SciEval on arXiv.