GeoEval : benchmark for evaluating LLMs and multi-modal models on geometry problem-solving

Zhang, Jiaxin and Li, Zhong Zhi and Zhang, Ming Liang and Yin, Fei and Liu, Cheng Lin and Moshfeghi, Yashar; Ku, Lun-Wei and Martins, Andre and Srikumar, Vivek, eds. (2024) GeoEval : benchmark for evaluating LLMs and multi-modal models on geometry problem-solving. In: 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024. Proceedings of the Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics (ACL), THA, pp. 1258-1276. ISBN 9798891760998 (https://doi.org/10.18653/v1/2024.findings-acl.73)

Text. Filename: Zhang-etal-ACL-Benchmark-for-Evaluating-LLMs-and-Multi-Modal-Models-on-Geometry-Problem-Solving.pdf
Final Published Version

Abstract

Recent advancements in large language models (LLMs) and multi-modal models (MMs) have demonstrated their remarkable capabilities in problem-solving. Yet, their proficiency in tackling geometry math problems, which necessitates an integrated understanding of both textual and visual information, has not been thoroughly evaluated. To address this gap, we introduce the GeoEval benchmark, a comprehensive collection that includes a main subset of 2,000 problems, a 750-problem subset focusing on backward reasoning, an augmented subset of 2,000 problems, and a hard subset of 300 problems. This benchmark facilitates a deeper investigation into the performance of LLMs and MMs in solving geometry math problems. Our evaluation of ten LLMs and MMs across these varied subsets reveals that the WizardMath model excels, achieving a 55.67% accuracy rate on the main subset but only a 6.00% accuracy on the hard subset. This highlights the critical need for testing models against datasets on which they have not been pre-trained. Additionally, our findings indicate that GPT-series models perform more effectively on problems they have rephrased, suggesting a promising method for enhancing model capabilities.
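
The subset sizes and accuracy figures quoted in the abstract can be related with a little arithmetic. The following is a minimal sketch, not code from the GeoEval repository: the names `SUBSET_SIZES` and `accuracy_percent` are hypothetical, and the count of 18 correct answers is simply the number that would correspond to 6.00% of 300 problems, not a figure reported in the paper.

```python
# Subset sizes as stated in the abstract; the dictionary name and keys
# are illustrative, not identifiers from the GeoEval codebase.
SUBSET_SIZES = {
    "main": 2000,
    "backward": 750,
    "augmented": 2000,
    "hard": 300,
}


def accuracy_percent(num_correct: int, num_problems: int) -> float:
    """Return accuracy as a percentage, rounded to two decimals."""
    if num_problems <= 0:
        raise ValueError("num_problems must be positive")
    return round(100.0 * num_correct / num_problems, 2)


# Total number of problems across the four subsets.
total_problems = sum(SUBSET_SIZES.values())
print(total_problems)  # 5050

# Assumption for illustration only: 18 correct answers on the hard subset.
print(accuracy_percent(18, SUBSET_SIZES["hard"]))  # 6.0
```

Reporting accuracy per subset, rather than pooled over all 5,050 problems, is what makes the gap between the main and hard subsets visible.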

ORCID iDs

Zhang, Jiaxin, Li, Zhong Zhi, Zhang, Ming Liang, Yin, Fei, Liu, Cheng Lin and Moshfeghi, Yashar; Ku, Lun-Wei, Martins, Andre and Srikumar, Vivek

Item metadata

Item type:	Book Section
ID code:	90796
Dates:	31 August 2024 (Published)
Subjects:	Science > Mathematics > Electronic computers. Computer science
Department:	Faculty of Science > Computer and Information Sciences; Faculty of Humanities and Social Sciences (HaSS) > Psychological Sciences and Health
Depositing user:	Pure Administrator
Date deposited:	09 Oct 2024 10:24
Last modified:	07 May 2025 16:12
Related URLs:	Scopus publication; Journal or Publication; https://github.com/GeoEval/GeoEval
URI:	https://strathprints.strath.ac.uk/id/eprint/90796