There are significant differences among artificial intelligence large language models when answering scientific questions
Álvarez-Martínez, Francisco Javier and Esteban, Luis and Frungillo, Lucas and Butassi, Estefanía and Zambon, Alessandro and Herranz-López, María and Aranda, Mario and Pollastro, Federica and Tixier, Anne Sylvie and Garcia-Perez, Jose V and Arráez-Román, David and Ross, Andrew and Mena, Pedro and Edrada-Ebel, Ru Angelie and Lyng, James and Micol, Vicente and Borrás-Rocher, Fernando and Barrajón-Catalán, Enrique (2025) There are significant differences among artificial intelligence large language models when answering scientific questions. Frontiers in Artificial Intelligence, 8. 1664303. ISSN 2624-8212 (https://doi.org/10.3389/frai.2025.1664303)
Text: Alvarez-Martinez-FiAI-2025-significant-differences-among-artificial-intelligence-large-language-models-when-answering-scientific-questions.pdf (Final Published Version, 1MB)
License:
Abstract
Introduction: This study investigates the efficacy of large language models (LLMs) for generating accurate scientific responses through a comparative evaluation of five prominent free models: Claude 3.5 Sonnet, Gemini, ChatGPT 4o, Mistral Large 2, and Llama 3.1 70B.
Methods: Sixteen expert scientific reviewers assessed these models in terms of depth, accuracy, relevance, and clarity.
Results: Claude 3.5 Sonnet emerged as the highest-scoring model, followed by Gemini, with notable variability among the other models. Additionally, retrieval-augmented generation (RAG) techniques were applied to improve LLM performance, and prompts were refined to raise answer quality. The results indicate that although LLMs such as Claude 3.5 Sonnet show potential for scientific tasks, other models may require further development or additional prompt engineering to reach comparable accuracy. Reviewers' perceptions of artificial intelligence (AI) utility and trustworthiness shifted positively after the evaluation. However, ethical concerns, particularly with respect to transparency and disclosure, remained consistent.
Discussion: The study highlights the need for structured frameworks for evaluating LLMs and for the ethical considerations essential to responsible AI integration in scientific research. These findings should be interpreted with caution, as the limited sample size and domain-specific focus of the exam questions restrict the generalizability of the results.
ORCID iDs
Álvarez-Martínez, Francisco Javier, Esteban, Luis, Frungillo, Lucas, Butassi, Estefanía, Zambon, Alessandro, Herranz-López, María, Aranda, Mario, Pollastro, Federica, Tixier, Anne Sylvie, Garcia-Perez, Jose V, Arráez-Román, David, Ross, Andrew, Mena, Pedro, Edrada-Ebel, Ru Angelie (ORCID: https://orcid.org/0000-0003-2420-1117), Lyng, James, Micol, Vicente, Borrás-Rocher, Fernando and Barrajón-Catalán, Enrique
Item type: Article
ID code: 94631
Dates: 15 September 2025 (Accepted); 9 October 2025 (Published)
Subjects: Science > Mathematics > Electronic computers. Computer science
Department: Faculty of Science > Strathclyde Institute of Pharmacy and Biomedical Sciences
Depositing user: Pure Administrator
Date deposited: 04 Nov 2025 11:53
Last modified: 05 Dec 2025 18:50
URI: https://strathprints.strath.ac.uk/id/eprint/94631