Blinded predictions and post-hoc analysis of the second solubility challenge data : exploring training data and feature set selection for machine and deep learning models
Conn, Jonathan G. M. and Carter, James W. and Conn, Justin J. A. and Subramanian, Vigneshwari and Baxter, Andrew and Engkvist, Ola and Llinas, Antonio and Ratkova, Ekaterina L. and Pickett, Stephen D. and McDonagh, James L and Palmer, David S. (2023) Blinded predictions and post-hoc analysis of the second solubility challenge data : exploring training data and feature set selection for machine and deep learning models. Journal of Chemical Information and Modeling, 63 (4). pp. 1099-1113. ISSN 1549-9596 (https://doi.org/10.1021/acs.jcim.2c01189)
Filename: Conn_etal_JCIM_2023_Blinded_predictions_and_post_hoc_analysis_of_the_second_solubility_challenge_data.pdf
Final Published Version (5MB)
Abstract
Accurate methods to predict solubility from molecular structure are highly sought after in the chemical sciences. To assess the state of the art, the American Chemical Society organised a “Second Solubility Challenge” in 2019, in which competitors were invited to submit blinded predictions of the solubilities of 132 drug-like molecules. In the first part of this article, we describe the development of two models that were submitted to the Blind Challenge in 2019, but which have not previously been reported. These models were based on computationally inexpensive molecular descriptors and traditional machine learning algorithms, and were trained on a relatively small dataset of 300 molecules. In the second part of the article, to test the hypothesis that predictions would improve with more advanced algorithms and higher volumes of training data, we compare these original predictions with those made after the deadline using deep learning models trained on larger solubility datasets consisting of 2999 and 5697 molecules. The results show that several algorithms are able to obtain near state-of-the-art performance on the solubility challenge datasets, with the best model, a graph convolutional neural network, resulting in an RMSE of 0.86 log units. Critical analysis of the models reveals systematic differences between the performance of models using certain feature sets and training datasets. The results suggest that careful selection of high quality training data from relevant regions of chemical space is critical for prediction accuracy, but that other methodological issues remain problematic for machine learning solubility models, such as the difficulty in modelling complex chemical spaces from sparse training datasets.
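The abstract reports model accuracy as a root-mean-square error (RMSE) in log units. As a minimal sketch of how such a figure is computed, the example below evaluates the RMSE between predicted and observed log solubilities. The function name `rmse` and the four log S values are hypothetical and chosen purely for illustration; they are not data or code from the article.

```python
import math


def rmse(predicted, observed):
    """Root-mean-square error between two equal-length sequences."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have the same length")
    if not predicted:
        raise ValueError("sequences must be non-empty")
    # Square each prediction error, average the squares, take the root.
    squared_errors = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical log S values (illustrative only, not taken from the article).
observed_log_s = [-3.0, -4.5, -2.0, -5.0]
predicted_log_s = [-2.0, -4.5, -3.0, -5.0]

example_rmse = rmse(predicted_log_s, observed_log_s)
print(f"RMSE = {example_rmse:.2f} log units")
```

Because the errors are squared before averaging, a few large mispredictions raise the RMSE more than many small ones, which is worth keeping in mind when comparing models on a test set of only 132 molecules.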
ORCID iDs
Conn, Jonathan G. M., Carter, James W., Conn, Justin J. A. ORCID: https://orcid.org/0000-0002-1772-1539, Subramanian, Vigneshwari, Baxter, Andrew, Engkvist, Ola, Llinas, Antonio, Ratkova, Ekaterina L., Pickett, Stephen D., McDonagh, James L and Palmer, David S. ORCID: https://orcid.org/0000-0003-4356-9144
Item type: Article
ID code: 83974
Dates:
  27 February 2023: Published
  9 February 2023: Published Online
  24 January 2023: Accepted
Subjects: Technology > Chemical engineering; Science > Mathematics > Electronic computers. Computer science
Department: Faculty of Science > Pure and Applied Chemistry
Depositing user: Pure Administrator
Date deposited: 02 Feb 2023 09:30
Last modified: 11 Nov 2024 13:46
URI: https://strathprints.strath.ac.uk/id/eprint/83974