A case study on hybrid machine learning and quantum-informed modelling for solubility prediction of drug compounds in organic solvents
Wang, Weiling and Cooley, Isabel and Alexander, Morgan R. and Wildman, Ricky D. and Croft, Anna K. and Johnston, Blair F. (2026) A case study on hybrid machine learning and quantum-informed modelling for solubility prediction of drug compounds in organic solvents. Digital Discovery. ISSN 2635-098X (https://doi.org/10.1039/d5dd00456j)
Text. Filename: Wang-etal-DD-2026-A-case-study-on-hybrid-machine-learning-and-quantum-informed-modelling.pdf
Final Published Version
Abstract
Solubility is a physicochemical property that plays a critical role in pharmaceutical formulation and processing. While COSMO-RS offers physics-based solubility estimates, its computational cost limits large-scale application. Building on earlier attempts to incorporate COSMO-RS-derived solubilities into machine learning (ML) models, we present a substantially expanded and systematic hybrid QSAR framework that advances the field in several novel ways. A direct comparison between COSMOtherm and openCOSMO revealed consistent hybrid augmentation across COSMO engines and enhanced reproducibility. Three widely used ML algorithms, eXtreme Gradient Boosting, Random Forest, and Support Vector Machine, were benchmarked under both 10-fold and leave-one-solute-out cross-validation. Four major descriptor sets, MOE, Mordred, and RDKit descriptors and Morgan fingerprints, were compared, offering the first descriptor-level assessment of how augmentation with COSMO-RS-calculated solubility interacts with diverse chemical feature spaces. Y-scrambling was conducted to confirm that the hybrid improvements are genuine and not artefacts of dimensionality. SHAP-based feature analysis further revealed substructural patterns linked to solubility, providing interpretability and mechanistic insight. This study demonstrates that combining physics-informed features with robust, interpretable ML algorithms enables scalable and generalisable solubility prediction, supporting data-driven pharmaceutical design.
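As a rough illustration of two validation ideas named in the abstract, the sketch below shows how leave-one-solute-out folds and a Y-scrambled target vector might be constructed. It is not the authors' implementation: the function names, the toy solute labels, and the solubility values are all hypothetical, and only the Python standard library is used.

```python
import random
from collections import defaultdict


def leave_one_solute_out(solutes):
    """Yield (held_out_solute, train_idx, test_idx), holding out one solute per fold."""
    groups = defaultdict(list)
    for i, solute in enumerate(solutes):
        groups[solute].append(i)
    for solute, test_idx in groups.items():
        held_out = set(test_idx)
        train_idx = [i for i in range(len(solutes)) if i not in held_out]
        yield solute, train_idx, test_idx


def y_scramble(y, seed=0):
    """Return a shuffled copy of the targets, breaking the feature-target link."""
    rng = random.Random(seed)
    shuffled = list(y)
    rng.shuffle(shuffled)
    return shuffled


# Hypothetical toy records: one row per solute-solvent pair.
solutes = ["drug_A", "drug_A", "drug_B", "drug_B", "drug_C"]
log_s = [-1.2, -0.8, -2.5, -2.1, -0.3]

folds = list(leave_one_solute_out(solutes))
scrambled = y_scramble(log_s, seed=42)
```

In such a setup, a model retrained on the scrambled targets would be expected to score near chance; if it does not, the apparent performance may be an artefact of the feature count rather than a real structure-property signal.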
ORCID iDs
Wang, Weiling ORCID: https://orcid.org/0000-0001-6111-6945, Cooley, Isabel, Alexander, Morgan R., Wildman, Ricky D., Croft, Anna K. and Johnston, Blair F. ORCID: https://orcid.org/0000-0001-9785-6822
Item type: Article
ID code: 95243
Dates:
7 January 2026: Published
7 January 2026: Published Online
19 December 2025: Accepted
10 October 2025: Submitted
Subjects: Medicine > Pharmacy and materia medica
Department: Faculty of Science > Strathclyde Institute of Pharmacy and Biomedical Sciences
Technology and Innovation Centre > Continuous Manufacturing and Crystallisation (CMAC)
Depositing user: Pure Administrator
Date deposited: 09 Jan 2026 14:48
Last modified: 23 Jan 2026 12:44
URI: https://strathprints.strath.ac.uk/id/eprint/95243