A case study on hybrid machine learning and quantum-informed modelling for solubility prediction of drug compounds in organic solvents

Wang, Weiling and Cooley, Isabel and Alexander, Morgan R. and Wildman, Ricky D. and Croft, Anna K. and Johnston, Blair F. (2026) A case study on hybrid machine learning and quantum-informed modelling for solubility prediction of drug compounds in organic solvents. Digital Discovery. ISSN 2635-098X (https://doi.org/10.1039/d5dd00456j)

Text. Filename: Wang-etal-DD-2026-A-case-study-on-hybrid-machine-learning-and-quantum-informed-modelling.pdf
Final Published Version
License: Creative Commons Attribution 4.0

Abstract

Solubility is a physicochemical property that plays a critical role in pharmaceutical formulation and processing. While COSMO-RS offers physics-based solubility estimates, its computational cost limits large-scale application. Building on earlier attempts to incorporate COSMO-RS-derived solubilities into machine learning (ML) models, we present a substantially expanded and systematic hybrid QSAR framework that advances the field in several novel ways. A direct comparison between COSMOtherm and openCOSMO revealed consistent hybrid augmentation across COSMO engines and enhanced reproducibility. Three widely used ML algorithms, eXtreme Gradient Boosting, Random Forest, and Support Vector Machine, were benchmarked under both 10-fold and leave-one-solute-out cross-validation. Four major descriptor sets (MOE, Mordred, RDKit descriptors, and Morgan fingerprints) were compared, offering the first descriptor-level assessment of how augmentation with COSMO-RS-calculated solubility interacts with diverse chemical feature spaces. Y-scrambling was conducted to confirm that the hybrid improvements are genuine and not artefacts of dimensionality. SHAP-based feature analysis further revealed substructural patterns linked to solubility, providing interpretability and mechanistic insight. This study demonstrates that combining physics-informed features with robust, interpretable ML algorithms enables scalable and generalisable solubility prediction, supporting data-driven pharmaceutical design.
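
The validation scheme named in the abstract (hybrid features, leave-one-solute-out cross-validation, Y-scrambling) can be illustrated with a minimal sketch. This is not the authors' code: the data are synthetic, the `cosmo_estimate` column is a hypothetical stand-in for a COSMO-RS-calculated solubility, and the helper `loso_r2` and all variable names are assumptions made for illustration. It uses scikit-learn's `LeaveOneGroupOut` with a Random Forest regressor.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict


def loso_r2(X, y, groups, seed=0):
    """R^2 of out-of-fold predictions, holding out one solute at a time."""
    model = RandomForestRegressor(n_estimators=100, random_state=seed)
    pred = cross_val_predict(model, X, y, groups=groups, cv=LeaveOneGroupOut())
    return r2_score(y, pred)


rng = np.random.default_rng(0)
n_solutes, n_solvents = 12, 10
n = n_solutes * n_solvents

# Each row is one solute-solvent pair; the solute index is the CV group.
solute_id = np.repeat(np.arange(n_solutes), n_solvents)

# Synthetic molecular descriptors (placeholder for MOE/Mordred/RDKit/Morgan).
descriptors = rng.normal(size=(n, 5))

# Synthetic target: partly explained by descriptors, partly by hidden physics.
log_s = 0.5 * descriptors[:, 0] + rng.normal(size=n)

# Hypothetical physics-based estimate: the target plus moderate noise.
cosmo_estimate = log_s + 0.3 * rng.normal(size=n)

# Hybrid feature matrix: descriptors augmented with the physics-based column.
X_base = descriptors
X_hybrid = np.column_stack([descriptors, cosmo_estimate])

r2_base = loso_r2(X_base, log_s, solute_id)
r2_hybrid = loso_r2(X_hybrid, log_s, solute_id)

# Y-scrambling: permute the targets so any feature-target link is destroyed.
# A model that still scores well here would indicate a chance correlation.
log_s_scrambled = rng.permutation(log_s)
r2_scrambled = loso_r2(X_hybrid, log_s_scrambled, solute_id)

print(f"descriptors only : R2 = {r2_base:.2f}")
print(f"hybrid           : R2 = {r2_hybrid:.2f}")
print(f"hybrid, scrambled: R2 = {r2_scrambled:.2f}")
```

Grouping the folds by solute means every solvent measurement for a held-out compound is excluded from training, which is a stricter test of generalisation to unseen solutes than random 10-fold splitting. In this toy setup the hybrid model should score well above both the descriptor-only baseline and the scrambled control; the magnitudes say nothing about the results reported in the paper.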

ORCID iDs

Wang, Weiling ORCID: https://orcid.org/0000-0001-6111-6945, Cooley, Isabel, Alexander, Morgan R., Wildman, Ricky D., Croft, Anna K. and Johnston, Blair F. ORCID: https://orcid.org/0000-0001-9785-6822