Development of a novel imputation framework for PM2.5 particle data in Pakistani cities using machine learning and statistical techniques
Khan, Muhammad Asad and Pan, Jiazhu and Alshatti, Amani and Alsaber, Ahmad and Gray, Alison (2026) Development of a novel imputation framework for PM2.5 particle data in Pakistani cities using machine learning and statistical techniques. Frontiers in Environmental Science, 14. 1775982. (https://doi.org/10.3389/fenvs.2026.1775982)
Abstract
Introduction: Missing PM2.5 observations in environmental monitoring systems, caused by sensor malfunctions, communication failures, maintenance issues, and coverage gaps, compromise public health assessments and evidence-based air quality policymaking. Reliable imputation strategies are therefore essential to preserve data integrity and analytical validity.

Methods: This study evaluated five imputation techniques: Bayesian Regression (BR), K-Nearest Neighbors (KNN), missForest, Predictive Mean Matching (PMM), and Random Forest (RF), using daily PM2.5 measurements collected between May 2019 and December 2024 from monitoring stations in Islamabad, Karachi, Lahore, and Peshawar, Pakistan. Three missing data mechanisms, MCAR, MAR, and MNAR, were simulated at missing rates ranging from 5% to 25%. Model performance was assessed using Root Mean Square Error (RMSE) and Mean Absolute Error (MAE).

Results: Imputation under the MAR mechanism consistently yielded lower error values as missingness increased. Across all mechanisms and missing rates, missForest and KNN demonstrated superior performance. Notably, missForest achieved the lowest RMSE and MAE values overall and effectively preserved the temporal structure, range, and variability of the PM2.5 series.

Discussion: The findings suggest that machine-learning-based approaches, particularly missForest, provide robust and reliable imputation for PM2.5 datasets with varying missingness patterns. These results support the use of missForest as a preferred method for handling incomplete air quality data in similar monitoring contexts, thereby strengthening the reliability of environmental health analyses and air quality policy development.
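The evaluation design summarised in the Methods (hide known values, impute them, then score the imputations with RMSE and MAE) can be illustrated with a minimal sketch. Everything here is an assumption for illustration and not the authors' code: the series is synthetic, not the Pakistani station data; only the MCAR mechanism is simulated; the 5-percentage-point steps between 5% and 25% are assumed; and a simple mean fill stands in for the five methods compared in the study. The helper names `mask_mcar` and `impute_mean` are hypothetical.

```python
import math
import random


def mask_mcar(values, rate, seed=0):
    """Hide a fraction `rate` of entries uniformly at random (MCAR).

    Returns the masked copy (None marks a hidden value) and the hidden indices.
    """
    rng = random.Random(seed)
    n_missing = round(rate * len(values))
    hidden = sorted(rng.sample(range(len(values)), n_missing))
    masked = list(values)
    for i in hidden:
        masked[i] = None
    return masked, hidden


def impute_mean(masked):
    """Placeholder imputer: fill gaps with the mean of the observed values.

    This is a stand-in baseline, not one of the methods evaluated in the study.
    """
    observed = [v for v in masked if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in masked]


def rmse(truth, imputed, hidden):
    """Root Mean Square Error over the artificially hidden positions only."""
    return math.sqrt(sum((truth[i] - imputed[i]) ** 2 for i in hidden) / len(hidden))


def mae(truth, imputed, hidden):
    """Mean Absolute Error over the artificially hidden positions only."""
    return sum(abs(truth[i] - imputed[i]) for i in hidden) / len(hidden)


# Synthetic daily series with a seasonal cycle plus noise (illustrative only).
noise = random.Random(42)
series = [60 + 30 * math.sin(2 * math.pi * d / 365) + noise.gauss(0, 10)
          for d in range(730)]

results = {}
for rate in (0.05, 0.10, 0.15, 0.20, 0.25):  # assumed step size
    masked, hidden = mask_mcar(series, rate, seed=1)
    filled = impute_mean(masked)
    results[rate] = (rmse(series, filled, hidden), mae(series, filled, hidden))
    print(f"rate={rate:.2f}  RMSE={results[rate][0]:.2f}  MAE={results[rate][1]:.2f}")
```

Scoring only the hidden positions is the key design choice: the true values are known there, so the errors measure imputation quality and not how well the observed data are reproduced. MAR and MNAR masks would need the probability of hiding a value to depend on other observed variables or on the value itself, which this sketch does not attempt.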
ORCID iDs
Khan, Muhammad Asad; Pan, Jiazhu (ORCID: https://orcid.org/0000-0001-7346-2052); Alshatti, Amani; Alsaber, Ahmad; and Gray, Alison (ORCID: https://orcid.org/0000-0002-6273-0637)
Item type: Article
ID code: 95611
Dates: 19 February 2026 (Published); 30 January 2026 (Accepted)
Subjects: Geography. Anthropology. Recreation > Environmental Sciences; Science > Mathematics > Probabilities. Mathematical statistics
Department: Faculty of Science > Mathematics and Statistics
Depositing user: Pure Administrator
Date deposited: 20 Feb 2026 10:39
Last modified: 06 Mar 2026 09:30
URI: https://strathprints.strath.ac.uk/id/eprint/95611