Air pollution caused by particulate matter (PM) has significant negative impacts on both the environment and human health. This research evaluates the effectiveness of decision tree ensemble models in predicting daily PM10 concentrations in Thiruvananthapuram, Kerala, using data from July 2017 to December 2019. Seven decision tree ensemble models are employed: Random Forest, Extra Trees, Gradient Boosting, AdaBoost, LightGBM, XGBoost, and Histogram-Based Gradient Boosting. Missing values are filled by k-nearest neighbors (kNN) imputation to produce a complete dataset suitable for model training. The models use both meteorological and air pollutant variables as predictors, and their performance is assessed with the coefficient of determination (R²), root mean square error (RMSE), and mean absolute error (MAE). The Extra Trees regression model provided the best prediction performance (R² = 0.9397, RMSE = 6.664 μg/m³, MAE = 4.950 μg/m³). Histogram-Based Gradient Boosting and Random Forest also demonstrated strong predictive capabilities. The best-performing models are explained through feature importance analysis, which highlighted sulfur dioxide (SO2) as the most significant pollutant influencing PM10 levels, alongside meteorological factors such as wind speed and rainfall; these variables contribute to both the prediction accuracy and the interpretability of the results. This research represents the first comprehensive effort to predict PM10 levels in Thiruvananthapuram using machine learning techniques, addressing a gap in regional air quality studies.
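The workflow summarized in the abstract (kNN imputation, an Extra Trees regressor, evaluation with R², RMSE and MAE, and feature importance analysis) can be illustrated with a minimal scikit-learn sketch. This is not the study's code. The data are synthetic, the feature names and the linear relationship used to generate the target are invented for illustration, and the hyperparameters (`n_neighbors=5`, `n_estimators=200`, an 80/20 split) are assumptions rather than the authors' settings.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.impute import KNNImputer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the study's measurements).
# Feature names are hypothetical placeholders for pollutant and weather inputs.
feature_names = ["SO2", "wind_speed", "rainfall", "temperature"]
n_days = 600
X = rng.normal(size=(n_days, len(feature_names)))

# Invented relationship for illustration only: the target depends on the
# first three features plus noise, and not on "temperature".
y = 50 + 8 * X[:, 0] - 5 * X[:, 1] - 3 * X[:, 2] + rng.normal(scale=2, size=n_days)

# Remove roughly 5% of feature values to mimic gaps in monitoring records.
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.05] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X_missing, y, test_size=0.2, random_state=0
)

# Fit the imputer on the training split only, then apply it to the test split,
# so that no information from the test data leaks into training.
imputer = KNNImputer(n_neighbors=5)
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)

model = ExtraTreesRegressor(n_estimators=200, random_state=0)
model.fit(X_train_imp, y_train)
y_pred = model.predict(X_test_imp)

# The three metrics named in the abstract.
r2 = r2_score(y_test, y_pred)
rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
mae = float(mean_absolute_error(y_test, y_pred))
print(f"R2={r2:.4f}  RMSE={rmse:.3f}  MAE={mae:.3f}")

# Impurity-based feature importances, sorted from most to least important.
importances = dict(zip(feature_names, model.feature_importances_))
for name, score in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Because the synthetic target is built with the largest coefficient on the placeholder `SO2` column, that column ranks first here by construction; the ranking says nothing about the real Thiruvananthapuram data. Impurity-based importances are only one way to assess feature influence, and the abstract does not state which importance measure the study used.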
Copyright © 2025