Journal of Applied Data Sciences
Vol 6, No 4: December 2025

Air Pollution Forecasting in Almaty using Ensemble Machine Learning Models

Naizabayeva, Lyazat
Sembina, Gulbakyt
Aliman, Alibek
Satymbekov, Maxatbek
Barlykbay, Nazym
Seilova, Nurgul



Article Info

Publish Date
13 Sep 2025

Abstract

This study develops a forecasting methodology for air pollution levels in Almaty, Kazakhstan, focusing on fine particulate matter (PM2.5) and carbon monoxide (CO) concentrations. Air pollution poses significant risks to public health, and Almaty’s basin location exacerbates the problem. To address the limitations of traditional statistical forecasting methods, we propose an ensemble machine learning approach that integrates Seasonal-Trend decomposition (STL) with gradient boosting algorithms to capture complex temporal and nonlinear patterns. The objective is to develop and validate an effective methodology for forecasting atmospheric air pollution in Almaty using STL decomposition, the XGBoost and LightGBM models, and their ensemble combination. The novelty lies in integrating STL decomposition with an ensemble of gradient boosting models for high-accuracy forecasting in Almaty’s complex urban environment. The dataset comprises hourly measurements from more than 20 monitoring stations, enabling seasonal and spatial analysis. Preprocessing included outlier removal, normalization, and decomposition of each time series into seasonal, trend, and residual components. The two gradient boosting models, XGBoost and LightGBM, were trained separately and then combined into a weighted ensemble, with the weights determined through cross-validation. Figures and tables illustrate the data preprocessing flow, model architectures, feature importance analysis, and predictive performance. The ensemble outperformed the individual models, achieving coefficient of determination (R²) values exceeding 0.98 for PM2.5 and 0.83 for carbon monoxide. The findings demonstrate that integrating Seasonal-Trend decomposition with ensemble learning provides a robust and effective approach to forecasting air pollution in complex urban environments.
The methodology shows strong potential for practical application in real-time air quality monitoring and warning systems, aiding policymakers and public health authorities. Future research will expand the dataset by incorporating additional factors such as traffic flow, industrial emissions, and satellite remote sensing data to enhance predictive accuracy and model interpretability.
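The weighted-ensemble step described in the abstract can be illustrated with a minimal sketch. It assumes that each cross-validation fold already provides the observed values and the held-out predictions of two models (standing in for XGBoost and LightGBM), and it selects a single blending weight by grid search on the mean R² across folds. The function names, the grid search, and the toy numbers are illustrative assumptions, not the authors' implementation or data.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def blend(pred_a, pred_b, w):
    """Weighted average of two prediction lists; w is the weight of model A."""
    return [w * a + (1.0 - w) * b for a, b in zip(pred_a, pred_b)]


def mean_cv_r2(folds, w):
    """Mean R² of the blended prediction over all folds."""
    scores = [r2_score(y, blend(a, b, w)) for y, a, b in folds]
    return sum(scores) / len(scores)


def best_weight(folds, steps=20):
    """Grid-search the blending weight in [0, 1] that maximises mean fold R²."""
    candidates = [i / steps for i in range(steps + 1)]
    return max(candidates, key=lambda w: mean_cv_r2(folds, w))


# Toy folds (hypothetical values): (observed, model A held-out, model B held-out).
# Model A is biased high by 1 and model B low by 1, so an even blend fits best.
folds = [
    ([10.0, 12.0, 15.0, 11.0], [11.0, 13.0, 16.0, 12.0], [9.0, 11.0, 14.0, 10.0]),
    ([20.0, 18.0, 25.0, 22.0], [21.0, 19.0, 26.0, 23.0], [19.0, 17.0, 24.0, 21.0]),
]

weight = best_weight(folds)
print("weight for model A:", weight)
print("mean cross-validated R2:", mean_cv_r2(folds, weight))
```

In practice the held-out predictions would come from models fitted on time-ordered training folds, since shuffled splits can leak future information into a forecasting model.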

Copyrights © 2025






Journal Info

Abbrev

JADS

Publisher

Subject

Computer Science & IT; Control & Systems Engineering; Decision Sciences, Operations Research & Management

Description

One of the current hot topics in science is data: how can datasets be used in scientific and scholarly research in a more reliable, citable and accountable way? Data is of paramount importance to scientific progress, yet most research data remains private. Enhancing the transparency of the processes ...