Journal of Technology Informatics and Engineering
Vol. 4 No. 3 (2025): December

Multi-Horizon GPU Demand Forecasting with Workload Semantics and Operational Risk Curves: An Empirical Study on Alibaba Clusterdata GPU Trace

Siming Zhao (Business Analytics, Columbia University, NY, USA)
Jingwen Bai (Data Science, Columbia University, NY, USA)
Drew Roberson (Computer Science, Clemson University, SC, USA)



Article Info

Publish Date
20 Dec 2025

Abstract

This study addresses multi-horizon GPU demand forecasting in large-scale computing clusters, where GPUs are costly and demand fluctuates under constraint-driven scheduling. The objective is to evaluate whether integrating workload semantics improves forecasting performance at horizons of up to 72 hours. A reproducible empirical benchmark is built on the Alibaba Clusterdata GPU trace (cluster-trace-gpu-v2023), which comprises 8,152 pods over approximately 149 days with a total capacity of 6,212 GPUs. The study compares two statistical baselines, ARIMA(48,0,0) and a seasonal-trend additive model, with three lightweight deep learning models: a Temporal Convolutional Network (TCN), Informer-lite, and TFT-lite. Workload semantics are approximated by converting hourly job metadata into textual summaries, embedding the summaries with TF-IDF followed by truncated SVD (8 dimensions), and supplying the embeddings as exogenous covariates. Evaluation uses sMAPE and MASE at horizons from 1 to 72 hours, together with peak-aware metrics and operational risk curves. The seasonal-trend model achieves the best overall accuracy (15.34% sMAPE), and TCN is the strongest deep model (17.20% sMAPE). Semantic embeddings do not improve accuracy at short horizons (1–48 hours), but they reduce 72-hour sMAPE by 11.1% and improve peak-window error. These findings indicate that autoregressive signals dominate short-term forecasting, whereas semantic context becomes beneficial at longer horizons. The study concludes that point accuracy should be combined with risk-based evaluation when planning GPU capacity under dynamic and uncertain demand.
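For readers unfamiliar with the two point-accuracy metrics named in the abstract, the following is a minimal sketch of how sMAPE and MASE are commonly computed. The abstract does not state which sMAPE variant or which MASE scaling the study uses, so the formulas below (the mean-of-absolutes sMAPE denominator and a seasonal-naive scale with a 24-hour period) are assumptions, and the function names and the `season` parameter are illustrative only.

```python
def smape(actual, forecast):
    """Symmetric MAPE in percent: mean of |a - f| / ((|a| + |f|) / 2).

    Assumed variant; the study's exact definition is not given in the abstract.
    """
    if len(actual) != len(forecast) or len(actual) == 0:
        raise ValueError("actual and forecast must be non-empty and equal length")
    total = 0.0
    for a, f in zip(actual, forecast):
        denom = (abs(a) + abs(f)) / 2.0
        # Convention assumed here: a point where both values are zero adds no error.
        total += 0.0 if denom == 0 else abs(a - f) / denom
    return 100.0 * total / len(actual)


def mase(actual, forecast, train, season=24):
    """MAE of the forecast divided by the in-sample MAE of a seasonal-naive forecast.

    `season=24` assumes hourly data with a daily cycle (an illustrative default).
    """
    if len(actual) != len(forecast) or len(actual) == 0:
        raise ValueError("actual and forecast must be non-empty and equal length")
    if len(train) <= season:
        raise ValueError("training series must be longer than the seasonal period")
    naive_errors = [abs(train[i] - train[i - season]) for i in range(season, len(train))]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("seasonal-naive error is zero; MASE is undefined")
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / scale


if __name__ == "__main__":
    # Toy hourly GPU-demand values, made up for illustration only.
    history = [1, 2, 3, 4, 5, 6]
    observed = [7, 8]
    predicted = [6, 10]
    print(smape(observed, predicted))
    print(mase(observed, predicted, history, season=2))
```

A MASE below 1 means the forecast beats the seasonal-naive baseline on average, which makes it a useful companion to sMAPE when demand levels differ widely across horizons.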

Copyrights © 2025






Journal Info

Abbrev

jtie

Publisher

Subject

Computer Science & IT

Description

Power Engineering, Telecommunication Engineering, Computer Engineering, Control and Computer Systems, Electronics, Information technology, Informatics, Data and Software engineering, Biomedical ...