The growing deployment of Graphics Processing Units (GPUs) across data centers, AI workloads, and cryptocurrency mining operations has elevated the importance of scalable, accurate, and real-time diagnostic mechanisms for hardware quality assurance (QA). Traditional factory QA processes are manual, time-consuming, and poorly adapted to detecting subtle performance degradation. This study proposes an automated diagnostic pipeline that leverages publicly available GPU telemetry-like data, including hashrate, power draw, and efficiency metrics, to simulate factory-grade fault detection. Using the Kaggle “GPU Performance and Hashrate” dataset, we implement a machine learning framework combining XGBoost for anomaly classification with Long Short-Term Memory (LSTM) neural networks for temporal efficiency forecasting. Anomalies are heuristically labeled by flagging GPUs in the bottom 10% of the efficiency distribution, simulating factory fault flags. The XGBoost model achieves perfect accuracy on the test set with full interpretability via SHAP values, while the LSTM model captures degradation trends, achieving low training loss and producing forecast visualizations. The framework is implemented in Google Colab to ensure accessibility and reproducibility. Diagnostic outputs include efficiency analysis, prediction overlays, and automated GPU health reports. Comparative results show higher efficiency variance in GeForce GPUs versus the more stable performance of data center models, highlighting differences between hardware classes. Although limitations remain, such as the reliance on simulated labels and static time windows, the study demonstrates the feasibility of scalable, ML-driven diagnostics using real-world data. This approach has direct applications in early fault detection, GPU fleet management, and embedded QA systems in both production and deployment environments.
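As a concrete illustration of the labeling and classification steps summarized above, the following Python sketch derives a heuristic fault flag from the bottom 10% of the efficiency distribution and trains an XGBoost classifier on it. The column names, the synthetic stand-in data, and the model hyperparameters are assumptions chosen for illustration; they do not reproduce the study's exact feature set or configuration.

```python
# Minimal sketch of heuristic anomaly labeling + XGBoost classification.
# Assumed column names ("hashrate_mhs", "power_w", "efficiency") are placeholders;
# the actual Kaggle "GPU Performance and Hashrate" schema may differ.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

rng = np.random.default_rng(42)

# Synthetic stand-in for the Kaggle dataset (hashrate in MH/s, power in watts).
df = pd.DataFrame({
    "hashrate_mhs": rng.normal(60, 15, 1000).clip(min=1),
    "power_w": rng.normal(220, 40, 1000).clip(min=50),
})
df["efficiency"] = df["hashrate_mhs"] / df["power_w"]  # MH/s per watt

# Heuristic fault flag: GPUs in the bottom 10% of the efficiency distribution.
threshold = df["efficiency"].quantile(0.10)
df["anomaly"] = (df["efficiency"] <= threshold).astype(int)

X = df[["hashrate_mhs", "power_w", "efficiency"]]
y = df["anomaly"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Hyperparameters here are illustrative defaults, not the study's tuned values.
model = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")

# Interpretability step (requires the shap package), as referenced in the abstract:
# import shap
# explainer = shap.TreeExplainer(model)
# shap_values = explainer.shap_values(X_test)
```

Because the heuristic label is derived directly from the efficiency feature, a tree-based classifier is expected to separate the classes almost perfectly in this sketch; the temporal LSTM forecasting component described in the abstract is not shown here.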