Towards Self-Healing Cloud Infrastructures: Predictive Maintenance with Reinforcement Learning and Generative Models
Kunal Shah, Jyoti; Matam, Prashanthi
International Journal of Engineering, Science and Information Technology Vol 5, No 3 (2025)
Publisher: Malikussaleh University, Aceh, Indonesia

DOI: 10.52088/ijesty.v5i3.1185

Abstract

Reinforcement Learning (RL) is quickly becoming a powerful way to predict failures in large cloud environments before they happen and to improve the systems that run there. Unlike traditional reactive methods, RL lets agents learn the best actions by interacting with changing environments, using reward signals to improve system uptime, resource use, and reliability. As cloud-based big data systems grow larger and more complex, they also become more prone to problems that slow them down or cause them to fail unpredictably. Dealing with these problems takes more than advanced failure prediction algorithms; it also takes adaptive, explainable systems that help people understand what is going on and step in when necessary. This paper investigates how RL can help predict and manage failures in cloud-based big data systems. We propose a layered architecture that combines RL agents with generative explanation models to predict failures and act to prevent them. We focus on real-time feedback loops, autonomous learning, and interpretable outputs, which matter especially in anomaly detection pipelines, where explanations need to be detailed but short. Using examples from real-world hyperscale data centers, we show how RL agents can find patterns of risk and take steps to avoid them. We also examine how generative models, such as transformer-based language generators, can turn complicated telemetry data into information that people can understand. We close by suggesting areas for future research, such as safe RL deployment, multi-agent coordination, and explainable policy design.