Soma Sekhar Kolisetty
VIT-AP University

Published: 2 Documents
Articles


Scalable epidemic message passing interface fault tolerance Soma Sekhar Kolisetty; Battula Srinivasa Rao
Bulletin of Electrical Engineering and Informatics Vol 11, No 2: April 2022
Publisher : Institute of Advanced Engineering and Science

DOI: 10.11591/eei.v11i2.3374

Abstract

Resilience and fault tolerance are challenging in high performance computing (HPC) and extreme-scale systems. Components fail more often in such systems, which causes applications to abort. Fault-tolerance techniques make it possible to detect failures consistently and to continue an application's execution even when failures occur. This paper uses the message passing interface (MPI), a prominent parallel programming specification, to implement a failure detection and consensus algorithm. Although MPI does not itself provide fault-tolerant behavior, this work presents a fault-tolerant, matrix-based failure detection and consensus algorithm. The proposed algorithm uses gossiping. To detect failures, randomised pinging is applied during the execution of the algorithm using piggybacked gossip messages. To achieve consensus on the failures in the system, information about failed processes is sent to all alive processes in the same piggybacked gossip messages. The algorithm was implemented in the MPI framework and is completely fault tolerant. The results show that all MPI process failures were detected using randomised pinging and that global consensus was achieved on the failed MPI processes in the system.
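The detection and consensus idea in this abstract can be illustrated with a small single-machine simulation. This is only a sketch under simplifying assumptions, not the authors' MPI implementation: the function name `simulate_gossip_detection`, the cycle structure, and the push-only merge are hypothetical, a ping to a failed process is assumed to time out immediately, and each process keeps one row of the failure matrix rather than a full matrix.

```python
import random

def simulate_gossip_detection(n, failed, seed=0, max_cycles=1000):
    """Simulate randomised pinging with piggybacked failure information.

    known[i][j] is True when process i believes process j has failed.
    Returns (cycles, known), where cycles is the number of gossip cycles
    until every alive process knows every failure, or None if the cycle
    limit was reached first. Requires n >= 2.
    """
    rng = random.Random(seed)
    failed = set(failed)
    alive = [p for p in range(n) if p not in failed]
    known = [[False] * n for _ in range(n)]

    for cycle in range(1, max_cycles + 1):
        for p in alive:
            # Each alive process pings one random peer per cycle.
            target = rng.choice([q for q in range(n) if q != p])
            if target in failed:
                # Assumption: the ping times out, so p records the failure.
                known[p][target] = True
            else:
                # The ping piggybacks p's failure list; the target merges it.
                for j in range(n):
                    if known[p][j]:
                        known[target][j] = True
        # Consensus: all alive processes agree on the full set of failures.
        if all(known[p][f] for p in alive for f in failed):
            return cycle, known
    return None, known

if __name__ == "__main__":
    cycles, _ = simulate_gossip_detection(16, {3, 7}, seed=1)
    print("consensus reached after", cycles, "cycles")
```

In this sketch an entry is set only after a ping to a failed process or after merging an entry that was set that way, so no alive process is ever marked as failed.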
Efficient failure detection and consensus at extreme-scale systems Soma Sekhar Kolisetty; Battula Srinivasa Rao
International Journal of Electrical and Computer Engineering (IJECE) Vol 12, No 5: October 2022
Publisher : Institute of Advanced Engineering and Science

DOI: 10.11591/ijece.v12i5.pp5339-5347

Abstract

Distributed systems and extreme-scale systems have become ubiquitous in recent years and are seen throughout academic organizations, business, home, and government sectors. Peer-to-peer (P2P) technology is a typical distributed system model that is gaining popularity for delivering computing resources and services. Distributed systems try to increase their availability in the event of frequent component failures, and keeping the system functioning in such scenarios is notoriously difficult. In order to identify component failures in the system and achieve global agreement (consensus) on the failed components, this paper implements an efficient failure detection and consensus algorithm based on fail-stop process failures. The proposed algorithm is fault-tolerant to process failures occurring before and during the execution of the algorithm. The proposed algorithm works with the epidemic gossip protocol, a randomized paradigm of computation and communication that is both fault-tolerant and scalable. A simulation of an extreme-scale information dissemination process shows that global agreement can be achieved. A P2P simulator, PeerSim, is used in the paper to implement and test the proposed algorithm. The results show that the proposed algorithm is highly scalable and detects all process failures. The status of all the processes is maintained in a Boolean matrix.
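The scalability claim for epidemic gossip can be illustrated with a short push-gossip simulation. This is a sketch under stated assumptions and does not use PeerSim or reproduce the paper's experiments: the function name `rounds_to_inform_all` is hypothetical, rounds are synchronous, every informed process contacts one uniformly random process per round (possibly itself), and no process fails.

```python
import random

def rounds_to_inform_all(n, seed=0):
    """Count synchronous push-gossip rounds until all n processes are informed.

    Process 0 starts with the information. In each round, every informed
    process sends it to one process chosen uniformly at random.
    """
    rng = random.Random(seed)
    informed = {0}
    rounds = 0
    while len(informed) < n:
        # Targets are chosen from the state at the start of the round.
        newly_informed = {rng.randrange(n) for _ in informed}
        informed |= newly_informed
        rounds += 1
    return rounds

if __name__ == "__main__":
    for size in (16, 256, 4096):
        print(size, "processes:", rounds_to_inform_all(size), "rounds")
```

Because the informed set can at most double in each round, at least log2(n) rounds are needed; the interest of the gossip approach is that the observed count typically stays within a small multiple of that bound as n grows.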