Bulletin of Electrical Engineering and Informatics
Vol 11, No 2: April 2022

Scalable epidemic message passing interface fault tolerance

Soma Sekhar Kolisetty (VIT-AP University)
Battula Srinivasa Rao (VIT-AP University)



Article Info

Publish Date
01 Apr 2022

Abstract

Resilience and fault tolerance are challenging tasks in the field of high performance computing (HPC) and extreme scale systems. Components fail more often in such systems, results in application abort. Adopting fault–tolerance techniques can be consistently detect failures and continue application’s execution even if the failures exist. A prominent parallel programming specification, message passing interface (MPI), as it would be used to implement failure detection and consensus algorithm in this paper. Although the MPI does not facilitate fault tolerant behavior, this work presents a fault tolerant, matrix based failure detection and consensus algorithm. The proposed algorithm uses Gossiping. To detect failures, randomised pinging will be applied during the execution of the algorithm by using piggybacked gossip messages. In order to achieve consensus on the failures in the system, failed processes’ information will be sent using the same piggybacked gossip messages to all the alive processes. The algorithm was implemented in MPI framework and is completely fault tolerant. The results exhibit all the MPI process failures were detected using randomised pinging and global consensus has achieved on failed MPI process in the system.

Copyrights © 2022






Journal Info

Abbrev

EEI

Publisher

Subject

Electrical & Electronics Engineering

Description

Bulletin of Electrical Engineering and Informatics (Buletin Teknik Elektro dan Informatika) ISSN: 2089-3191, e-ISSN: 2302-9285 is open to submission from scholars and experts in the wide areas of electrical, electronics, instrumentation, control, telecommunication and computer engineering from the ...