A Scalable Unified Fault Tolerance for High Performance Computing Environments

Title A Scalable Unified Fault Tolerance for High Performance Computing Environments
Author Kulathep Charoenpornwattana
Publisher
Pages 132
Release 2008
Genre Electronic data processing
ISBN


Fault-Tolerance Techniques for High-Performance Computing

Title Fault-Tolerance Techniques for High-Performance Computing
Author Thomas Herault
Publisher Springer
Pages 325
Release 2015-07-01
Genre Computers
ISBN 3319209434


This timely text presents a comprehensive overview of fault tolerance techniques for high-performance computing (HPC). The text opens with a detailed introduction to the concepts of checkpoint protocols and scheduling algorithms, prediction, replication, silent error detection and correction, together with some application-specific techniques such as ABFT. Emphasis is placed on analytical performance models. This is then followed by a review of general-purpose techniques, including several checkpoint and rollback recovery protocols. Relevant execution scenarios are also evaluated and compared through quantitative models. Features: provides a survey of resilience methods and performance models; examines the various sources for errors and faults in large-scale systems; reviews the spectrum of techniques that can be applied to design a fault-tolerant MPI; investigates different approaches to replication; discusses the challenge of energy consumption of fault-tolerance methods in extreme-scale systems.
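As an illustration of the kind of analytical performance model the description refers to, the sketch below uses Young's well-known first-order approximation for the checkpoint period. Whether the book presents the model in exactly this form is an assumption; the function names and the numbers in the demo are hypothetical.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """First-order optimal period between checkpoints (Young's approximation).

    checkpoint_cost: time to write one checkpoint, in seconds.
    mtbf: mean time between failures of the platform, in seconds.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def waste_fraction(period: float, checkpoint_cost: float, mtbf: float) -> float:
    """First-order fraction of time lost for a given checkpoint period.

    The first term is the checkpointing overhead; the second is the expected
    rework after a failure (on average half a period is lost).
    Recovery and downtime costs are ignored in this simplified model.
    """
    return checkpoint_cost / period + period / (2.0 * mtbf)


if __name__ == "__main__":
    cost = 600.0    # hypothetical: 10 minutes to write a checkpoint
    mtbf = 86400.0  # hypothetical: one failure per day on average
    period = young_interval(cost, mtbf)
    print(f"period = {period:.0f} s, waste = {waste_fraction(period, cost, mtbf):.3f}")
```

The model is only valid when the checkpoint cost is small relative to the MTBF; it is a sketch of the trade-off, not a replacement for the more detailed models a full treatment would cover.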

Fault Tolerance for Scalable Applications

Title Fault Tolerance for Scalable Applications
Author Bernd Bieker
Publisher Peter Lang Gmbh, Internationaler Verlag Der Wissenschaften
Pages 0
Release 2003
Genre
ISBN 9783899759006


Parallel and distributed systems make it possible to tackle «grand challenge» problems. Due to the complexity of such high performance computing systems and the long execution times of today's simulations, the probability of a failure during a program run cannot be neglected. This work considers fault tolerance, specifically user-transparent checkpointing. Analysis is performed using simulations, and real implementations are deployed to verify the results. The aim is to give a simple approximation of the overhead generated by checkpointing protocols. In addition, the work shows in which situations more complex checkpointing protocols are worthwhile compared with very simple approaches.

Transparent Fault Tolerance for Job Healing in HPC Environments

Title Transparent Fault Tolerance for Job Healing in HPC Environments
Author
Publisher
Pages
Release 2004
Genre
ISBN


As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace, causing losses in intermediate results of HPC jobs. Furthermore, storage systems providing job input data have been shown to consistently rank as the primary source of system failures, leading to data unavailability and job resubmissions. This dissertation presents a combination of multiple fault tolerance (FT) techniques that realize significant advances in fault resilience of HPC jobs. The efforts encompass two broad areas. First, at the job level, novel, scalable mechanisms are built in support of proactive FT and to significantly enhance reactive FT. The contributions of this dissertation in this area are (1) a transparent job pause mechanism, which allows a job to pause when a process fails and prevents it from having to re-enter the job queue; (2) a proactive fault-tolerant approach that combines process-level live migration with health monitoring to complement reactive with proactive FT and to reduce the number of checkpoints when a majority of the faults can be handled proactively; (3) a novel back migration approach to eliminate load imbalance or bottlenecks caused by migrated tasks; and (4) an incremental checkpointing mechanism, which is combined with full checkpoints to explore the potential of reducing the overhead of checkpointing by performing fewer full checkpoints interspersed with multiple smaller incremental checkpoints. Second, for the job input data, transparent techniques are provided to improve the reliability, availability and performance of HPC I/O systems. In this area, the dissertation contributes (1) a mechanism for offline job input data reconstruction to ensure availability of job input data and to improve center-wide performance at no cost to job owners; (2) an approach to automatically recover job input data at run-time during failures by recovering staged data from an original source; and (3) "just in time" replication.
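The idea of interspersing full checkpoints with smaller incremental ones can be sketched with a toy in-memory model. This is not the dissertation's mechanism, which operates at the process level; the class name, the page dictionary, and the hash-based change detection are all assumptions made for illustration.

```python
import hashlib


def _digest(data: bytes) -> str:
    """Content hash used to detect pages that changed since the last checkpoint."""
    return hashlib.sha256(data).hexdigest()


class IncrementalCheckpointer:
    """Toy model: every `full_every`-th checkpoint is full, the rest are incremental.

    Assumption: pages are only modified or added, never deleted.
    """

    def __init__(self, full_every: int = 4):
        self.full_every = full_every
        self.log = []        # list of (kind, saved_pages) in checkpoint order
        self._hashes = {}    # page id -> digest at the previous checkpoint
        self._count = 0

    def checkpoint(self, pages: dict) -> str:
        is_full = self._count % self.full_every == 0
        if is_full:
            saved = dict(pages)  # save every page
        else:
            # save only pages whose content differs from the previous checkpoint
            saved = {k: v for k, v in pages.items()
                     if self._hashes.get(k) != _digest(v)}
        self._hashes = {k: _digest(v) for k, v in pages.items()}
        kind = "full" if is_full else "incr"
        self.log.append((kind, saved))
        self._count += 1
        return kind

    def restore(self) -> dict:
        """Replay the latest full checkpoint and every incremental one after it."""
        if not self.log:
            raise RuntimeError("no checkpoint has been taken")
        start = max(i for i, (kind, _) in enumerate(self.log) if kind == "full")
        state = {}
        for _, saved in self.log[start:]:
            state.update(saved)
        return state
```

The trade-off this exposes is the one the abstract names: incremental checkpoints write less data, but a restore has to replay every incremental checkpoint taken since the last full one.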

Scalable Techniques for Fault Tolerant High Performance Computing

Title Scalable Techniques for Fault Tolerant High Performance Computing
Author
Publisher
Pages 174
Release 2006
Genre
ISBN


As the number of processors in today's parallel systems continues to grow, the mean-time-to-failure of these systems is becoming significantly shorter than the execution time of many parallel applications. It is increasingly important for large parallel applications to be able to continue to execute in spite of the failure of some components in the system. Today's long running scientific applications typically tolerate failures by checkpoint/restart in which all process states of an application are saved into stable storage periodically. However, as the number of processors in a system increases, the amount of data that need to be saved into stable storage increases linearly. Therefore, the classical checkpoint/restart approach has a potential scalability problem for large parallel systems. In this research, we explore scalable techniques to tolerate a small number of process failures in large scale parallel computing. The goal of this research is to develop scalable fault tolerance techniques to help to make future high performance computing applications self-adaptive and fault survivable. The fundamental challenge in this research is scalability. To approach this challenge, this research (1) extended existing diskless checkpointing techniques to enable them to better scale in large scale high performance computing systems; (2) designed checkpoint-free fault tolerance techniques for linear algebra computations to survive process failures without checkpoint or rollback recovery; (3) developed coding approaches and novel erasure correcting codes to help applications to survive multiple simultaneous process failures. The fault tolerance schemes we introduce in this dissertation are scalable in the sense that the overhead to tolerate a failure of a fixed number of processes does not increase as the number of total processes in a parallel system increases. Two prototype examples have been developed to demonstrate the effectiveness of our techniques. 
In the first example, we developed a fault survivable conjugate gradient solver that is able to survive multiple simultaneous process failures with negligible overhead. In the second example, we incorporated our checkpoint-free fault tolerance technique into the ScaLAPACK/PBLAS matrix-matrix multiplication code to evaluate the overhead, survivability, and scalability. Theoretical analysis indicates that, to survive a fixed number of process failures, the fault tolerance overhead (without recovery) for matrix-matrix multiplication decreases to zero as the total number of processes (assuming a fixed amount of data per process) increases to infinity. Experimental results demonstrate that the checkpoint-free fault tolerance technique introduces surprisingly low overhead even when the total number of processes used in the application is small.
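One classical way to make matrix-matrix multiplication survive a lost block without checkpoints is checksum encoding in the style of algorithm-based fault tolerance. The single-process sketch below shows only the checksum invariant; that the dissertation's ScaLAPACK/PBLAS scheme uses exactly this encoding is an assumption, and all function names are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def with_checksum_row(a):
    """Append a row that holds the sum of each column of `a`."""
    return a + [[sum(col) for col in zip(*a)]]


def with_checksum_col(b):
    """Append a column that holds the sum of each row of `b`."""
    return [row + [sum(row)] for row in b]


def recover_row(c_full, lost):
    """Rebuild one lost data row of the product from the checksum row.

    The last row of `c_full` is the checksum row, so the lost row equals
    the checksum row minus the sum of the surviving data rows.
    """
    n = len(c_full) - 1
    survivors = [c_full[i] for i in range(n) if i != lost]
    return [c_full[n][j] - sum(r[j] for r in survivors)
            for j in range(len(c_full[n]))]


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    # The product of the encoded inputs carries checksums of the true product.
    c_full = matmul(with_checksum_row(a), with_checksum_col(b))
    expected = list(c_full[1])
    c_full[1] = None                     # simulate losing the process that held row 1
    c_full[1] = recover_row(c_full, 1)   # rebuild it from the checksum row
    print(c_full[1] == expected)
```

Because the checksum relationship holds for the product itself, the encoding is maintained by the computation and no periodic checkpoint is needed; a single checksum row tolerates the loss of one data row, and more failures would need additional, independent checksums.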

Transparent Fault Tolerance for Job Healing in HPC Environments

Title Transparent Fault Tolerance for Job Healing in HPC Environments
Author Chao Wang
Publisher
Pages 145
Release 2009
Genre
ISBN


Keywords: job input data, fault tolerance, high-performance computing, fault resilience, checkpoint/restart.

A Proactive Fault Tolerance Framework for High Performance Computing (HPC) Systems in the Cloud

Title A Proactive Fault Tolerance Framework for High Performance Computing (HPC) Systems in the Cloud
Author Ifeanyi Paulinus Egwutuoha
Publisher
Pages
Release 2014
Genre Cloud computing
ISBN
