Fault-Tolerance Techniques for High-Performance Computing

Title Fault-Tolerance Techniques for High-Performance Computing
Author Thomas Herault
Publisher Springer
Pages 325
Release 2015-07-01
Genre Computers
ISBN 3319209434

This timely text presents a comprehensive overview of fault-tolerance techniques for high-performance computing (HPC). It opens with a detailed introduction to checkpoint protocols and scheduling algorithms, prediction, replication, and silent error detection and correction, together with application-specific techniques such as algorithm-based fault tolerance (ABFT). Emphasis is placed on analytical performance models. A review of general-purpose techniques follows, including several checkpoint and rollback recovery protocols. Relevant execution scenarios are also evaluated and compared through quantitative models. Features: provides a survey of resilience methods and performance models; examines the various sources of errors and faults in large-scale systems; reviews the spectrum of techniques that can be applied to design a fault-tolerant MPI; investigates different approaches to replication; discusses the energy consumption of fault-tolerance methods in extreme-scale systems.

New Software-based Fault Tolerance Methods for High Performance Computing

Title New Software-based Fault Tolerance Methods for High Performance Computing
Author Robert D. Hunt
Release 2015

Software-Implemented Hardware Fault Tolerance

Title Software-Implemented Hardware Fault Tolerance
Author Olga Goloubeva
Publisher Springer Science & Business Media
Pages 238
Release 2006-09-19
Genre Technology & Engineering
ISBN 0387329374

This book presents the theory behind software-implemented hardware fault tolerance, as well as the practical aspects needed to put it to work on real examples. By accurately evaluating the advantages and disadvantages of existing approaches, the book provides a guide for developers who want to adopt software-implemented hardware fault tolerance in their applications. It also identifies open issues for researchers who want to improve the existing techniques.

Fault Tolerance

Title Fault Tolerance
Author Peter A. Lee
Publisher Springer Science & Business Media
Pages 326
Release 2012-12-06
Genre Computers
ISBN 370918990X

The production of a new version of any book is a daunting task, as many authors will recognise. In the field of computer science, the task is made even more daunting by the speed with which the subject and its supporting technology move forward. Since the publication of the first edition of this book in 1981 much research has been conducted, and many papers have been written, on the subject of fault tolerance. Our aim then was to present for the first time the principles of fault tolerance together with current practice to illustrate those principles. We believe that the principles have (so far) stood the test of time and are as appropriate today as they were in 1981. Much work on the practical applications of fault tolerance has been undertaken, and techniques have been developed for ever more complex situations, such as those required for distributed systems. Nevertheless, the basic principles remain the same.

Software Fault Tolerance Techniques and Implementation

Title Software Fault Tolerance Techniques and Implementation
Author Laura L. Pullum
Publisher Artech House
Pages 358
Release 2001
Genre Computers
ISBN 1580531377

Look to this innovative resource for the most comprehensive coverage of software fault tolerance techniques available in a single volume. It offers you a thorough understanding of critical software fault tolerance techniques and guides you through their design, operation and performance. You get an in-depth discussion of the advantages and disadvantages of specific techniques, so you can decide which ones are best suited to your work.

Fault Tolerance for Iterative Methods in High-performance Computing

Title Fault Tolerance for Iterative Methods in High-performance Computing
Author Dingwen Tao
Pages 154
Release 2018
Genre Cellular automata
ISBN 9780438429512

Iterative methods are commonly used to solve large, sparse linear systems, a fundamental operation in many modern scientific simulations. When large-scale iterative methods run in parallel across a large number of ranks, they are expected to be more susceptible both to soft errors in logic circuits and memory subsystems and to fail-stop errors in the system as a whole, given the large component counts and lower power margins of emerging high-performance computing (HPC) platforms.

Software Performability: From Concepts to Applications

Title Software Performability: From Concepts to Applications
Author Ann T. Tai
Publisher Springer Science & Business Media
Pages 207
Release 2012-12-06
Genre Computers
ISBN 1461313252

Computers are currently used in a variety of critical applications, including systems for nuclear reactor control, flight control (both aircraft and spacecraft), and air traffic control. Moreover, experience has shown that the dependability of such systems is particularly sensitive to that of their software components, both the system software of the embedded computers and the application software they support. Software Performability: From Concepts to Applications addresses the construction and solution of analytic performability models for critical-application software. The book includes a review of general performability concepts along with notions that are specific to software performability. Since fault tolerance is widely recognized as a viable means of improving the dependability of computer systems (beyond what can be achieved by fault prevention), the examples considered are fault-tolerant software systems that incorporate particular methods of design diversity and fault recovery. Software Performability: From Concepts to Applications will be of direct benefit to both practitioners and researchers in the areas of performance and dependability evaluation, fault-tolerant computing, and dependable systems for critical applications. For practitioners, it supplies a basis for defining combined performance-dependability criteria (in the form of objective functions) that can be used to enhance the performability (performance/dependability) of existing software designs. For those with research interests in model-based evaluation, the book provides an analytic framework and a variety of performability modeling examples in an application context of recognized importance. The material contained in this book will both stimulate future research on related topics and, for teaching purposes, serve as a reference text in courses on computer system evaluation, fault-tolerant computing, and dependable high-performance computer systems.