Towards a Fault-aware Computing Environment

Authors: X.-H. Sun, Z. Lan, Y. Li, H. Jin, Z. Zheng

Date: March 2008

Venue: The High Availability and Performance Computing Workshop (HAPCW)

Type: Workshop

Abstract

In this paper, we present the design and initial development of the Fault awareness Enabled Computing Environment (FENCE) system for high-end computing. FENCE is a comprehensive fault management system: it performs both post and runtime analysis, integrates proactive and reactive mechanisms, and combines application-level and system-level fault management. Component-based systems have also been developed to support the comprehensive FENCE design. Preliminary implementation results are presented.