Towards a Fault-aware Computing Environment
Authors: X.-H. Sun, Z. Lan, Y. Li, H. Jin, Z. Zheng
Date: March 2008
Venue: The High Availability and Performance Computing Workshop (HAPCW)
Type: Workshop
Abstract
In this paper, we present the design and initial development of the Fault awareness Enabled Computing Environment (FENCE) system for high-end computing. FENCE is a comprehensive fault management system in that it includes both post and runtime analysis, integrates proactive and reactive mechanisms, and combines application-level and system-level fault management. Component-based systems are also developed to support the comprehensive FENCE design. Preliminary implementation results are presented.