Reducing Fragmentation on Torus-Connected Supercomputers

Authors: W. Tang, Z. Lan, N. Desai, D. Buettner, Y. Yu

Date: May, 2011

Venue: The IEEE International Parallel and Distributed Processing Symposium (IPDPS'11), Anchorage, AK, USA

Type: Conference

Abstract

Torus-based networks are prevalent on leadership-class petascale systems, providing a good balance between network cost and performance. The major disadvantage of this network architecture is its susceptibility to fragmentation. Many studies have attempted to reduce resource fragmentation in this architecture. Although the proposed approaches can make good allocation decisions that reduce fragmentation at job start time, none of them considers a job's walltime. Ignoring walltime can cause resource fragmentation when neighboring jobs do not finish at around the same time. In this paper, we propose a walltime-aware job allocation strategy that packs jobs finishing around the same time next to each other, in order to minimize the resource fragmentation caused by discrepancies in job length. Event-driven simulations using real job traces from a production Blue Gene/P system at Argonne National Laboratory demonstrate that our walltime-aware strategy can effectively reduce system fragmentation and improve overall system performance.
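
The idea of packing jobs with similar finish times next to each other can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes a one-dimensional line of nodes instead of a torus partition, and the names `free_windows`, `neighbor_gap`, and `place_job` are hypothetical. Each node stores the expected end time of the job running on it, or `None` if it is free. A new job goes into the free window whose busy neighbor finishes closest to the job's own expected end time.

```python
from typing import List, Optional

Nodes = List[Optional[float]]  # expected end time per node, None if free


def free_windows(nodes: Nodes, size: int) -> List[int]:
    """Return the start index of every run of `size` contiguous free nodes."""
    return [
        start
        for start in range(len(nodes) - size + 1)
        if all(nodes[i] is None for i in range(start, start + size))
    ]


def neighbor_gap(nodes: Nodes, start: int, size: int, job_end: float) -> float:
    """Smallest |neighbor end time - job end time| over the busy neighbors.

    Windows with no busy neighbor score infinity, so windows next to a
    job with a similar finish time are preferred (an assumption of this
    sketch, not a rule taken from the paper).
    """
    gaps = [
        abs(nodes[idx] - job_end)
        for idx in (start - 1, start + size)
        if 0 <= idx < len(nodes) and nodes[idx] is not None
    ]
    return min(gaps) if gaps else float("inf")


def place_job(nodes: Nodes, size: int, job_end: float) -> Optional[int]:
    """Place a job in the best-scoring free window; return its start index.

    Returns None when no contiguous window of `size` free nodes exists.
    Ties are broken in favor of the lowest start index.
    """
    candidates = free_windows(nodes, size)
    if not candidates:
        return None
    best = min(candidates, key=lambda s: neighbor_gap(nodes, s, size, job_end))
    for i in range(best, best + size):
        nodes[i] = job_end
    return best


# Eight nodes: two short jobs on the left, two long jobs on the right.
nodes: Nodes = [100.0, 100.0, None, None, None, None, 500.0, 500.0]
# A job expected to end at t=480 is placed beside the jobs ending at t=500.
print(place_job(nodes, 2, 480.0))
```

In this toy layout, the long-running job lands beside the other long-running jobs, so the nodes on the left can later be released together as one larger free block. A real scheduler would also have to respect the machine's partition shapes and wiring constraints, which this sketch ignores.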