Experience and Practice of Batch Scheduling on Leadership Supercomputers at Argonne

Authors: W. Allcock, P. Rich, Y. Fan, Z. Lan

Date: May 2017

Venue: The 21st Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP), Vancouver, Canada, 2017, pp. 1-24

Type: Workshop

Abstract: The mission of the DOE Argonne Leadership Computing Facility (ALCF) is to accelerate major scientific discoveries and engineering breakthroughs for humanity by designing and providing world-leading computing facilities in partnership with the computational science community. The ALCF operates supercomputers that are generally among the Top 5 fastest machines in the world. Specifically, ALCF is looking for science that is either too big to run anywhere else or would take so long elsewhere as to be impractical (i.e., "capability jobs"). At ALCF, batch scheduling plays a critical role in achieving a set of site goals within a set of constraints. While system utilization is an important goal at ALCF, its largest mission constraint is to enable extreme-scale parallel jobs to take precedence. In this paper, we describe the specific scheduling goals and constraints, analyze the workload traces collected in 2013-2017 from the 48-rack petascale supercomputer Mira, and discuss the upcoming scheduling challenges at ALCF.