Improving Batch Scheduling on Blue Gene/Q by Relaxing 5D Torus Network Allocation Constraints

Authors: Z. Zhou, X. Yang, Z. Lan, P. Rich, W. Tang, V. Morozov, N. Desai

Date: November 2016

Venue: IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 27, no. 11, pp. 3269-3282

Type: Journal

Abstract

Parallel file systems (PFSs) are widely used to ease the I/O bottleneck of modern high-performance computing systems. However, PFSs do not work well for small requests, especially small random requests. Newer solid-state drives (SSDs) perform very well on small random accesses but carry a high monetary cost. In this study, we propose SLA-Cache, a Selective and Layout-Aware Cache system that employs a small set of SSD-based file servers as a cache for conventional HDD-based file servers. SLA-Cache uses a novel scheme to identify performance-critical data and applies a selective cache admission (SCA) policy to make full use of the SSD-based file servers. Moreover, because the data layout of the cache system can also strongly influence access performance, SLA-Cache applies a layout-aware cache placement (LCP) scheme to store data on the SSD-based file servers. By storing data with the layout that has the lowest access cost among three typical layout candidates, LCP further improves system performance. We have implemented SLA-Cache under the MPICH2 I/O library. Experimental results show that SLA-Cache can significantly improve I/O throughput and is a promising approach for parallel applications.