An Evaluation of DAOS for Simulation and Deep Learning HPC Workloads

Authors: L. Logan, J. Lofstead, A. Kougkas, X.-H. Sun

Date: June, 2024

Venue: SIGOPS Operating Systems Review (OSR'24)

Type: Journal

Abstract

Traditionally, distributed storage systems have relied upon the interfaces provided by OS kernels to interact with storage hardware. However, much research has shown that OSes impose serious overheads on every I/O operation, especially on high-performance storage and networking hardware (e.g., PMEM and 200GbE). Thus, distributed storage stacks are being re-designed to take advantage of this modern hardware by utilizing new hardware interfaces which bypass the kernel entirely. However, the impact of these optimizations has not been well studied for real HPC workloads on real hardware. In this work, we provide a comprehensive evaluation of DAOS: a state-of-the-art distributed storage system which re-architects the storage stack from scratch for modern hardware. We compare DAOS against traditional storage stacks and demonstrate that by utilizing optimal interfaces to hardware, performance improvements of up to 6x can be observed in real scientific applications.

Tags

Distributed Computing, Distributed Storage, Flash Memory, Machine Learning, Parallel Computing, Phase Change Memory