DTIO: Data Stack for AI-driven Workflows
Authors: K. Bateman, N. Rajesh, J. Cernuda, L. Logan, B. Nicolae, F. Cappello, X.-H. Sun, A. Kougkas
Date: June 2025
Venue: The 37th International Conference on Scalable Scientific Data Management (SSDBM 2025)
Type: Conference
Abstract
HPC, Big Data Analytics, and Machine Learning have become increasingly intertwined as popular models such as LLMs and Diffusion Models have been driving discovery in scientific fields. However, each of these domains has its own storage infrastructure with unique I/O interfaces and storage systems, requiring feature sets that are often incompatible. Users with experience in one domain lack the expertise to adapt their applications to the data stacks of the other domains, necessitating expensive conversions. There is a need for a transparent solution that unifies these disparate data stacks for the triple convergence of HPC, Big Data, and ML while providing the required functionality and achieving higher performance. To better support converged HPC, Big Data, and ML workflows, this paper proposes DTIO, a scalable I/O runtime that unifies the disparate I/O stack for modern scientific ML workflows. DTIO utilizes a unique DataTask abstraction to express the movement of data, its ordering, and its dependencies on other data as a task. DTIO achieves a unification of scientific and ML workflows by intelligently mapping interfaces and automatically determining the best method to relate their unique semantics. DTIO's online translation with DataTask caching can improve performance by 49.6% compared to offline translation methods. DTIO also offers numerous optimizations, such as asynchronous I/O and aggregation.
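
To make the DataTask idea concrete, the sketch below shows what a task-based expression of data movement, ordering, and dependencies might look like. This is a minimal illustrative example in C++; the type names, fields, and the submit() stub are assumptions for exposition and do not reflect DTIO's actual API.

    // Illustrative sketch of a DataTask-style abstraction (hypothetical API,
    // not DTIO's actual interface): a task captures a data operation, its
    // ordering, and its dependencies on other tasks.
    #include <cstddef>
    #include <cstdint>
    #include <iostream>
    #include <string>
    #include <vector>

    enum class OpType { Write, Read };

    struct DataTask {
        uint64_t id;                       // unique task identifier
        OpType op;                         // data movement operation
        std::string target;                // destination/source object or file
        size_t offset;                     // byte offset within the target
        size_t size;                       // number of bytes to move
        std::vector<uint64_t> depends_on;  // ids of tasks that must finish first
    };

    // Hypothetical runtime stub: enqueue a task so it can run asynchronously
    // once its dependencies are satisfied.
    void submit(const DataTask& t) {
        std::cout << "queued task " << t.id << " (" << t.size << " bytes -> "
                  << t.target << "), deps: " << t.depends_on.size() << "\n";
    }

    int main() {
        // A write followed by a dependent read, expressed as ordered DataTasks.
        DataTask produce{1, OpType::Write, "checkpoint.h5", 0, 4096, {}};
        DataTask consume{2, OpType::Read,  "checkpoint.h5", 0, 4096, {produce.id}};
        submit(produce);
        submit(consume);
        return 0;
    }

Expressing I/O as tasks with explicit dependencies is what would let a runtime reorder, aggregate, or cache operations safely, which is the kind of flexibility the abstract attributes to DTIO's online translation and asynchronous optimizations.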