WisIO: Automated I/O Bottleneck Detection via Multi-Perspective Views for HPC Workloads

GRC-LED

Highly data-dependent HPC workloads put pressure on storage systems. As storage in modern HPC systems grows more diverse, the added complexity challenges users and leads to I/O bottlenecks. Earlier solutions combined I/O characteristics with expert insight, while recent approaches rely on performance analysis tools. However, the problem is multifaceted and involves numerous metrics, so resolving it manually remains challenging, even for experts.

To this end, we introduce WisIO, an automated I/O bottleneck detection tool for HPC workloads. The key contributions of this work are:

  • Design of WisIO for automating I/O bottleneck detection
  • A novel approach to reducing the search space within I/O traces
  • A global and extensible rule engine for bottleneck detection rules

Motivation

Running out-of-core I/O analysis queries in a memory-limited environment requires optimizing both the queries and the dataset for distributed analysis.

Methodology

  • Execute HPC workloads, capture I/O traces via Darshan or Recorder
  • Convert and optimize I/O traces to Parquet format
  • Generate a multi-perspective view with I/O characteristics
  • Compute bottleneck severity scores and produce user-friendly text-based diagnoses through a rule-based engine
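
The last two steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not WisIO's implementation: the record fields (`proc`, `file`, `time_bin`), the `small_access_rule` rule, its 64 KiB threshold, and the scoring formula are all assumptions made for the example. It aggregates the same trace records once per perspective and then applies each registered rule to every group.

```python
from collections import defaultdict

# Hypothetical trace records; the field names are assumptions, not WisIO's schema.
records = [
    {"proc": "node1#10", "file": "/data/a", "time_bin": 0, "size": 4096, "duration": 0.004},
    {"proc": "node1#10", "file": "/data/a", "time_bin": 1, "size": 1024, "duration": 0.010},
    {"proc": "node2#11", "file": "/data/b", "time_bin": 0, "size": 1_048_576, "duration": 0.002},
]


def build_view(records, perspective):
    """Aggregate I/O time, bytes, and operation count per value of one perspective."""
    view = defaultdict(lambda: {"io_time": 0.0, "bytes": 0, "ops": 0})
    for rec in records:
        entry = view[rec[perspective]]
        entry["io_time"] += rec["duration"]
        entry["bytes"] += rec["size"]
        entry["ops"] += 1
    return dict(view)


def small_access_rule(metrics, threshold_bytes=64 * 1024):
    """Illustrative rule: flag groups whose mean access size is below a threshold."""
    mean_size = metrics["bytes"] / metrics["ops"]
    if mean_size >= threshold_bytes:
        return None
    # Made-up score: approaches 1.0 as the mean access size shrinks toward zero.
    score = 1.0 - mean_size / threshold_bytes
    return {"rule": "small_access", "score": round(score, 3)}


# Extending the engine means appending another callable to this list.
RULES = [small_access_rule]


def diagnose(records, perspectives=("proc", "file", "time_bin")):
    """Apply every rule to every group of every perspective."""
    diagnoses = []
    for perspective in perspectives:
        for key, metrics in build_view(records, perspective).items():
            for rule in RULES:
                result = rule(metrics)
                if result is not None:
                    diagnoses.append({"perspective": perspective, "key": key, **result})
    return diagnoses


for diagnosis in diagnose(records):
    print(diagnosis)
```

Keeping rules as plain callables over aggregated metrics means the same rule can be evaluated from the process, file, and time perspectives without modification.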

Use Case: Montage (Workflow with Complex Dependencies)

  • For efficiency, the process IDs are hashed with respect to their node addresses and hostnames
  • Allows us to analyze groups of process IDs effectively
  • 1280 ranks perform 3.2M read and 1.6M write operations
  • Average bandwidth is low (~3 MB/s)
  • Multi-perspective analysis takes 42 seconds
  • Process-based analysis produces diagnoses with specific nodes or apps
  • 14 diagnoses produced with severity scores between 52-61.4%
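
As a rough illustration of the process ID hashing mentioned in this use case, the sketch below maps each (hostname, PID) pair to a fixed-width integer key and groups those keys by a node-level key. The function names, the `hostname#pid` encoding, and the choice of BLAKE2b with a 64-bit digest are assumptions for the example; the source does not specify the hash WisIO uses.

```python
import hashlib
from collections import defaultdict


def _hash64(text: str) -> int:
    """Stable 64-bit integer hash of a string (unlike built-in hash(), not salted per run)."""
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def node_key(hostname: str) -> int:
    """Integer key identifying a node."""
    return _hash64(hostname)


def process_key(hostname: str, pid: int) -> int:
    """Integer key identifying a process on a specific node.

    The hostname is part of the hashed text, so the same PID on two
    different nodes yields two different keys.
    """
    return _hash64(f"{hostname}#{pid}")


def group_by_node(processes):
    """Group process keys under the key of the node they run on."""
    groups = defaultdict(list)
    for hostname, pid in processes:
        groups[node_key(hostname)].append(process_key(hostname, pid))
    return dict(groups)


groups = group_by_node([("node01", 1), ("node01", 2), ("node02", 1)])
print({node: len(procs) for node, procs in groups.items()})
```

Fixed-width integer keys are cheaper to compare and group than hostname strings, which is one plausible reason for hashing before analysis.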

Use Case: CM1 (Simulation with Separate I/O Phases)

  • Timestamps are converted into integer microseconds, as indexing in Dask works faster with integer values than with decimal values
  • For precision, the midpoint of the two timestamps (start and end) is used in the analysis instead of their range
  • 94.6% of I/O time is spent during the first 20 seconds
  • Rank 0 performs 100% of the writes on small files
  • Multi-perspective analysis takes 22 seconds
  • Time-based analysis produces diagnoses with specific time ranges
  • Above is an actual text-based diagnosis for CM1
  • 20 diagnoses produced with severity scores between 55-85%
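
The two timestamp steps described in this use case can be sketched as follows. This is an assumption-laden illustration: the function names and the one-second bin width are hypothetical, and the input is assumed to be floating-point seconds.

```python
def to_microseconds(seconds: float) -> int:
    """Convert a floating-point timestamp in seconds to integer microseconds."""
    return round(seconds * 1_000_000)


def midpoint_us(start_s: float, end_s: float) -> int:
    """Represent an operation by the midpoint of its start and end timestamps."""
    start_us = to_microseconds(start_s)
    end_us = to_microseconds(end_s)
    return start_us + (end_us - start_us) // 2


def time_bin(start_s: float, end_s: float, bin_width_us: int = 1_000_000) -> int:
    """Assign an operation to exactly one time bin, using its midpoint."""
    return midpoint_us(start_s, end_s) // bin_width_us


# An operation spanning 0.9 s to 1.3 s has its midpoint at 1.1 s.
print(midpoint_us(0.9, 1.3), time_bin(0.9, 1.3))
```

Using a single midpoint places each operation in exactly one bin, whereas a range could straddle a bin boundary and would have to be split or counted twice.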

Use Case: 1000 Genomes (Data-Intensive Workflow)

  • The filenames are hashed with respect to folder hierarchy
  • Allows us to analyze file directories effectively
  • There are 21M files, all of which are accessed on a file-per-process basis
  • Multi-perspective analysis takes 12 minutes
  • File-based analysis produces diagnoses with specific folder hierarchy
  • 30 diagnoses produced with severity scores between 54-67.1%
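
One way to picture hashing filenames with respect to the folder hierarchy is to compute one key per ancestor directory, so that files can be grouped at any depth by comparing integers. The sketch below is hypothetical: the function names, the BLAKE2b hash, and the example paths are assumptions, not details taken from WisIO.

```python
import hashlib
from collections import Counter
from pathlib import PurePosixPath
from typing import List


def _hash64(text: str) -> int:
    """Stable 64-bit integer hash of a string."""
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def hierarchy_keys(path: str) -> List[int]:
    """Return one key per ancestor directory (root first), followed by the file's own key."""
    p = PurePosixPath(path)
    ancestors = list(p.parents)[::-1]  # p.parents lists the closest directory first
    return [_hash64(str(d)) for d in ancestors] + [_hash64(str(p))]


def directory_key(path: str, depth: int) -> int:
    """Key of the ancestor directory at the given depth (0 is the root)."""
    keys = hierarchy_keys(path)
    # keys[:-1] are directories; clamp so shallow paths fall back to their deepest directory.
    return keys[min(depth, len(keys) - 2)]


# Hypothetical paths used only for illustration.
paths = ["/data/chr1/a.vcf", "/data/chr1/b.vcf", "/data/chr2/a.vcf"]
files_per_dir = Counter(directory_key(p, depth=2) for p in paths)
print(sorted(files_per_dir.values()))
```

Files in the same directory share every directory-level key, so per-directory aggregation does not require string prefix matching over millions of paths.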

Publications

  • I. Yildirim, H. Devarajan, A. Kougkas, X.-H. Sun, K. Mohror. "IOMax: Maximizing Out-of-Core I/O Analysis Performance on HPC Systems." The 8th International Parallel Data Systems Workshop (PDSW'23), November 12, 2023. Type: Workshop. Date: November 2023.
  • I. Yildirim, H. Devarajan, A. Kougkas, X.-H. Sun, K. Mohror. "Exploring the Impacts of Multiple I/O Metrics in Identifying I/O Bottlenecks." The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'23), November 12-17, 2023. Type: Poster. Date: November 2023.
  • I. Yildirim, H. Devarajan, A. Kougkas, X.-H. Sun, K. Mohror. "A Multifaceted Approach to Automated I/O Bottleneck Detection for HPC Workloads." The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'22). Type: Poster. Date: November 2022.

Key Takeaways

  • WisIO automates I/O bottleneck detection for HPC workloads
  • WisIO's novel search space reduction approach enables I/O analysis for large-scale I/O traces
  • WisIO's multi-perspective views can detect I/O bottlenecks that might be otherwise overlooked
  • WisIO's extensible rule engine allows users to define custom rules