
ChronoLog: A High-Performance Storage Infrastructure for Activity and Log Workloads


HPC applications generate more data than storage systems can handle, and it is becoming increasingly important to store activity (log) data generated by people and applications. ChronoLog is a hierarchical, distributed log store that leverages physical time to achieve log ordering and reduce contention while utilizing storage tiers to elastically scale the log capacity.

Background

  • Modern application domains in science and engineering, from astrophysics to web services and financial computations, generate massive amounts of data at unprecedented rates (reaching up to 7 TB/s).

  • The promise of future data utility and the low cost of data storage enabled by recent hardware innovations (less than $0.02/GB) are driving this data explosion, resulting in widespread data hoarding by researchers and engineers.

This trend stresses existing storage systems past their capabilities and exceeds the capacity of even the largest computing systems. At the same time, it is becoming increasingly important to store and process activity (log) data generated by people and applications across many domains, including:

  • Scientific Applications
  • Internet Companies
  • Financial Applications
  • Microservices and Containers
  • IoT
  • Task-based Computing

Distributed log stores require:

  • Total Ordering
  • High Concurrency and Parallelism
  • Capacity Scaling

Synopsis

This project will design and implement ChronoLog, a distributed and tiered shared log storage ecosystem. ChronoLog uses physical time to distribute log entries while providing total log ordering. It also utilizes multiple storage tiers to elastically scale the log capacity (i.e., auto-tiering). ChronoLog will serve as a foundation for developing scalable new plugins, including a SQL-like query engine for log data, a streaming processor leveraging the time-based data distribution, a log-based key-value store, and a log-based TensorFlow module.
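
As a rough illustration of this idea, the sketch below (hypothetical names and a simplified placement policy, not ChronoLog's actual implementation) shows how a physical timestamp can simultaneously define an event's position in the total order and select the node that stores it, removing the need for a centralized sequencer:

```cpp
// Minimal sketch of time-based ordering and distribution (hypothetical, illustration only).
// Assumption: server clocks are kept closely synchronized by a global clock service.
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <string>

struct Event {
    uint64_t timestamp_ns;   // physical time; also the event's position in the total order
    std::string payload;
};

// Deterministic mapping from time to a storage node: events stamped within the same
// time interval land on the same node, so no centralized sequencer is consulted.
std::size_t NodeForTimestamp(uint64_t timestamp_ns, std::size_t num_nodes,
                             uint64_t interval_ns = 1'000'000 /* 1 ms */) {
    return static_cast<std::size_t>((timestamp_ns / interval_ns) % num_nodes);
}

// Stamp an event with the current physical time at record() time.
Event MakeEvent(std::string payload) {
    auto now = std::chrono::system_clock::now().time_since_epoch();
    auto ts = std::chrono::duration_cast<std::chrono::nanoseconds>(now).count();
    return Event{static_cast<uint64_t>(ts), std::move(payload)};
}
```

Because an append only needs a local clock read and a deterministic mapping, many writers can record events in parallel without contending on a shared log tail.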

Workloads & Applications

Modern applications, spanning from edge devices to High Performance Computing (HPC) systems, produce and process log data and exhibit a wide range of workload characteristics that rely on a common storage model:

The Distributed Shared Log:

[Figure: the distributed shared log model]

Challenges

  • Ensuring total ordering of log data in a distributed environment is expensive
    • Single point of contention (tail of the log)
    • High cost of synchronization
    • Centralized sequencers
  • Efficiently scaling log capacity
    • Time- or space-based data retention policies
    • Add servers offline and rebalance cluster
  • Highly concurrent log operations by multiple clients
    • Single-writer-multiple-readers (SWMR) data access model
    • Limited operation concurrency
  • I/O parallelism
    • Application-centric implicit parallel model (e.g., consumer groups)
  • Partial data retrieval (log querying)
    • Expensive auxiliary indices
    • Metadata look-ups
    • Client-side epochs
| Feature | Bookkeeper, Kafka/DLog | Corfu, SloG/ZLog | ChronoLog |
| --- | --- | --- | --- |
| Locating the log tail | MDM lookup (locking) | Sequencer (locking) | MDM lookup (lock-free) |
| I/O isolation | Yes | No | Yes |
| I/O parallelism (readers-to-servers) | 1-to-1 | 1-to-N | M-to-N (always) |
| Storage elasticity (scaling capacity) | Only horizontal | No | Vertical and horizontal |
| Log hot zones | Yes (active ledger) | No | No |
| Log capacity | Data retention | Limited by # of SSDs | Infinite (auto-tiering) |
| Operation parallelism | Only read (implicit) | Write/read | Write/read |
| Granularity of data distribution | Closed ledgers (log partition) | SSD page (set of entries) | Event (per entry) |
| Log total ordering | No (only on partitions) | Yes (eventually) | Yes |
| Log entry visibility | Immediate | End of epoch | Immediate |
| Storage overhead per entry | Yes (2x) | No | No |
| Tiered storage | No | No | Yes |

Key Insights & Project Impact

Research Contributions

  • How physical time can be used to distribute and order log data without explicit synchronization or centralized sequencing, offering a high-performance and scalable shared log store with support for efficient range data retrieval.
  • How multi-tiered storage can be used to scale the capacity of a log and offer tunable data access parallelism while maintaining I/O isolation and a highly concurrent access model supporting multiple-writers-multiple-readers (MWMR).
  • How elastic storage semantics and dynamic resource provisioning can be used to achieve an efficient matching between I/O production and consumption rates of conflicting workloads under a single system.

Software Contributions

ChronoLog will create a future-proof storage infrastructure targeting large-scale systems and applications.

  • The ChronoLog Core Library
  • An in-memory lock-free distributed journal with the ability to persist data on flash storage to enable a fast distributed caching layer
  • A new high-performance data streaming service, targeted specifically (but not exclusively) at HPC systems, that uses MPI for job distribution and RPC over RDMA (or over RoCE) for communication
  • A set of high-level interfaces for I/O

[Figure: the ChronoLog software stack]

Design Requirements

Architecture

ChronoLog is a new class of distributed shared log store that will leverage physical time to achieve total ordering of log entries and will utilize multiple storage tiers to distribute a log both horizontally and vertically (a model we call 3D data distribution).
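
To make the 3D distribution model concrete, here is a small sketch (with assumed tier names and thresholds, not the actual design): the timestamp spreads entries horizontally across server nodes, while the age of the data determines its vertical placement in the storage hierarchy.

```cpp
// Sketch of 3D data distribution (assumed tier names and thresholds, illustration only):
// horizontal placement by timestamp, vertical placement by the age of the data.
#include <cstddef>
#include <cstdint>

enum class Tier { DRAM, NVMe, BurstBuffer, ParallelFS };

struct Placement {
    std::size_t node;   // horizontal: which server holds the entry
    Tier tier;          // vertical: which storage layer holds it
};

Placement Place(uint64_t event_ts_ns, uint64_t now_ns, std::size_t num_nodes) {
    constexpr uint64_t kIntervalNs = 1'000'000;  // 1 ms time interval per node
    std::size_t node = static_cast<std::size_t>((event_ts_ns / kIntervalNs) % num_nodes);

    uint64_t age_ns = now_ns - event_ts_ns;
    Tier tier = age_ns < 1'000'000'000ULL     ? Tier::DRAM          // hot: seconds old
              : age_ns < 60'000'000'000ULL    ? Tier::NVMe          // warm: minutes old
              : age_ns < 3'600'000'000'000ULL ? Tier::BurstBuffer   // cooler: within an hour
                                              : Tier::ParallelFS;   // cold: everything older
    return Placement{node, tier};
}
```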

Data Model & API

Design Details

[Figure: ChronoLog design]

  • The ChronoVisor
    • Handles client connections
    • Holds chronicle metadata information
    • Acts as the global clock enforcing time synchronization between all server nodes
    • Deployed on its own server node (usually a head node)
  • The ChronoKeeper
    • Serves all tail operations such as record() (append) and playback() (tail-read)
    • Stores incoming events in a distributed journal
    • Deployed on all or a subset of compute nodes that are equipped with fast flash storage
  • The ChronoStore
    • Manages both intermediate storage resources (e.g., burst buffers or data staging resources) and the storage servers
    • Has the ability to grow or shrink its resources offering an elastic solution that can react to the I/O traffic
    • Organized into the ChronoGrapher and ChronoPlayer, which are responsible for writes and reads, respectively
  • The ChronoGrapher
    • Continuously ingests events from the ChronoKeeper
    • Uses a real-time data streaming approach to persist events to lower tiers of the hierarchy
  • The ChronoPlayer
    • Serves historical reads in the form of replay() (catch-up read) calls
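
Putting these components together, a client interaction could look like the sketch below. The operation names record(), playback(), and replay() are taken from the description above; the ChronicleClient type, its signatures, and the in-memory stand-ins are assumptions for illustration and not the actual ChronoLog API.

```cpp
// Hypothetical client-side usage sketch; the ChronicleClient type, its signatures, and the
// in-memory stand-ins are assumptions for illustration, not the actual ChronoLog API.
#include <cstdint>
#include <string>
#include <vector>

struct Event {
    uint64_t timestamp_ns;
    std::string payload;
};

class ChronicleClient {
public:
    // record(): append to the tail of the chronicle (served by the ChronoKeeper).
    void record(const std::string& data) { journal_.push_back({next_ts_++, data}); }

    // playback(): tail read of the most recently recorded event (served by the ChronoKeeper).
    Event playback() const { return journal_.back(); }  // assumes at least one event exists

    // replay(): historical range read across tiers (served by the ChronoPlayer).
    std::vector<Event> replay(uint64_t from_ns, uint64_t to_ns) const {
        std::vector<Event> out;
        for (const auto& e : journal_)
            if (e.timestamp_ns >= from_ns && e.timestamp_ns < to_ns) out.push_back(e);
        return out;
    }

private:
    std::vector<Event> journal_;  // stand-in for the distributed journal
    uint64_t next_ts_ = 1;        // stand-in for physical-time stamping
};
```

In the real system, record() and playback() would be routed to the ChronoKeeper holding the chronicle's tail, while replay() would be answered by the ChronoPlayer from whichever tiers currently hold the requested time range.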

Features & Operations

ChronoKeeper

Log Auto-Tiering

To provide chronicle capacity scaling, ChronoLog automatically moves data down to the next larger but slower tier. To achieve this, the ChronoGrapher offers a fast, distributed data flushing solution that can match the event production rate by offering:

  • Real-time continuous data flushing
  • Tunable parallelism via resource elasticity
  • Storage device-aware random access
  • A decoupled server-pull, instead of a client-push, eviction model
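
The decoupled, server-pull eviction model can be sketched as follows (hypothetical structures, illustration only); the point is that the flusher pulls batches from the keeper's journal at its own pace instead of having clients push data down the tiers.

```cpp
// Sketch of a decoupled server-pull eviction loop (illustrative, not the real ChronoGrapher).
#include <cstddef>
#include <cstdint>
#include <deque>
#include <string>
#include <vector>

struct Event { uint64_t timestamp_ns; std::string payload; };

// Stand-in for the ChronoKeeper's in-memory journal on fast flash storage.
struct Journal {
    std::deque<Event> events;

    // The flusher pulls up to max_batch events at its own pace; clients never push
    // data to lower tiers themselves.
    std::vector<Event> PullBatch(std::size_t max_batch) {
        std::vector<Event> batch;
        while (!events.empty() && batch.size() < max_batch) {
            batch.push_back(std::move(events.front()));
            events.pop_front();
        }
        return batch;
    }
};

// One iteration of the flushing loop: drain a batch and persist it to the next tier.
void FlushOnce(Journal& journal, std::size_t max_batch,
               void (*persist_to_lower_tier)(const std::vector<Event>&)) {
    auto batch = journal.PullBatch(max_batch);
    if (!batch.empty()) persist_to_lower_tier(batch);  // e.g., burst buffer or parallel FS write
}
```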

ChronoGrapher

ChronoGrapher runs a data streaming job with three major steps represented as a DAG: event collection, story building, and story writing. Each node in the DAG is dynamic and elastic, scaling with the incoming traffic as estimated from the number of events and the total size of the data.
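
A rough sketch of those three stages is shown below; the stage names follow the description above, while the data types and the time-window grouping policy are assumptions made for illustration.

```cpp
// Illustrative three-stage pipeline mirroring the DAG described above:
// collect events -> build stories (grouped by time window) -> write stories to lower tiers.
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct Event { uint64_t timestamp_ns; std::string payload; };
struct Story { uint64_t window_start_ns = 0; std::vector<Event> events; };

// Stage 1: event collection — take the batch handed over by the ChronoKeeper.
std::vector<Event> CollectEvents(std::vector<Event> incoming) { return incoming; }

// Stage 2: story building — group events into fixed time windows (assumed grouping policy).
std::vector<Story> BuildStories(const std::vector<Event>& events,
                                uint64_t window_ns = 1'000'000'000 /* 1 s */) {
    std::map<uint64_t, Story> by_window;
    for (const auto& e : events) {
        uint64_t start = (e.timestamp_ns / window_ns) * window_ns;
        Story& story = by_window[start];
        story.window_start_ns = start;
        story.events.push_back(e);
    }
    std::vector<Story> stories;
    for (auto& kv : by_window) stories.push_back(std::move(kv.second));
    return stories;
}

// Stage 3: story writing — persist each story to the next storage tier (stubbed here).
void WriteStories(const std::vector<Story>& stories) {
    for (const auto& s : stories) { (void)s; /* write s to a burst buffer or parallel FS */ }
}
```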

Log Querying & Range Retrieval

The ChronoPlayer is responsible for executing historical read operations.

ChronoPlayer

  • Launched by the ChronoVisor at system initialization
  • Accesses data from all available tiers
  • Implemented by a data streaming approach
  • Real-time, decoupled, and elastic architecture
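
A minimal sketch of a tier-spanning replay() is shown below (assumed tier interface, for illustration only); the point is that a historical range read may need to gather and merge results from several storage tiers.

```cpp
// Illustrative tier-spanning replay() (assumed tier interface, not the real ChronoPlayer).
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

struct Event { uint64_t timestamp_ns; std::string payload; };

// Stand-in for one storage tier that can answer a time-range query.
struct TierReader {
    std::vector<Event> events;  // events currently resident in this tier
    std::vector<Event> Read(uint64_t from_ns, uint64_t to_ns) const {
        std::vector<Event> out;
        for (const auto& e : events)
            if (e.timestamp_ns >= from_ns && e.timestamp_ns < to_ns) out.push_back(e);
        return out;
    }
};

// replay(): gather matching events from every tier, then merge them back into time order.
std::vector<Event> Replay(const std::vector<TierReader>& tiers,
                          uint64_t from_ns, uint64_t to_ns) {
    std::vector<Event> result;
    for (const auto& tier : tiers) {
        auto part = tier.Read(from_ns, to_ns);
        result.insert(result.end(), part.begin(), part.end());
    }
    std::sort(result.begin(), result.end(),
              [](const Event& a, const Event& b) { return a.timestamp_ns < b.timestamp_ns; });
    return result;
}
```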

Evaluation Results

Publications

  • H. Devarajan, A. Kougkas, H. Zheng, V. Vishwanath, X.-H. Sun. "Stimulus: Accelerate Data Management for Scientific AI Applications in HPC." The 22nd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID'22), May 16-19, 2022. Conference, May 2022.
  • J. Cernuda, H. Devarajan, L. Logan, K. Bateman, N. Rajesh, J. Ye, A. Kougkas, X.-H. Sun. "HFlow: A Dynamic and Elastic Multi-Layered Data Forwarder." The 2021 IEEE International Conference on Cluster Computing (CLUSTER'21), September 7-10, 2021. Conference, September 2021.
  • N. Rajesh, H. Devarajan, J. Cernuda, K. Bateman, L. Logan, J. Ye, A. Kougkas, X.-H. Sun. "Apollo: An ML-assisted Real-Time Storage Resource Observer." The 30th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC'21), June 21-25, 2021. Conference, June 2021.
  • H. Devarajan, H. Zheng, A. Kougkas, X.-H. Sun, V. Vishwanath. "DLIO: A Data-Centric Benchmark for Scientific Deep Learning Applications." The 2021 IEEE/ACM International Symposium in Cluster, Cloud, and Internet Computing (CCGrid'21), May 17-20, 2021. Best Paper Award. Conference, May 2021.
  • H. Devarajan, A. Kougkas, X.-H. Sun. "HReplica: A Dynamic Data Replication Engine with Adaptive Compression for Multi-Tiered Storage." The 2020 IEEE International Conference on Big Data (Big Data'20), December 10-13, 2020. Conference, December 2020.
  • H. Devarajan, A. Kougkas, X.-H. Sun. "A Dynamic Multi-Tiered Storage System for Extreme Scale Computing." The International Conference for High Performance Computing, Networking, Storage and Analysis (SC'20). Poster, November 2020.
  • H. Devarajan, A. Kougkas, K. Bateman, X.-H. Sun. "HCL: Distributing Parallel Data Structures in Extreme Scales." IEEE International Conference on Cluster Computing (CLUSTER'20), September 14-17, 2020. Conference, September 2020.
  • H. Devarajan, A. Kougkas, X.-H. Sun. "HFetch: Hierarchical Data Prefetching for Scientific Workflows in Multi-Tiered Storage Environments." IEEE International Parallel and Distributed Processing Symposium (IPDPS'20), May 18-22, 2020. Conference, May 2020.
  • H. Devarajan, A. Kougkas, L. Logan, X.-H. Sun. "HCompress: Hierarchical Data Compression for Multi-Tiered Storage Environments." IEEE International Parallel and Distributed Processing Symposium (IPDPS'20), May 18-22, 2020. Conference, May 2020.
  • A. Kougkas, H. Devarajan, X.-H. Sun. "I/O Acceleration via Multi-Tiered Data Buffering and Prefetching." Journal of Computer Science and Technology (JCST'20), vol. 35, no. 1, pp. 92-120. Journal, January 2020.
  • H. Devarajan, A. Kougkas, X.-H. Sun. "HFetch: Hierarchical Data Prefetching in Multi-Tiered Storage Environments." The International Conference for High Performance Computing, Networking, Storage and Analysis (SC'19), Ph.D. Forum. Best Poster Nominee. Poster, November 2019.
  • A. Kougkas, H. Devarajan, J. Lofstead, X.-H. Sun. "LABIOS: A Distributed Label-Based I/O System." The 28th International Symposium on High-Performance Parallel and Distributed Computing (HPDC'19), Phoenix, USA, 2019, pp. 13-24. Karsten Schwan Best Paper Award. Conference, June 2019.
  • H. Devarajan, A. Kougkas, X.-H. Sun. "An Intelligent, Adaptive, and Flexible Data Compression Framework." IEEE/ACM International Symposium in Cluster, Cloud, and Grid Computing (CCGrid'19), Larnaca, Cyprus, 2019, pp. 82-91. Conference, May 2019.
  • H. Devarajan, A. Kougkas, P. Challa, X.-H. Sun. "Vidya: Performing Code-Block I/O Characterization for Data Access Optimization." The IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC'18), Bengaluru, India, 2018, pp. 255-264. Conference, December 2018.
  • A. Kougkas, H. Devarajan, X.-H. Sun, J. Lofstead. "Harmonia: An Interference-Aware Dynamic I/O Scheduler for Shared Non-Volatile Burst Buffers." The IEEE International Conference on Cluster Computing (Cluster'18), Belfast, UK, 2018, pp. 290-301. Conference, September 2018.
  • A. Kougkas, H. Devarajan, X.-H. Sun. "Hermes: A Heterogeneous-Aware Multi-Tiered Distributed I/O Buffering System." The 27th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC'18), Tempe, AZ, USA, 2018, pp. 219-230. Conference, June 2018.

Members

Collaborators

National Science Foundation (NSF CSSI-2104013)