Apollo: An ML-assisted Real-Time Storage Resource Observer

Authors: N. Rajesh, H. Devarajan, J. Cernuda, K. Bateman, L. Logan, J. Ye, A. Kougkas, X.-H. Sun

Date: June 2021

Venue: The 30th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC'21), June 21-25, 2021

Type: Conference

Abstract

Applications and middleware services, such as data placement engines, I/O scheduling, and prefetching engines, require low-latency access to telemetry data in order to make optimal decisions. However, typical monitoring services store their telemetry data in a database in order to allow applications to query them, resulting in significant latency penalties. This work presents Apollo: a low-latency monitoring service that aims to provide applications and middleware libraries with direct access to relational telemetry data. Monitoring the system can create interference and overhead, slowing down raw performance of the resources for the job. However, having a current view of the system can aid middleware services in making more optimal decisions which can ultimately improve the overall performance. Apollo has been designed from the ground up to provide low latency, using Publish–Subscribe (Pub-Sub) semantics, and low overhead, using adaptive intervals in order to change the length of time between polling the resource for telemetry data and machine learning in order to predict changes to the telemetry data between actual resource polling. This work also provides some high level abstractions called I/O curators, which can further aid middleware libraries and applications to make optimal decisions. Evaluations showcase that Apollo can achieve sub-millisecond latency for acquiring complex insights with a memory overhead of ~57MB and CPU overhead being only 7% more than existing state-of-the-art systems.

Links

DOI: 10.1145/3431379.3460640
