Apollo: An ML-assisted Real-Time Storage Resource Observer
Authors: N. Rajesh, H. Devarajan, J. Cernuda, K. Bateman, L. Logan, J. Ye, A. Kougkas, X.-H. Sun
Date: June, 2021
Venue: The 30th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC'21), June 21-25, 2021
Type: Conference
Abstract
Applications and middleware services, such as data placement engines, I/O scheduling, and prefetching engines, require low-latency access to telemetry data in order to make optimal decisions. However, typical monitoring services store their telemetry data in a database so that applications can query it, resulting in significant latency penalties. This work presents Apollo: a low-latency monitoring service that aims to provide applications and middleware libraries with direct access to relational telemetry data. Monitoring the system can create interference and overhead, slowing down the raw performance of the resources available to the job. However, a current view of the system can help middleware services make better decisions, which can ultimately improve overall performance. Apollo has been designed from the ground up to provide low latency, using Publish–Subscribe (Pub-Sub) semantics, and low overhead, using adaptive intervals to vary the time between polls of a resource for telemetry data and machine learning to predict changes to the telemetry data between actual polls. This work also provides high-level abstractions called I/O curators, which can further help middleware libraries and applications make optimal decisions. Evaluations show that Apollo can achieve sub-millisecond latency for acquiring complex insights, with a memory overhead of ~57 MB and a CPU overhead only 7% higher than existing state-of-the-art systems.
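
The adaptive-interval idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify Apollo's actual policy, so everything here is an assumption made for illustration: the class name `AdaptiveInterval`, its parameters, and the additive-increase/multiplicative-decrease rule are hypothetical. The sketch lengthens the polling interval while a metric stays stable and shortens it when the metric changes by more than a relative threshold.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AdaptiveInterval:
    """Hypothetical polling-interval controller (not Apollo's actual algorithm)."""

    interval: float = 1.0       # current seconds between polls
    min_interval: float = 0.1   # never poll more often than this
    max_interval: float = 10.0  # never poll less often than this
    threshold: float = 0.05     # relative change treated as significant
    step: float = 0.5           # additive increase while the metric is stable
    backoff: float = 0.5        # multiplicative decrease when the metric changes
    last_value: Optional[float] = None

    def update(self, value: float) -> float:
        """Record a new sample and return the next polling interval."""
        if self.last_value is not None:
            # Relative change against the previous sample (guard against zero).
            base = max(abs(self.last_value), 1e-9)
            change = abs(value - self.last_value) / base
            if change > self.threshold:
                # Metric is moving: poll more often.
                self.interval = max(self.min_interval, self.interval * self.backoff)
            else:
                # Metric is stable: poll less often to reduce overhead.
                self.interval = min(self.max_interval, self.interval + self.step)
        self.last_value = value
        return self.interval
```

In such a scheme a monitor would call `update()` after each poll and sleep for the returned number of seconds. A predictor for the values between polls, as the abstract describes, would be a separate component and is not shown here.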