Towards Optimizing Large-Scale Data Transfers with End-to-End Integrity Verification
Authors: S. Liu, E.-S. Jun, R. Kettimuthu, X.-H. Sun, M. Papka
Date: December 2016
Venue: 4th International Workshop on Distributed Storage Systems and Coding for Big Data, in conjunction with IEEE BigData 2016, Washington, D.C., USA
Type: Workshop
Abstract
The scale of scientific data generated by experimental facilities and by simulations at high-performance computing facilities has been growing rapidly. In many cases, this data needs to be transferred rapidly and reliably to remote facilities for storage, analysis, and sharing. At the same time, users want to verify the integrity of the data by computing a checksum after the data has been written to disk at the destination, to ensure that the file has not been corrupted, for example by network or storage data corruption, software bugs, or human error. This end-to-end integrity verification creates additional overhead (extra disk I/O and more computation) and increases the overall data transfer time. In this paper, we evaluate strategies to maximize the overlap between data transfer and checksum computation. More specifically, we evaluate file-level and block-level (with various block sizes) pipelining to overlap data transfer and checksum computation. We evaluate these pipelining approaches in the context of GridFTP, a widely used protocol for science data transfers. We conduct both theoretical analysis and real experiments to evaluate our methods. The results show that block-level pipelining is effective in maximizing the overlap between data transfer and checksum computation, and it can improve the overall data transfer time with end-to-end integrity verification by up to 70% compared to the sequential execution of transfer and checksum, and by up to 60% compared to file-level pipelining.
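To illustrate the block-level pipelining idea described in the abstract, the following is a minimal, hypothetical Python sketch, not the authors' GridFTP implementation. The names block_source, dest_path, and BLOCK_SIZE, and the choice of MD5, are assumptions for illustration: a destination-side receiver writes each incoming block to disk and hands its offset to a worker thread, which reads the block back and updates a running digest, so the checksum of completed blocks overlaps with the transfer of later blocks.

```python
import hashlib
import queue
import threading

# Hypothetical block size; the paper evaluates several block sizes.
BLOCK_SIZE = 4 * 1024 * 1024


def checksum_worker(dest_path, ready, digest):
    """Read back finished blocks from disk and fold them into the checksum."""
    with open(dest_path, "rb") as f:
        while True:
            item = ready.get()
            if item is None:          # sentinel: no more blocks
                return
            offset, length = item
            f.seek(offset)
            digest.update(f.read(length))


def receive_with_pipelined_checksum(block_source, dest_path):
    """Write incoming blocks while checksumming completed blocks in parallel."""
    ready = queue.Queue(maxsize=16)   # bounded so the reader cannot fall far behind
    digest = hashlib.md5()            # assumed checksum algorithm
    worker = threading.Thread(target=checksum_worker,
                              args=(dest_path, ready, digest))

    offset = 0
    with open(dest_path, "wb") as out:
        worker.start()
        for block in block_source:    # block_source yields received data blocks
            out.write(block)
            out.flush()               # make the block visible to the reader thread
            ready.put((offset, len(block)))
            offset += len(block)

    ready.put(None)                   # signal the worker that the transfer is done
    worker.join()
    return digest.hexdigest()         # compare against the checksum at the source
```

In a sequential scheme the read-back and digest update would start only after the last block is written; in this sketch they proceed block by block, which is the overlap the paper seeks to maximize.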