IC-Data: Improving Compressed Data Processing in Hadoop

Authors: A. Haider, X. Yang, N. Liu, S. He, X.-H. Sun

Date: December 2015

Venue: 22nd annual IEEE International Conference on High Performance Computing (HiPC 2015), Bengaluru, India

Type: Conference

Abstract

As dataset sizes for data analytics applications and scientific applications running on Hadoop increase, data compression has become essential for storing this data at a reasonable cost. Although data is often stored compressed, Hadoop currently takes 49% longer to process compressed data than uncompressed data. Processing compressed data reduces task parallelism and creates uneven workload distribution, both of which are fundamental issues that the MapReduce parallel programming paradigm should alleviate. In this paper, we propose the design and implementation of a Network Overlapped Compression scheme, NOC, and a Compression Aware Storage scheme, CAS. NOC reduces data load time and hides compression overhead by interleaving network I/O with compression. CAS increases parallelism by dynamically changing a file's block size based on its compression ratio. Additionally, we develop a MapReduce Module that recognizes the characteristics of compressed data to improve resource allocation and load balance. Collectively, NOC, CAS, and the MapReduce Module decrease job execution time by 66% on average and data load time by 31%.
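
The idea behind NOC, overlapping compression with network I/O, can be illustrated with a generic producer-consumer pipeline. The following is only a minimal sketch and not the paper's implementation: the chunk size, the function names (`compress_chunks`, `send_chunks`, `overlapped_load`), and the in-memory list standing in for the network are all assumptions made for illustration.

```python
import queue
import threading
import zlib

CHUNK_SIZE = 64 * 1024  # hypothetical chunk size, not taken from the paper


def compress_chunks(data, out_queue):
    """Producer: compress fixed-size chunks and hand each one off as soon as it is ready."""
    for start in range(0, len(data), CHUNK_SIZE):
        out_queue.put(zlib.compress(data[start:start + CHUNK_SIZE]))
    out_queue.put(None)  # sentinel: no more chunks


def send_chunks(in_queue, sink):
    """Consumer: stand-in for network I/O; appends each compressed chunk to `sink`."""
    while True:
        chunk = in_queue.get()
        if chunk is None:
            break
        sink.append(chunk)


def overlapped_load(data):
    """Run compression and (simulated) sending concurrently; return the chunks sent."""
    q = queue.Queue(maxsize=4)  # bounded so compression cannot run far ahead of sending
    sent = []
    producer = threading.Thread(target=compress_chunks, args=(data, q))
    consumer = threading.Thread(target=send_chunks, args=(q, sent))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return sent


if __name__ == "__main__":
    payload = b"hadoop " * 100_000
    chunks = overlapped_load(payload)
    restored = b"".join(zlib.decompress(c) for c in chunks)
    print(len(chunks), restored == payload)
```

Because each chunk is handed to the sender as soon as it is compressed, the time spent compressing one chunk can overlap with the time spent transmitting the previous one, rather than the two costs adding up.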