
PortHadoop: Support Direct HPC Data Processing in Hadoop

Authors: X. Yang, N. Liu, B. Feng, X.-H. Sun, S. Zhou

Date: October 2015

Venue: IEEE International Conference on Big Data (IEEE BigData 2015). Santa Clara, CA, USA

Type: Conference

Abstract

The success of the Hadoop MapReduce programming model has greatly propelled research in big data analytics. In recent years, there has been growing interest in the High Performance Computing (HPC) community in using Hadoop-based tools to process scientific data. This interest arises because data movement has become prohibitively expensive, high-performance data analytics has become an important part of HPC, and Hadoop-based tools can perform large-scale data processing in a time- and budget-efficient manner. In this study, we propose PortHadoop, an enhanced Hadoop architecture that enables MapReduce applications to read data directly from HPC parallel file systems (PFS). PortHadoop saves HDFS storage space and, more importantly, avoids the otherwise costly data copying. PortHadoop preserves all the semantics of the original Hadoop system and PFS. Therefore, Hadoop MapReduce applications can run on PortHadoop without code changes, except that the input file location is in PFS rather than HDFS. Our experimental results show that PortHadoop can operate effectively and efficiently with the PVFS2 and Ceph file systems.