Hadoop is an open-source Apache project and a software framework for the distributed processing of large datasets across large clusters of commodity hardware. Here, large datasets means terabytes or petabytes of data, whereas large clusters means hundreds or thousands of nodes. Hadoop follows a master-slave architecture with one master node and up to thousands of slave nodes. The NameNode acts as the master node and stores all file metadata, while the DataNodes are slave nodes that store the application data. The NameNode becomes a bottleneck when a large number of small files must be processed, because it consumes more memory to store their metadata and the DataNodes consume more CPU time to process them. This paper presents a novel technique to handle the small file problem in Hadoop based on file merging, caching, and correlation strategies. The experimental results show that the proposed technique reduces the amount of metadata stored at the NameNode and the average memory usage of the DataNodes, and improves the access efficiency of small files in the Hadoop Distributed File System by up to 88.57% compared with the standard solution, Hadoop Archive.
General Terms: Big Data Analytics, Small files in Hadoop.
Keywords: Hadoop, HDFS, MapReduce, small files in Hadoop, small file storage in Hadoop.
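To illustrate the file-merging idea, the sketch below packs the contents of a directory of small HDFS files into a single Hadoop SequenceFile, keyed by original file name, so the NameNode tracks one merged file instead of one metadata entry per small file. This is a minimal sketch using Hadoop's standard SequenceFile API, not the paper's actual merging, caching, or correlation implementation; the SmallFileMerger class and merge method are hypothetical names introduced here for illustration.

import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

/**
 * Illustrative sketch: packs many small HDFS files into one SequenceFile
 * so the NameNode stores metadata for a single merged file rather than
 * one entry per small file.
 */
public class SmallFileMerger {

    public static void merge(Path inputDir, Path mergedFile) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(mergedFile),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {

            for (FileStatus status : fs.listStatus(inputDir)) {
                if (status.isFile()) {
                    // Read the whole small file into memory (small by assumption).
                    byte[] contents = new byte[(int) status.getLen()];
                    try (InputStream in = fs.open(status.getPath())) {
                        IOUtils.readFully(in, contents, 0, contents.length);
                    }
                    // Key: original file name, preserving the mapping needed
                    // later to locate a small file inside the merged container.
                    writer.append(new Text(status.getPath().getName()),
                                  new BytesWritable(contents));
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        merge(new Path(args[0]), new Path(args[1]));
    }
}

A retrieval step would read the SequenceFile sequentially (or via an index, as correlation-aware schemes typically maintain) to recover an individual small file by its key.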