An Optimized Strategy for Small Files Storing and Accessing in HDFS

Lyu, Yanfeng; Fan, Xin; Liu, Kun

doi:10.1109/cse-euc.2017.112

Cited by 16 publications

(6 citation statements)

References 5 publications

“…By giving the notion of distribution and correlation when merging the files, Xun Cai et al [16] increased the access and storage efficiency of small files. Yanfeng Lyu et al [18] describe an efficient merging approach that considerably reduces the access time for small files by using the concepts of caching and prefetching. X. Fu et al [19] suggested a block replica placement technique for effectively processing small files, in which files are merged according to pre-determined parameters.…”

Section: Related Work (mentioning)

confidence: 99%

A Dynamic Repository Approach for Small File Management With Fast Access Time on Hadoop Cluster: Hash Based Extended Hadoop Archive

Sharma, Barwar, Afthanorhan et al. 2022, IEEE Access

Small-file processing is one of the challenging tasks in Hadoop. Hadoop performs well on large files because they require less metadata and consume less memory. With an enormous number of small files, however, metadata grows linearly, the NameNode's memory becomes overloaded, and overall performance degrades. This paper presents a dual-merge technique, HB-EHA (Hash Based-Extended Hadoop Archive), that addresses the small-file problem in Hadoop and targets the massive numbers of small files generated by health care management applications. The proposed technique merges small files using two-level compaction, which reduces the size of the metadata at the NameNode and the memory it uses. Indexing is carried out over the archives, and files can be accessed in real time after merging. Index files in the proposed approach can be read partially, which improves NameNode memory usage, and files can be appended to an existing archive. The technique first creates a Hadoop archive from the small files and then uses two special hash functions: SSHF (Scalable-Splittable Hash Function) and HT-MMPHF (Hollow Trie Monotone Minimal Perfect Hash Function). SSHF dynamically distributes the archive's metadata to the associated slave index files, which are then written to the final index files; HT-MMPHF preserves the order of the metadata in the final index file. The evaluation shows that the proposed technique is 13% and 17% faster than HDFS with caching enabled and disabled, respectively, and 38% and 47% faster than HAR with and without caching, respectively. Compared with MapFile, the proposed technique is 28 and 35 times faster with and without caching, respectively. HB-EHA is at most 40% and 28% faster than HBAF with and without caching, respectively.
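The linear growth of NameNode metadata mentioned in this abstract can be illustrated with a rough back-of-the-envelope estimate. The sketch below is not taken from the paper: it assumes the commonly quoted rule of thumb of about 150 bytes of NameNode heap per file or block object and a 128 MiB block size, and it ignores the archive's own index overhead. The function names are hypothetical.

```python
import math

BYTES_PER_OBJECT = 150          # assumed rule-of-thumb cost of one NameNode object
BLOCK_SIZE = 128 * 1024 * 1024  # assumed HDFS block size (128 MiB)

def namenode_bytes(num_files, file_size):
    """Estimate NameNode heap used by num_files files of file_size bytes each."""
    blocks_per_file = max(1, math.ceil(file_size / BLOCK_SIZE))
    objects = num_files * (1 + blocks_per_file)  # one file object plus its block objects
    return objects * BYTES_PER_OBJECT

def merged_namenode_bytes(num_files, file_size):
    """Estimate for the same data packed into block-sized merged files."""
    total = num_files * file_size
    num_merged = max(1, math.ceil(total / BLOCK_SIZE))
    return namenode_bytes(num_merged, BLOCK_SIZE)

# Ten million files of 100 KiB each, stored individually and after merging.
small = namenode_bytes(10_000_000, 100 * 1024)
merged = merged_namenode_bytes(10_000_000, 100 * 1024)
print(f"individual files: {small / 1e9:.1f} GB, merged: {merged / 1e6:.1f} MB")
```

Under these assumptions the estimate drops by roughly three orders of magnitude, which is the motivation shared by the merging approaches listed on this page.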


“…Xun Cai et al [14] improved the access and storage efficiency of small files. Yanfeng Lyu et al [15] proposed an efficient merging method that substantially reduces the access time for small files by using caching and prefetching methods. X. Fu et al [16] proposed a block replica placement technique for effectively processing small files where files are merged as per the predetermined parameters.…”

Section: Literature Review (mentioning)

confidence: 99%

Optimization of Small Sized File Access Efficiency in Hadoop Distributed File System by Integrating Virtual File System Layer

Alange, Mathur 2022, IJACSA

Hadoop was created to store large datasets, handle data in different formats, and cope with data generated at high speed, which makes it a solution to these big data problems. This work proposes an improved solution, in terms of access efficiency and time, for small files. A novel architecture called VFS-HDFS is designed that focuses on optimizing small-file access and improves significantly on existing solutions, i.e. HDFS sequence files, HAR, and NHAR. In the proposed work, a virtual file system layer is added as a wrapper on top of the existing HDFS architecture, without altering that architecture. The drawbacks of the existing techniques, i.e. the Flat File Technique and the Table Chain Technique, which are implemented in HDFS HAR, NHAR, and sequence files, are overcome by using the Bucket Chain Technique. The files to merge into a single bucket are selected using an ensemble classifier, a combination of different classifiers that gives more accurate results. With the proposed system, better results are obtained than with the existing systems in terms of access efficiency for small files in HDFS.


“…[16] [17] proposed Extended Hadoop Distributed File System (EHDFS) which has been designed and implemented in such a way that a large number of small files can be merged into a single combined file and it also provides a framework for prefetching metadata for a specified number of files. Yanfeng Lyu proposed an optimized strategy for small files storing and accessing in HDFS [18]. In their work, their method considers the size of small files when merging files into combine file, and generates a map record for each small file.…”

Section: Related Work (mentioning)

confidence: 99%
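The citation statement above describes merging that takes file size into account and produces a map record for each small file. A minimal in-memory sketch of that idea follows. The packing rule (largest first, first fit) and the record layout are assumptions made for illustration, not the authors' algorithm, and `MapRecord`, `merge_small_files`, and `read_small_file` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MapRecord:
    combined_id: int  # which combined file holds the small file
    offset: int       # byte offset inside that combined file
    length: int       # size of the small file in bytes

def merge_small_files(files, capacity):
    """Pack {name: bytes} into combined buffers of at most `capacity` bytes."""
    combined = [bytearray()]
    index = {}
    # Largest first, so big files are placed before the remaining gaps shrink.
    for name, data in sorted(files.items(), key=lambda kv: len(kv[1]), reverse=True):
        if len(data) > capacity:
            raise ValueError(f"{name} is larger than one combined file")
        # First fit: reuse the first combined buffer with enough room left.
        for cid, buf in enumerate(combined):
            if len(buf) + len(data) <= capacity:
                break
        else:
            combined.append(bytearray())
            cid, buf = len(combined) - 1, combined[-1]
        index[name] = MapRecord(cid, len(buf), len(data))
        buf.extend(data)
    return combined, index

def read_small_file(combined, index, name):
    """Recover one small file from the combined buffers via its map record."""
    rec = index[name]
    return bytes(combined[rec.combined_id][rec.offset:rec.offset + rec.length])

files = {"a.txt": b"alpha", "b.txt": b"bravo-bravo", "c.txt": b"c"}
combined, index = merge_small_files(files, capacity=12)
```

In a real deployment the combined buffers would be HDFS files sized near the block size, and the map records would live in an index that a client can cache or prefetch.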

LHF: A New Archive based Approach to Accelerate Massive Small Files Access Performance in HDFS

Tao, Zhai, Tchaye-Kondi 2019, EasyChair Preprints

As one of the most popular open source projects, Hadoop is nowadays considered the de facto framework for managing and analyzing huge amounts of data. HDFS (Hadoop Distributed File System) is one of the core components of the Hadoop framework for storing big data, especially semi-structured and unstructured data. HDFS provides high scalability and reliability when handling large files across thousands of machines, but its performance degrades severely when dealing with massive numbers of small files. Although some effort has been spent investigating this well-known issue, existing approaches, such as HAR, SequenceFile, and MapFile, are limited in their ability to reduce the memory consumption of the NameNode while also optimizing access performance. In this paper, we present LHF, a solution for handling massive small files in HDFS that merges small files into big files and builds an extendable index based on linear hashing to speed up locating a small file. The advantages of our approach are that (1) it significantly reduces the size of the metadata, (2) it does not require sorting the files on the client side, (3) it supports appending more small files to the merged file afterwards, and (4) it achieves good access performance. A series of experiments demonstrates the effectiveness and efficiency of LHF, which takes less time to access files than the other methods.
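To make the idea of a linear hashing based index concrete, here is a toy in-memory sketch of textbook linear hashing that maps a small file's name to its location in a merged file. It is not the LHF implementation: the class name, the load threshold, the use of CRC32 as the hash, and the `(merged_file_id, offset, length)` value layout are all assumptions made for illustration.

```python
import zlib

class LinearHashIndex:
    """Toy linear hashing table mapping file names to locations."""

    def __init__(self, initial_buckets=4, max_load=2.0):
        self.n0 = initial_buckets  # bucket count at level 0
        self.level = 0             # number of completed doubling rounds
        self.split = 0             # next bucket to split in this round
        self.max_load = max_load   # average entries per bucket before a split
        self.buckets = [[] for _ in range(initial_buckets)]
        self.count = 0

    def _address(self, key):
        h = zlib.crc32(key.encode("utf-8"))
        addr = h % (self.n0 << self.level)
        if addr < self.split:      # this bucket was already split in this round
            addr = h % (self.n0 << (self.level + 1))
        return addr

    def put(self, key, value):
        bucket = self.buckets[self._address(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:           # overwrite an existing entry
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.count += 1
        if self.count / len(self.buckets) > self.max_load:
            self._split_one()

    def _split_one(self):
        # Grow by exactly one bucket: only the bucket at `split` is rehashed.
        old = self.buckets[self.split]
        self.buckets[self.split] = []
        self.buckets.append([])
        self.split += 1
        if self.split == self.n0 << self.level:
            self.level += 1        # round finished: table has doubled
            self.split = 0
        for k, v in old:
            self.buckets[self._address(k)].append((k, v))

    def get(self, key):
        for k, v in self.buckets[self._address(key)]:
            if k == key:
                return v
        return None

index = LinearHashIndex()
for i in range(1000):
    # Hypothetical location record: (merged_file_id, offset, length).
    index.put(f"file-{i}", (i // 100, (i % 100) * 10, 10))
```

Because the table grows one bucket at a time, new entries can be appended without rebuilding or sorting the whole index, which matches the append-friendly property the abstract emphasizes.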

