2016
DOI: 10.5120/ijca2016910611

An Efficient Approach for Storing and Accessing Small Files with Big Data Technology

Abstract: Hadoop is an open source Apache project and a software framework for distributed processing of large datasets across large clusters of commodity hardware. Large datasets here means terabytes or petabytes of data, whereas large clusters means hundreds or thousands of nodes. Hadoop uses a master-slave architecture, with one master node and up to thousands of slave nodes. The NameNode acts as the master node and stores all the metadata of files, while the DataNodes act as slave nodes which store all t…
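To make the metadata cost behind the abstract concrete, the following is a minimal sketch of writing one file to HDFS with the Hadoop client API. The NameNode URI and path are hypothetical, and hadoop-client is assumed to be on the classpath; this is an illustration, not code from the paper.

import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally read from core-site.xml.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        // Each file created here becomes one metadata entry held in the
        // NameNode's memory, which is why very large numbers of small files
        // can exhaust a single master node.
        try (FSDataOutputStream out = fs.create(new Path("/data/sample.txt"))) {
            out.write("small file payload".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}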

Cited by 15 publications (8 citation statements)
References 7 publications
“…The metadata file contains the meta information for the L-Store, the connector file contains index entries mapping identifiers to identifiers, and the data file contains the medical fragment corresponding to each identifier. In this way, a large amount of data can be stored in a relatively small number of files. This strategy enables the preferred mode of MapReduce data processing: a small number of large files [27, 28]. As a result, the UHPr is able to achieve transactional consistency.…”
Section: Ubiquitous Health Profile (UHPr)
Citation type: mentioning
Confidence: 99%
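The packing pattern this statement describes, many small records appended to one large data file with a separate index mapping each identifier to an offset and length, can be sketched as follows. This is an illustration of the general idea only, not the authors' L-Store code, and all names here are invented.

import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

public class PackedStore {
    // Connector-file analogue: identifier -> (offset, length) in the data file.
    private final Map<String, long[]> index = new HashMap<>();
    private final RandomAccessFile data;

    public PackedStore(String path) throws Exception {
        this.data = new RandomAccessFile(path, "rw");
    }

    // Append one fragment to the shared data file and record where it landed.
    public void put(String id, byte[] fragment) throws Exception {
        long offset = data.length();
        data.seek(offset);
        data.write(fragment);
        index.put(id, new long[] { offset, fragment.length });
    }

    // Seek straight to the recorded offset; no per-fragment file is needed.
    public byte[] get(String id) throws Exception {
        long[] entry = index.get(id);
        byte[] buf = new byte[(int) entry[1]];
        data.seek(entry[0]);
        data.readFully(buf);
        return buf;
    }
}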
“…HAR is an effective way to relieve NameNode metadata congestion: a HAR file allows the files inside it to be accessed directly, and creating one takes only a few simple commands. On the other hand, a HAR file cannot be altered after it is created; files can be neither added to it nor deleted from it. Its most serious shortcoming is that reading any file inside a HAR requires consulting two index files (the master index and the index) [23], which means reading a file directly from HDFS is considerably simpler than reading it from a HAR. Another limitation is storage: HAR files put extra pressure on the file system because creating an archive copies the original files, consuming as much additional space as the originals [23].…”
Section: Solutions
Citation type: mentioning
Confidence: 99%
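For reference, a HAR archive is built offline with the hadoop archive command and then read back through the har:// scheme, as in the sketch below; all paths and file names here are hypothetical. The two extra lookups the statement mentions correspond to the archive's _masterindex and _index files, which HarFileSystem consults before reading the data itself.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HarReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The archive would have been built beforehand, e.g. with:
        //   hadoop archive -archiveName files.har -p /user/in /user/out
        // HAR is immutable: adding or removing entries means rebuilding it.
        Path inside = new Path("har:///user/out/files.har/small-file-001.txt");
        // The har:// scheme routes the read through HarFileSystem, which
        // consults _masterindex and _index before touching the data.
        FileSystem fs = inside.getFileSystem(conf);
        try (FSDataInputStream in = fs.open(inside)) {
            IOUtils.copyBytes(in, System.out, conf, false);
        }
    }
}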
“…nHAR (new Hadoop Archive) is a revision of the HAR format: the design is almost the same, but with two architectural differences. First, an nHAR file needs only one index file for reads; second, an nHAR archive can be edited, so more files can be added after it is created.…”
Section: Solutions
Citation type: mentioning
Confidence: 99%
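No public nHAR implementation is available, so the following is purely a hypothetical sketch of the single-index, appendable design the statement attributes to it; every name here is invented. Appending writes the new bytes at the end of the data file and adds one record to a single index file, which is what would make such an archive editable after creation.

import java.io.File;
import java.io.FileOutputStream;
import java.io.PrintWriter;

public class AppendableArchive {
    // Appends one entry: the bytes go to the end of the data file, and a
    // single line ("name offset length") goes to the one index file, so a
    // reader needs exactly one index lookup per entry.
    public static void append(String dataPath, String indexPath,
                              String name, byte[] content) throws Exception {
        long offset = new File(dataPath).length(); // next write position
        try (FileOutputStream data = new FileOutputStream(dataPath, true)) {
            data.write(content);
        }
        try (PrintWriter idx = new PrintWriter(
                new FileOutputStream(indexPath, true))) {
            idx.println(name + " " + offset + " " + content.length);
        }
    }
}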
“…A central problem in small-file storage is creating the indices [10]. The small files are grouped into clusters.…”
Section: A. Techniques for Managing Small Files in Hadoop
Citation type: mentioning
Confidence: 99%
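One standard way to realize this kind of clustering in stock Hadoop, often paired with index-based approaches in this literature, is to pack the small files into a SequenceFile, using each file's name as the key and its bytes as the value. The sketch below uses hypothetical paths and assumes hadoop-client on the classpath; the indexing problem then reduces to locating keys inside one large file, and the NameNode tracks a single file instead of millions.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("/data/packed.seq")),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class));
        try {
            // One record per small file: name as key, contents as value.
            byte[] contents = "example payload".getBytes(StandardCharsets.UTF_8);
            writer.append(new Text("small-file-001.txt"),
                          new BytesWritable(contents));
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}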