2021
DOI: 10.3390/sym13020195

Choosing a Data Storage Format in the Apache Hadoop System Based on Experimental Evaluation Using Apache Spark

Abstract: One of the most important tasks of any platform for big data processing is storing the data received. Different systems have different requirements for the storage formats of big data, which raises the problem of choosing the optimal data storage format to solve the current problem. This paper describes the five most popular formats for storing big data, presents an experimental evaluation of these formats and a methodology for choosing the format. The following data storage formats will be considered: avro, C…

Cited by 13 publications (3 citation statements)
References 24 publications
“…While the basic unit of information is very straightforward in a data file stored in characters (one byte equals one character), finding the actual data values is often much harder. This means that it is usually necessary to read the entire file to find any value [23][24][25].…”
Section: Binary Format Of Tsmld Storage Files (mentioning)
confidence: 99%
“…First of all, the massive amount of data has led to the reduction of information accuracy, and the change in the status and behavior of the relationship between things (Belov et al, 2021). The traditional accurate results are replaced by data that can be arranged (Nurnawati et al, 2020).…”
Section: Introduction (mentioning)
confidence: 99%
“…The Apache Spark framework is much faster because of its in-memory storage and distributed computation. Parquet and Avro file systems are used in Spark frameworks which are very fast and efficient for big data analytics [3,4]. The data is represented as a binary format in Parquet and Avro.…”
Section: Introduction (mentioning)
confidence: 99%