Storing incoming data is one of the most important tasks of any big data processing platform. Different systems place different requirements on storage formats, which raises the problem of choosing the optimal format for the task at hand. This paper describes the five most popular formats for storing big data, presents an experimental evaluation of these formats, and proposes a methodology for choosing among them. The formats considered are Avro, CSV, JSON, ORC, and Parquet. In the first stage, a comparative analysis of the main characteristics of the formats was carried out; in the second stage, an experimental evaluation was prepared and performed. For the experiment, a test bench was deployed with big data processing tools installed on it. The aim of the experiment was to measure characteristics of the storage formats, such as data volume and processing speed for different operations, using the Apache Spark framework. In addition, an algorithm for choosing the optimal format from the presented alternatives was developed using tropical optimization methods. The result of the study is a technique for obtaining a vector of ratings of data storage formats for the Apache Hadoop system, based on an experimental assessment using Apache Spark.
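The "volume" characteristic mentioned in this abstract can be illustrated on a small scale. The sketch below is not the paper's experiment: it uses only the Python standard library rather than Apache Spark, covers only the two text formats (CSV and JSON Lines), and the record layout and the function name `encoded_sizes` are made up for illustration.

```python
import csv
import io
import json


def encoded_sizes(records):
    """Return the UTF-8 size in bytes of the same records as CSV and as JSON Lines."""
    fields = list(records[0].keys())

    # CSV: one header row, then one row of values per record.
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields)
    writer.writeheader()
    writer.writerows(records)
    csv_bytes = len(buf.getvalue().encode("utf-8"))

    # JSON Lines: one self-describing object per line (keys repeated every row).
    json_text = "".join(json.dumps(r) + "\n" for r in records)
    json_bytes = len(json_text.encode("utf-8"))

    return {"csv": csv_bytes, "json": json_bytes}


# Hypothetical sample data, used only to compare the two encodings.
records = [{"id": i, "name": f"user{i}", "score": i * 0.5} for i in range(1000)]
sizes = encoded_sizes(records)
print(sizes)
```

Because JSON repeats every field name in every record, it comes out larger than CSV for tabular data like this. Binary columnar formats such as ORC and Parquet would need their own libraries and are outside this sketch.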
Objectives. The paper analyzes the problem of evaluating alternatives from the results of expert paired comparisons. The task is important and relevant because of its many applications in the technical and natural sciences and in the humanities, in fields ranging from construction to politics. In such contexts, the question frequently arises of how to calculate an objective ratings vector from expert evaluations. Mathematically, the problem of finding the vector of objective ratings can be reduced to approximating matrices of paired comparisons by consistent matrices. Methods. Analytical methods and methods of higher algebra are used. For some special cases, the results of numerical calculations are given. Results. It is proven that there always exists a unique consistent matrix that optimally approximates a given inversely symmetric matrix in the log-Euclidean metric. Formulas for calculating this consistent matrix are derived. For small dimensions, examples are considered in which the results of the derived formula are compared with those of other known methods of finding a consistent matrix, namely calculating the eigenvector and minimizing the discrepancy in the log-Chebyshev metric. It is proven that all these methods lead to the same result in dimension 3, while in dimension 4 the results already differ. Conclusions. The results obtained in the paper make it possible to calculate the vector of objective ratings from expert evaluation data. This method can be used in strategic planning in cases where conclusions and recommendations are possible only on the basis of expert evaluations.
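The abstract does not reproduce the derived formulas. As a hedged illustration, the sketch below uses the well-known row geometric mean method, which minimizes the sum of squared differences between log a_ij and log(w_i / w_j). Whether this coincides exactly with the paper's formula is an assumption, and the function names are chosen here for illustration.

```python
import math


def geometric_mean_ratings(a):
    """Ratings vector from a reciprocal pairwise comparison matrix a (a[i][j] = 1 / a[j][i]).

    Each rating is the geometric mean of the corresponding row,
    normalized so that the ratings sum to 1.
    """
    n = len(a)
    w = [math.exp(sum(math.log(x) for x in row) / n) for row in a]
    total = sum(w)
    return [x / total for x in w]


def consistent_matrix(w):
    """Consistent matrix c[i][j] = w[i] / w[j] built from a ratings vector."""
    return [[wi / wj for wj in w] for wi in w]


# Hypothetical expert judgments for three alternatives (slightly inconsistent:
# a[0][1] * a[1][2] = 6, while a[0][2] = 5).
a = [
    [1.0, 2.0, 5.0],
    [1 / 2.0, 1.0, 3.0],
    [1 / 5.0, 1 / 3.0, 1.0],
]
ratings = geometric_mean_ratings(a)
approx = consistent_matrix(ratings)
print(ratings)
```

The matrix `approx` satisfies approx[i][j] * approx[j][k] = approx[i][k] for all indices, which is the consistency property the abstract refers to.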
The task of estimating the parameters of the Pareto distribution from a given sample, above all its exponent, is of practical relevance. This article establishes that knowing the product of the sample elements is enough to obtain this estimate: it is proved that this product is a sufficient statistic for the Pareto distribution parameter. The exponent is estimated by the maximum likelihood method. It is proved that this estimate is biased, and a formula eliminating the bias is justified. For the product of the sample elements, considered as a random variable, the distribution function and probability density are found; the mathematical expectation, higher moments, and differential entropy are calculated, and the corresponding graphs are plotted. It is also noted that any one-to-one function of this product is a sufficient statistic, in particular the geometric mean. For the geometric mean, likewise considered as a random variable, the distribution function, probability density, and mathematical expectation are found; the higher moments and differential entropy are calculated, and the corresponding graphs are plotted. It is further proved that, from a practical point of view, the geometric mean of the sample is a more convenient sufficient statistic than the product of the sample elements. On the basis of the Rao–Blackwell–Kolmogorov theorem, efficient estimates of the Pareto distribution parameter are constructed. Finally, as an example, the technique developed here is applied to the exponential distribution. In this case, both the sum and the arithmetic mean of the sample can be used as sufficient statistics.
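A minimal sketch of estimating the exponent through the geometric mean follows. It assumes the density alpha * x_min**alpha / x**(alpha + 1) for x >= x_min with a known scale x_min; the abstract does not state the parameterization, so this, the function names, and the default x_min = 1 are assumptions. Under them, the maximum likelihood estimate is 1 / ln(G / x_min), where G is the sample geometric mean, and multiplying by (n - 1) / n removes the bias.

```python
import math


def geometric_mean(sample):
    """Geometric mean, computed through logarithms to avoid overflow of the raw product."""
    return math.exp(sum(math.log(x) for x in sample) / len(sample))


def pareto_exponent_mle(sample, x_min=1.0):
    """Maximum likelihood estimate of the Pareto exponent (x_min assumed known)."""
    return 1.0 / math.log(geometric_mean(sample) / x_min)


def pareto_exponent_unbiased(sample, x_min=1.0):
    """Bias-corrected estimate: the MLE scaled by (n - 1) / n; requires n >= 2."""
    n = len(sample)
    return (n - 1) / n * pareto_exponent_mle(sample, x_min)


# Hypothetical sample chosen so that ln(x_i) = 1, 2, 3 and the geometric mean is e**2.
sample = [math.e, math.e ** 2, math.e ** 3]
mle = pareto_exponent_mle(sample)
unbiased = pareto_exponent_unbiased(sample)
print(mle, unbiased)
```

Working with the geometric mean rather than the raw product keeps the intermediate values in a numerically safe range for large samples, which is one practical reading of the abstract's remark that the geometric mean is the more convenient statistic.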
This paper studies the statistical properties of the maximum likelihood estimate of the Pareto distribution exponent. Power-law distributions such as the Pareto distribution have recently attracted close attention from researchers in a wide range of fields of science and technology, from economics and linguistics to the analysis of Internet traffic. The problem of determining the exponent of a power law from a given sample is therefore of exceptional practical importance. It is proved analytically that the proposed estimate is biased, although consistent, and a formula that eliminates the bias is proposed. A formula for the variance of the unbiased estimate is derived analytically. In addition, the problem of finding the distribution function and probability density of this estimate, regarded as a random variable, is posed and solved analytically. The formulas for the mathematical expectation and variance are then obtained a second time, now from the previously found probability density. The results can be used in various areas of human activity, for example, to predict the intensity of natural and man-made disasters. Keywords: Pareto distribution, maximum likelihood method, unbiased estimate.
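The abstract states its results without formulas. For orientation, the standard results for the Pareto exponent with a known scale parameter are sketched below. The notation ($x_m$ for the scale, $\hat{\alpha}$ for the maximum likelihood estimate, $\tilde{\alpha}$ for the corrected one) and the assumption that $x_m$ is known are choices made here and may differ from the paper's.

```latex
% Assumed model: density f(x) = \alpha x_m^{\alpha} x^{-\alpha-1} for x \ge x_m, with x_m known.
\begin{align*}
S &= \sum_{i=1}^{n} \ln\frac{x_i}{x_m} \sim \mathrm{Gamma}(n, \alpha)
  \quad \text{(shape $n$, rate $\alpha$)}, \\
\hat{\alpha} &= \frac{n}{S}, \qquad
\mathbb{E}[\hat{\alpha}] = \frac{n}{n-1}\,\alpha \quad (n > 1), \\
\tilde{\alpha} &= \frac{n-1}{S}, \qquad
\mathbb{E}[\tilde{\alpha}] = \alpha, \qquad
\operatorname{Var}[\tilde{\alpha}] = \frac{\alpha^{2}}{n-2} \quad (n > 2).
\end{align*}
```

The bias factor $n/(n-1)$ tends to 1 as $n$ grows, which agrees with the statement that the maximum likelihood estimate is biased but consistent.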