Only aggressive elephants are fast elephants

Dittrich, Jens; Quiané-Ruiz, Jorge-Arnulfo; Richter, Stefan; Schuh, Stefan; Jindal, Alekh; Schad, Jörg

doi:10.14778/2350229.2350272

Cited by 81 publications

(79 citation statements)

References 26 publications


“…Later we will assert for hybrid approach to exploit both static and adaptive mechanisms so that flexible to query workload indexes are created. Physically reorders data rows using Quick sort [2] and stores sorted rows as a block [6]. Complexity of Quick Sort:…”

Section: Discussion (mentioning)

confidence: 99%

“…Less size than non-clustered [6]. However, for multiple indexes the size reaches storage capacity [20].…”

Section: Non-clustered (mentioning)

confidence: 99%

“…Clustered static indexes which are developed for Hadoop framework, offer indexing on single attribute -Trojan index [9] or varying number of index attributes -HAIL [6]. Indexes are created on whole data set in parallel with data uploading.…”

Section: Related Work (mentioning)

confidence: 99%

“…All copies of last block are updated and record is inserted on its exact location [6]. [4] Record is appended on all copies of last block.…”

Section: Non-clustered (mentioning)

confidence: 99%

“…To improve the efficiency of search and data retrieval process for voluminous data records many solutions have been proposed by researchers. For example, vertical partitioning [12], clustered attribute based indexing [6,9] for distributed parallel processing systems and clustered adaptive indexing [18] for changing query workload. Likewise in medical research, large distributed image data sets face the problem of multi-query optimization and a batch processing based image retrieval system [25] contributes in scheduling multiple query requests and minimized response time is achieved.…”

Section: Introduction (mentioning)

confidence: 99%


On the analysis of big data indexing execution strategies

Siddiqa

Karim

Saba

et al. 2017

IFS


Abstract. Efficient response to search queries is very crucial for data analysts to obtain timely results from big data spanned over heterogeneous machines. Currently, a number of big-data processing frameworks are available in which search operations are performed in distributed and parallel manner. However, implementation of indexing mechanism results in noticeable reduction of overall query processing time. There is an urge to assess the feasibility and impact of indexing towards query execution performance. This paper investigates the performance of state-of-the-art clustered indexing approaches over Hadoop framework which is de facto standard for big data processing. Moreover, this study leverages a comparative analysis of nonclustered indexing overhead in terms of time and space taken by indexing process for varying volume data sets with increasing Index Hit Ratio. Furthermore, the experiments evaluate performance of search operations in terms of data access and retrieval time for queries that use indexes. We then validated the obtained results using Petri net mathematical modeling. We used multiple data sets in our experiments to manifest the impact of growing volume of data on indexing and data search and retrieval performance. The results and highlighted challenges favorably lead researchers towards improved implication of indexing mechanism in perspective of data retrieval from big data. Additionally, this study advocates selection of a non-clustered indexing solution so that optimized search performance over big data is obtained.


BSP cost and scalability analysis for MapReduce operations

Senger

Gil-Costa

Arantes

et al. 2015

Concurrency and Computation


Data abundance poses the need for powerful and easy-to-use tools that support processing large amounts of data. MapReduce has been increasingly adopted for over a decade by many companies, and more recently, it has attracted the attention of an increasing number of researchers in several areas. One main advantage is that the complex details of parallel processing, such as complex network programming, task scheduling, data placement, and fault tolerance, are hidden in a conceptually simple framework. MapReduce is supported by mature software technologies for deployment in data centers such as Hadoop. As MapReduce becomes popular for high-performance applications, many questions arise concerning its performance and efficiency. In this paper, we demonstrated formally lower bounds on the isoefficiency function for MapReduce applications, when these applications can be modeled as BSP jobs. We also demonstrate how communication and synchronization costs can be dominant for MapReduce computations and discuss the conditions under which such scalability limits are valid. To our knowledge, this is the first study that demonstrates scalability bounds for MapReduce applications. We also discuss how some MapReduce implementations such as Hadoop can mitigate such costs to approach linear, or near-to-linear speedups.


An efficient similarity join approach on large‐scale high‐dimensional data using random projection

Zhang

Jia

et al. 2019

Concurrency and Computation


Similarity join on large-scale high-dimensional data faces major challenges because of the data scale and the curse of dimensionality. Random projection with a p-stable distribution can reduce high-dimensional data from d dimensions to k dimensions (k ≪ d); the distance between data points in the k-dimensional space can then be used to filter out as many data pairs as possible at relatively low cost. Based on the above idea, we proposed two novel approaches to deal with large-scale high-dimensional data similarity join: the projection-based similarity join (PromSimJ) algorithm and the projection space partitioning-based similarity join (ProSPSimJ) algorithm. Comprehensive experiments were performed to test the performance of the above methods. We also compared the performance of the above methods with that of the naive block nested loop join method. The final experimental results prove that our approaches have much better performance and good scalability.

KEYWORDS: high-dimensional data, p-stable distribution, random projection, similarity join

INTRODUCTION

Similarity join query (SJQ) aims to find all the similar data pairs whose similarity is no less than the given similarity threshold (or whose distance is no more than the given distance threshold). As one of the hot research topics in big data analysis, SJQ has been widely used in many similarity search and data mining applications, such as duplicate web page detection [1], personalized recommendation [2], trajectory clustering [3], image classification [4], and so on. Taking detection of duplicate web pages as an example, as the number of web pages increases, duplicate web pages will appear because of human reasons. To detect the duplicate web pages, each web page can first be translated into a high-dimensional vector after processing; then the distance between each pair of vectors is calculated, and if the distance of one pair of vectors is less than the given distance threshold, they can be considered duplicates.

The distance calculation is a time-costly operation because of the large number of web pages and the high dimensionality of the web-page vector. There has been much research on SJQ, but some big challenges still exist when dealing with SJQ on large-scale high-dimensional data. As the dimensionality increases, the traditional filtering schemes based on tree-like indexes or space partitioning do not work. When the dimensionality is bigger than some threshold, the performance of the tree-like index [5] is perhaps…

Concurrency Computat Pract Exper. 2019;31:e5303.
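The filtering idea in the abstract above (project to k ≪ d dimensions, discard pairs that are far apart in the projected space, verify the rest exactly) can be illustrated with a minimal sketch. This is not the authors' PromSimJ or ProSPSimJ algorithm; it only assumes a Gaussian projection (the Gaussian distribution is 2-stable) and Euclidean distance. The function names, the `slack` parameter, and the default values are hypothetical choices made for illustration. The filter is probabilistic, so a true pair may occasionally be dropped.

```python
import math
import random
from itertools import combinations


def gaussian_projection(d, k, seed=0):
    """Build a k x d matrix of N(0, 1) entries scaled by 1/sqrt(k).

    Assumption: Gaussian (2-stable) entries; the scaling keeps squared
    Euclidean distances unchanged in expectation.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]


def project(matrix, vec):
    """Map a d-dimensional vector to k dimensions."""
    return [sum(a * x for a, x in zip(row, vec)) for row in matrix]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def similarity_join(points, threshold, k=4, slack=2.0, seed=0):
    """Return index pairs (i, j), i < j, whose exact distance is <= threshold.

    Pairs whose projected distance exceeds slack * threshold are skipped
    without computing the exact distance. `slack` is a hypothetical tuning
    knob: larger values filter less but lose fewer true pairs.
    """
    d = len(points[0])
    matrix = gaussian_projection(d, k, seed)
    projected = [project(matrix, p) for p in points]
    result = []
    for i, j in combinations(range(len(points)), 2):
        # Cheap test in k dimensions.
        if euclidean(projected[i], projected[j]) > slack * threshold:
            continue
        # Exact verification in d dimensions, so no false positives remain.
        if euclidean(points[i], points[j]) <= threshold:
            result.append((i, j))
    return result


if __name__ == "__main__":
    data = [[0.0] * 16, [0.0] * 16, [10.0] * 16]
    print(similarity_join(data, threshold=0.5))
```

Because every surviving pair is verified exactly, the output is always a subset of what a brute-force join would return; the only possible error is a missed pair.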
