Thi-To-Quyen Tran scite author profile

Thi-To-Quyen Tran

4 Publications

6 Citation Statements Received

112 Citation Statements Given

Affiliations

Institut de Recherche en Informatique et Systèmes Aléatoires, University of Rennes

Publications

Improving Hamming distance-based fuzzy join in MapReduce using Bloom Filters

Tran, Phan, Laurent, et al. 2018

The join is one of the key operations in databases: it combines data from several tables, pairing two tuples when they share the same value on some attribute(s). A fuzzy or similarity join combines all pairs of tuples, from one or several relations, whose distance is lower than or equal to a prespecified threshold ε. Fuzzy joins have been studied by many researchers because of their practical applications. However, the join is the most costly operation and may not even be feasible to compute on large databases. In this paper, we thus propose an optimization for MapReduce algorithms that process fuzzy joins of binary strings using the Hamming distance. In particular, we propose to use an extension of Bloom filters to eliminate redundant data, reduce unnecessary comparisons, and avoid duplicate output. We compare and evaluate the algorithms analytically with a cost model.
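
To make the idea in this abstract concrete, here is a minimal single-machine sketch of a Hamming-distance fuzzy join with a Bloom filter used as a pre-filter. It is not the authors' MapReduce algorithm or their Bloom filter extension, which the abstract does not describe. It assumes one simple strategy: insert every string within distance ε of each tuple of S into a standard Bloom filter, then discard tuples of R that the filter rejects. The names `BloomFilter`, `ball`, and `fuzzy_join`, and the parameters `m` and `k`, are hypothetical.

```python
import hashlib
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def ball(s: str, eps: int):
    """Yield every bit string within Hamming distance eps of s (s included)."""
    for d in range(eps + 1):
        for positions in combinations(range(len(s)), d):
            chars = list(s)
            for p in positions:
                chars[p] = "1" if chars[p] == "0" else "0"
            yield "".join(chars)


class BloomFilter:
    """Plain Bloom filter: m bits, k hash functions (assumed defaults)."""

    def __init__(self, m: int = 1024, k: int = 3):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _indexes(self, item: str):
        # Derive k bit positions by salting SHA-256 with the hash number.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for idx in self._indexes(item):
            self.bits[idx] = True

    def might_contain(self, item: str) -> bool:
        return all(self.bits[idx] for idx in self._indexes(item))


def fuzzy_join(r, s, eps):
    """Pairs (x, y), x in r and y in s, with hamming(x, y) <= eps."""
    bf = BloomFilter()
    for t in s:
        for neighbour in ball(t, eps):
            bf.add(neighbour)
    # The filter never rejects a true match, so no result is lost here.
    candidates = [x for x in r if bf.might_contain(x)]
    return sorted((x, y) for x in candidates for y in s if hamming(x, y) <= eps)


R = ["0000", "1111", "1010"]
S = ["0001", "0101"]
result = fuzzy_join(R, S, eps=1)
```

Because a Bloom filter has no false negatives, the filtered join returns the same pairs as a brute-force comparison; false positives only cost extra distance checks. Enumerating the Hamming ball grows quickly with ε and string length, so this sketch is only reasonable for small values.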

Efficient Processing of Recursive Joins on Large-Scale Datasets in Spark

Phan, Tran, et al. 2019

Optimization for Large-Scale Fuzzy Joins Using Fuzzy Filters in MapReduce

Tran, Phan, Laurent, et al. 2020

A fuzzy or similarity join is one of the most useful data processing and analysis operations for Big Data in a general context. It combines pairs of tuples for which the distance is lower than or equal to a given threshold ε. The fuzzy join is used in many practical applications, but it is extremely costly in time and space, and may not even be executable on large-scale datasets. Although some studies have improved its performance by applying filters, an effective fuzzy filter for the join has never been proposed. In this paper, we thus extend our previous work by proposing a novel fuzzy filter to optimize fuzzy joins. This filter is a compact, probabilistic data structure that supports very fast similarity queries by maintaining a bit matrix, with a small false positive rate and a zero false negative rate. We show that our proposal is more efficient than others because it eliminates redundant data, reduces computation cost, and avoids duplicate output.
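
The abstract does not specify how the bit matrix is laid out, so the following is only an illustrative sketch of one way a bit matrix can answer similarity queries with no false negatives. It relies on the pigeonhole principle: if a string is split into ε + 1 segments, any string within Hamming distance ε must agree exactly on at least one segment. The class `SegmentBitMatrix`, its `columns` parameter, and the hashing scheme are assumptions made for illustration, not the filter proposed in the paper.

```python
import hashlib


class SegmentBitMatrix:
    """Bit matrix with one row per segment and one column per hash bucket."""

    def __init__(self, length: int, eps: int, columns: int = 64):
        self.length = length
        self.columns = columns
        n = eps + 1
        # Split [0, length) into eps + 1 contiguous, near-equal segments.
        self.bounds = [(i * length // n, (i + 1) * length // n) for i in range(n)]
        self.matrix = [[False] * columns for _ in range(n)]

    def _column(self, segment: str) -> int:
        digest = hashlib.sha256(segment.encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.columns

    def _check(self, s: str) -> None:
        if len(s) != self.length:
            raise ValueError("unexpected string length")

    def add(self, s: str) -> None:
        """Record each segment of s in the row for that segment."""
        self._check(s)
        for row, (lo, hi) in enumerate(self.bounds):
            self.matrix[row][self._column(s[lo:hi])] = True

    def might_match(self, s: str) -> bool:
        """True if some stored string may lie within eps of s.

        A False answer is definitive; True may be a false positive
        caused by a hash collision or a coincidental shared segment.
        """
        self._check(s)
        return any(
            self.matrix[row][self._column(s[lo:hi])]
            for row, (lo, hi) in enumerate(self.bounds)
        )


filt = SegmentBitMatrix(length=8, eps=1)
filt.add("00001111")
near = filt.might_match("00001110")  # distance 1 from the stored string
```

A tuple rejected by `might_match` can be dropped before any distance computation, which is the role the abstract assigns to the filter. The false positive rate of this sketch depends on `columns` and on how often unrelated strings share a segment.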

A Theoretical and Experimental Comparison of Large-Scale Join Algorithms in Spark

Phan, Trieu, et al. 2021

SN COMPUT. SCI.

Currently, the estimated amount of data created daily has reached the scale of petabytes or even zettabytes globally. It is no wonder that traditional data processing technologies cannot process and manage such extremely large volumes of data. However, these massive and varied data can be used to address business problems that could not be tackled before. To discover their value, it is necessary to perform query operations effectively in a parallel and distributed manner. One of the standard and common query operations is the join, which is expensive. This research systematically presents a theoretical and experimental comparison of the prominent join algorithms in the Spark environment. First, this study details the important strategies for two-way joins and recursive joins. Then, it sets out the advantages and disadvantages of each approach. In particular, the work provides mathematical cost models to make a more convincing comparison of the joins before verifying it by experiments. The results show that the comparison based on the cost models is consistent with the experimental one. Generally, the two-way and recursive joins using filters are the best choices in the Spark environment.
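
As a rough illustration of why a filter can help a two-way join, the sketch below compares a plain hash join with one that first drops tuples whose key cannot match. It is plain Python rather than Spark, it uses an exact key set as a stand-in for a probabilistic filter, and it does not reproduce the paper's algorithms or cost models. The function names `hash_join` and `filtered_join` are hypothetical.

```python
from collections import defaultdict


def hash_join(left, right):
    """Equi-join two lists of (key, value) pairs on key."""
    index = defaultdict(list)
    for k, v in right:
        index[k].append(v)
    return [(k, lv, rv) for k, lv in left for rv in index.get(k, [])]


def filtered_join(left, right):
    """Join after discarding left tuples whose key is absent from right.

    Returns the join result and the number of left tuples that survive
    the filter, i.e. the tuples that would still need to be shuffled.
    """
    right_keys = {k for k, _ in right}  # exact set, stand-in for a Bloom filter
    kept = [(k, v) for k, v in left if k in right_keys]
    return hash_join(kept, right), len(kept)


left = [(1, "a"), (2, "b"), (3, "c")]
right = [(2, "x"), (2, "y"), (4, "z")]
plain = hash_join(left, right)
filtered, shuffled = filtered_join(left, right)
```

Both joins return the same rows, but only one of the three left tuples passes the filter. In a distributed setting that reduction is what lowers the volume of data moved between nodes, at the price of building and distributing the filter.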
