Efficient near-duplicate document detection using FPGAs

Luo, Xi; Najjar, Walid; Hristidis, Vagelis

doi:10.1109/bigdata.2013.6691698

Cited by 2 publications

(3 citation statements)

References 16 publications

(29 reference statements)

“…There has been also a number of research into application-specific [28,29] and domain-specific accelerators [30,31,32,33]. Using tightly integrated FPGA [27,34] with CPU, and GPU with CPU [1,22], to accelerate big data processing have been proposed in recent work. While deploying programmable accelerator is a new and hot research topic, there has been little attention paid to how …”

Section: Related Work (mentioning)

confidence: 99%

“…BigDataBench [2] was released very recently and includes online service and offline analytics for web service applications. BigBench [27] is a new big data benchmark that adopts TPC-DS as its basis and expands it for offline analytics on Xeon high performance server. The CloudSuite [3,4] benchmark was developed for Scale-Out cloud workloads and mainly includes small data sets, e.g., 4.5 GB for Naïve Bayes.…”

Section: Related Work (mentioning)

confidence: 99%

Heterogeneous chip multiprocessor architectures for big data applications

Homayoun

2016

Proceedings of the ACM International Conference on Computing Frontiers

Emerging big data analytics applications require a significant amount of server computational power. The costs of building and running a computing server to process big data and the capacity to which we can scale it are driven in large part by those computational resources. However, big data applications share many characteristics that are fundamentally different from traditional desktop, parallel, and scale-out applications. Big data analytics applications rely heavily on specific deep machine learning and data mining algorithms, and are running a complex and deep software stack with various components (e.g., Hadoop, Spark, MPI, HBase, Impala, MySQL, Hive, Shark, Apache, and MongoDB) that are bound together with a runtime software system and interact significantly with I/O and OS, exhibiting high computational intensity, memory intensity, I/O intensity and control intensity. Current server designs, based on commodity homogeneous processors, will not be the most efficient in terms of performance/watt for this emerging class of applications. In other domains, heterogeneous architectures have emerged as a promising solution to enhance energy-efficiency by allowing each application to run on a core that matches resource needs more closely than a one-size-fits-all core. A heterogeneous architecture integrates cores with various micro-architectures and accelerators to provide more opportunity for efficient workload mapping. In this work, through methodical investigation of power and performance measurements, and comprehensive system level characterization, we demonstrate that a heterogeneous architecture combining high performance big and low power little cores is required for efficient big data analytics applications processing, and in particular in the presence of accelerators and near real-time performance constraints.

“…Previous work (Henzinger, 2006;Sood and Loguinov, 2011) has researched the possibilities of optimizing the second stage, i.e., the matching stage, in order to prevent a quadratic complexity of simhash identity similarity calculation. However, previous work show that the simhash identity calculation phase dictates the global execution time (Luo et al, 2013). Hence, in this study we suggest a method to treat the first phase of simhash inspired near-duplicate discovery, by using OpenCL in combination with CPUs, GPUs and FPGAs to rapidly process huge numbers of documents and calculate their simhash identities.…”

Section: Introduction (mentioning)

confidence: 99%

Evaluating the Efficiency of CPUs, GPUs and FPGAs on a Near-Duplicate Document Detection Via OpenCL

Canhasi

2018

Journal of Computer Science

Discovering identical or near-identical items is urgently important in many applications such as Web crawling since it drastically reduces the text processing costs. Simhash is a widely used technique, able to attribute a bit-string identity to a text, such that similar texts have similar identities. In this study, a real-time solution for a simhash calculation in OpenCL is presented. We also show how it can be utilized by multi-CPUs, GPUs and FPGAs. As a result we indicate that the bottom line computation realized on the FPGA through OpenCL provides significant power advantages.
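
The abstract's description of simhash (a bit-string identity such that similar texts get similar identities) can be illustrated with a minimal sketch. This is not the implementation from any of the papers listed here: the whitespace tokenization, the unweighted tokens, the 64-bit width, the use of MD5 as the per-token hash, and the function names `simhash` and `hamming_distance` are all assumptions made for illustration.

```python
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """Return a `bits`-wide simhash fingerprint of `text` (illustrative sketch).

    Assumptions: tokens are lowercase whitespace-separated words with equal
    weight, and `bits` is a multiple of 8 no larger than 128 (the MD5 size).
    """
    counts = [0] * bits
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 on bit positions where its hash has a 1, else -1.
        for i in range(bits):
            counts[i] += 1 if (token_hash >> i) & 1 else -1

    # A fingerprint bit is set when the positive votes outnumber the negative.
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Under this sketch, two documents would be treated as near-duplicates when the Hamming distance between their fingerprints falls below a chosen threshold. The threshold is application-specific and is not given in the text above.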
