2013 IEEE International Conference on Big Data
DOI: 10.1109/bigdata.2013.6691698

Efficient near-duplicate document detection using FPGAs

Abstract: Detecting duplicate and near-duplicate documents is critical in applications like Web crawling since it helps save document processing resources. Simhash is a state-of-the-art method to assign a bit-string fingerprint to a document, such that similar documents have similar fingerprints. Finding the near-duplicates in a large collection of documents consists of two stages: (a) compute the simhash fingerprint of each document, and (b) find pairs of similar fingerprints by computing their Hamming distance. Pr…
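
The two stages named in the abstract can be illustrated with a minimal software sketch. It is not the paper's FPGA design: the whitespace tokenizer, the unweighted features, the use of MD5 as the per-token hash, the 64-bit fingerprint width, and the threshold `k` are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import hashlib

FINGERPRINT_BITS = 64  # assumed width; the abstract does not state one


def simhash(text: str) -> int:
    """Stage (a): compute a 64-bit simhash fingerprint of a document.

    Assumption: features are lowercase whitespace-separated tokens,
    each with weight 1.
    """
    counts = [0] * FINGERPRINT_BITS
    for token in text.lower().split():
        # Hash the token to 64 bits (MD5 is an arbitrary choice here).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each bit of the token hash votes +1 or -1 for that position.
        for i in range(FINGERPRINT_BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A fingerprint bit is set where the positive votes dominate.
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def near_duplicate_pairs(fingerprints, k=3):
    """Stage (b): brute-force search for pairs within Hamming distance k.

    This compares every pair, so it is quadratic in the collection size.
    """
    pairs = []
    for i in range(len(fingerprints)):
        for j in range(i + 1, len(fingerprints)):
            if hamming_distance(fingerprints[i], fingerprints[j]) <= k:
                pairs.append((i, j))
    return pairs
```

Calling `near_duplicate_pairs([simhash(d) for d in docs])` on a list of strings `docs` would return index pairs of candidate near-duplicates under these assumptions. The all-pairs loop is only practical for small collections.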

Cited by 2 publications (3 citation statements)
References 16 publications (29 reference statements)

“…There has been also a number of research into application-specific [28,29] and domain-specific accelerators [30,31,32,33]. Using tightly integrated FPGA [27,34] with CPU, and GPU with CPU [1,22], to accelerate big data processing have been proposed in recent work. While deploying programmable accelerator is a new and hot research topic, there has been little attention paid to how …”
Section: Related Work (mentioning)
confidence: 99%
“…BigDataBench [2] was released very recently and includes online service and offline analytics for web service applications. BigBench [27] is a new big data benchmark that adopts TPC-DS as its basis and expands it for offline analytics on Xeon high performance server. The CloudSuite [3,4] benchmark was developed for Scale-Out cloud workloads and mainly includes small data sets, e.g., 4.5 GB for Naïve Bayes.…”
Section: Related Work (mentioning)
confidence: 99%
“…Previous work (Henzinger, 2006;Sood and Loguinov, 2011) has researched the possibilities of optimizing the second stage, i.e., the matching stage, in order to prevent a quadratic complexity of simhash identity similarity calculation. However, previous work show that the simhash identity calculation phase dictates the global execution time (Luo et al, 2013). Hence, in this study we suggest a method to treat the first phase of simhash inspired near-duplicate discovery, by using OpenCL in combination with CPUs, GPUs and FPGAs to rapidly process huge numbers of documents and calculate their simhash identities.…”
Section: Introduction (mentioning)
confidence: 99%