2013
DOI: 10.1016/j.eswa.2012.08.045

Detecting near-duplicate documents using sentence-level features and supervised learning

Cited by 19 publications (10 citation statements)
References 32 publications
“…← the index of boundary object of matching partition of on
(21) for all ( < < ) in do
(22) if in ℎ then
(23) continue
(24) end if
(25) ← sigNCD( , )…”
Section: P1: Pruning With Lower Bound
Citation type: mentioning
confidence: 99%
“…For example, if two instances have over 90% similarity, they can arguably be defined as redundant. Duplicate detection often regards such examples as 'near duplicates' (9) or 'approximate duplicates' (10). In bioinformatics, 'redundancy' is commonly used to describe records with sequence similarity over a certain threshold, such as 90% for CD-HIT (11).…”
Section: Kinds Of Duplicate
Citation type: mentioning
confidence: 99%
“…Swiss-Prot is expert curated and reviewed, with software support, whereas TrEMBL is curated automatically without review. Here we list the steps of curation in Swiss-Prot, as previously explained elsewhere (38):…”
Section: Quality Control In Uniprot
Citation type: mentioning
confidence: 99%
“…In this work, duplicates are typically defined as records with sequence similarity over a certain threshold, and other factors are not considered. These kinds of duplicates are often referred to as approximate duplicates or near duplicates (37), and are interchangeable with redundancies. For instance, one study located all records with over 90% mutual sequence identity (11).…”
Section: Introduction
Citation type: mentioning
confidence: 99%