2019
DOI: 10.1609/aaai.v33i01.33019601

Cleaning Noisy and Heterogeneous Metadata for Record Linking across Scholarly Big Datasets

Abstract: Automatically extracted metadata from scholarly documents in PDF format is usually noisy and heterogeneous, often containing incomplete fields and erroneous values. One common way of cleaning metadata is to use a bibliographic reference dataset. The challenge is to match records between corpora with high precision. The existing solution, which is based on information retrieval and string similarity on titles, works well only if the titles are cleaned. We introduce a system designed to match scholarly document e…

Citations: cited by 8 publications (2 citation statements)
References: 6 publications
“…The study used two datasets: CiteSeerX and IEEE. Similar research has also been conducted by Sefid et al [14] using four datasets: CiteSeerX, WoS, DBLP, and PubMed. A machine learning approach was adopted to perform entity matching using as many as seven features.…”
Section: Issn: 2302-9285
Citation type: supporting
confidence: 64%
“…Other factors that cause duplication are typography errors [12], omitted fields, and missing values [13]. The absence of a digital object identifier (DOI) in a scientific article also causes duplication [14]. According to research by Gyawali et al [15], more than 82% of the papers collected in databases do not have a DOI.…”
Section: Introduction
Citation type: mentioning
confidence: 99%