2018
DOI: 10.1007/s11192-018-2865-9

The impact of imbalanced training data on machine learning for author name disambiguation

Abstract: In supervised machine learning for author name disambiguation, negative training data are often dominantly larger than positive training data. This paper examines how the ratios of negative to positive training data can affect the performance of machine learning algorithms to disambiguate author names in bibliographic records. On multiple labeled datasets, three classifiers (Logistic Regression, Naïve Bayes, and Random Forest) are trained through representative features such as coauthor names and title words ex…
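To make the notion of a negative-to-positive training ratio concrete, the following is a minimal sketch of one way to control it: keep every positive record pair and randomly subsample the negative pairs. The function name `subsample_negatives`, the pair layout `((record_a, record_b), label)`, and the toy data are all assumptions for illustration; the abstract does not say how the paper constructs its ratios.

```python
import random


def subsample_negatives(pairs, ratio, seed=0):
    """Keep all positive pairs and at most ratio * (number of positives) negatives.

    pairs: list of ((record_a, record_b), label), where label 1 means the two
           records belong to the same author and label 0 means they do not.
    ratio: desired number of negative pairs per positive pair.
    """
    positives = [p for p in pairs if p[1] == 1]
    negatives = [p for p in pairs if p[1] == 0]
    # Never ask for more negatives than are available.
    keep = min(len(negatives), int(ratio * len(positives)))
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return positives + rng.sample(negatives, keep)


# Hypothetical toy data: 10 matching pairs and 90 non-matching pairs (1:9).
pairs = [((f"rec{i}", f"rec{i + 1}"), 1) for i in range(10)]
pairs += [((f"rec{i}", f"rec{i + 100}"), 0) for i in range(90)]

# Reduce the imbalance to two negatives per positive.
train = subsample_negatives(pairs, ratio=2)
n_pos = sum(1 for _, label in train if label == 1)
n_neg = sum(1 for _, label in train if label == 0)
```

Varying `ratio` and retraining a classifier on each resulting set would be one way to observe how imbalance affects performance, although the original study may proceed differently.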

Cited by 37 publications (25 citation statements)
References 37 publications
“…The impact of forename string on author name disambiguation is measured on four labeled data sets that have been widely used in many studies, separately or jointly (for example, Cota et al, ; Ferreira et al, ; Kim & Kim, ; Momeni & Mayr, ; Müller et al, ; Pereira et al, ; Santana, Gonçalves, Laender, & Ferreira, ; Shin et al, ; Wu, Li, Pei, & He, ; Zhu et al, ).…”
Section: Methods (mentioning)
confidence: 99%
“…PENN: Labeled for Han et al () by researchers at the Pennsylvania State University, this data set was originally comprised of 8,453 name instances with their coauthorship, article title, and venue information. As its original version contained duplication and labeling errors, several studies modified the data set before use (for example, Cota et al, ; Kim & Kim, ; Santana et al, ; Shin et al, ). This study reuses one of recent revisions by Kim () in which 5,018 name instances and their associated metadata are linked to DBLP records after deduplication and verification of correctness…”
Section: Methods (mentioning)
confidence: 99%