2021
DOI: 10.1007/s11192-020-03826-6

ORCID-linked labeled data for evaluating author name disambiguation at scale

Abstract: How can we evaluate the performance of a disambiguation method implemented on big bibliographic data? This study suggests that the open researcher profile system, ORCID, can be used as an authority source to label name instances at scale. This study demonstrates the potential by evaluating the disambiguation performances of Author-ity2009 (which algorithmically disambiguates author names in MEDLINE) using 3 million name instances that are automatically labeled through linkage to 5 million ORCID researcher prof…


Cited by 16 publications (5 citation statements)
References 54 publications
“…However, this facet has not received enough attention in previous datasets. To examine the gender distribution, we used Genni+Ethnea (Smith et al, 2013; Torvik & Agarwal, 2016), a widely used gender dataset containing 4,934,974 distinct names collected from PubMed (Kim & Owen‐Smith, 2021; Subramanian et al, 2021). We queried genders from Genni+Ethnea by author names to obtain the gender predictions of LAGOS‐AND and MAG.…”
Section: Results (mentioning)
confidence: 99%
“…Due to the limited name patterns, a small dataset will restrict the exploration of some data‐driven techniques. Note that, although some datasets such as GESIS‐DBLP, 6 SCAD‐zbMATH (Müller et al, 2017), and Kim‐PubMed (Kim & Owen‐Smith, 2021) have decent numbers of instances, they are limited in scopes (covered domains). For example, SCAD‐zbMATH is designed specifically for a mathematical domain database, zbMATH 7 .…”
Section: Related Work (mentioning)
confidence: 99%
“…This dataset was generated by the researchers at the University of Michigan Institute for Research on Innovation & Science (UM‐IRIS) through matching selected name instances in publication records to an authority database, ORCID (Kim & Owen‐Smith, 2021 ). First, author full names (e.g., “Brown, Michael”) that appear 50 times or more in MEDLINE‐indexed publications published between 2000 and 2019 were listed.…”
Section: Methods (mentioning)
confidence: 99%
“…Finally, editors who mandate ORCID from submitting, corresponding, or all authors as a prerequisite for submission to their journal need to be conscientious of the fact that this "integrity" tool for author identification and authentication is imperfect. Even though Kim and Owen-Smith (2021) heaped praise on ORCID, it is unclear how their large-scale meta-analysis of the ORCID database was unable to detect the accounts of fake authors, fraudulent authors, or false positives (i.e., identities that claim to be authors, but which are something or someone else, e.g., 35 cases reported in Teixeira da Silva (2021c)). To fortify the concerns, a search for "fruit", including for specific fruits, yielded a fruitful set of results (Table 1).…”
Section: Action Is Needed Now (mentioning)
confidence: 99%