ORCID-linked labeled data for evaluating author name disambiguation at scale

Kim, Jinseok; Owen‐Smith, Jason

doi:10.1007/s11192-020-03826-6

Cited by 16 publications

(5 citation statements)

References 54 publications


“…However, this facet has not received enough attention in previous datasets. To examine the gender distribution, we used Genni+Ethnea (Smith et al, 2013; Torvik & Agarwal, 2016), a widely used gender dataset containing 4,934,974 distinct names collected from PubMed (Kim & Owen‐Smith, 2021; Subramanian et al, 2021). We queried genders from Genni+Ethnea by author names to obtain the gender predictions of LAGOS‐AND and MAG.…”

Section: Results (mentioning)

confidence: 99%

“…Due to the limited name patterns, a small dataset will restrict the exploration of some data‐driven techniques. Note that, although some datasets such as GESIS‐DBLP, 6 SCAD‐zbMATH (Müller et al, 2017), and Kim‐PubMed (Kim & Owen‐Smith, 2021) have decent numbers of instances, they are limited in scopes (covered domains). For example, SCAD‐zbMATH is designed specifically for a mathematical domain database, zbMATH 7 .…”

Section: Related Work (mentioning)

confidence: 99%


LAGOS‐AND: A large gold standard dataset for scholarly author name disambiguation

Zhang

Yang

2022

Asso for Info Science & Tech


In this article, we present a method to automatically build large labeled datasets for the author ambiguity problem in the academic world by leveraging the authoritative academic resources, ORCID and DOI. Using the method, we built LAGOS‐AND, two large, gold‐standard sub‐datasets for author name disambiguation (AND), of which LAGOS‐AND‐BLOCK is created for clustering‐based AND research and LAGOS‐AND‐PAIRWISE is created for classification‐based AND research. Our LAGOS‐AND datasets are substantially different from the existing ones. The initial versions of the datasets (v1.0, released in February 2021) include 7.5 M citations authored by 798 K unique authors (LAGOS‐AND‐BLOCK) and close to 1 M instances (LAGOS‐AND‐PAIRWISE). And both datasets show close similarities to the whole Microsoft Academic Graph (MAG) across validations of six facets. In building the datasets, we reveal the variation degrees of last names in three literature databases, PubMed, MAG, and Semantic Scholar, by comparing author names hosted to the authors' official last names shown on the ORCID pages. Furthermore, we evaluate several baseline disambiguation methods as well as the MAG's author IDs system on our datasets, and the evaluation helps identify several interesting findings. We hope the datasets and findings will bring new insights for future studies. The code and datasets are publicly available.
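The distinction between a block dataset (for clustering) and a pairwise dataset (for classification) can be illustrated with a minimal Python sketch. It is not the LAGOS‐AND pipeline: the record layout, the `block_key` rule (last name plus first initial), the all-pairs-within-a-block rule, and the placeholder DOIs and ORCID iDs are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def block_key(name):
    """Hypothetical blocking rule: lowercase last name plus first initial."""
    last, _, first = name.partition(",")
    return f"{last.strip().lower()}_{first.strip()[:1].lower()}"


def build_blocks(records):
    """Group (doi, orcid) entries by the block key of the author name."""
    blocks = defaultdict(list)
    for doi, name, orcid in records:
        blocks[block_key(name)].append((doi, orcid))
    return dict(blocks)


def build_pairs(blocks):
    """Label every within-block pair: True when both share an ORCID iD."""
    pairs = []
    for members in blocks.values():
        for (doi_a, orcid_a), (doi_b, orcid_b) in combinations(members, 2):
            pairs.append((doi_a, doi_b, orcid_a == orcid_b))
    return pairs


# Placeholder records: (DOI, author name as printed, ORCID iD). All values are invented.
records = [
    ("10.1000/a", "Brown, Michael", "0000-0001"),
    ("10.1000/b", "Brown, M.", "0000-0001"),
    ("10.1000/c", "Brown, Mary", "0000-0002"),
    ("10.1000/d", "Lee, Ann", "0000-0003"),
]

blocks = build_blocks(records)
pairs = build_pairs(blocks)
```

Here the three "Brown, M…" records fall into one block, and the pairwise view turns that block into three labeled instances, only one of which is a match.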


“…This dataset was generated by the researchers at the University of Michigan Institute for Research on Innovation & Science (UM‐IRIS) through matching selected name instances in publication records to an authority database, ORCID (Kim & Owen‐Smith, 2021). First, author full names (e.g., “Brown, Michael”) that appear 50 times or more in MEDLINE‐indexed publications published between 2000 and 2019 were listed.…”

Section: Methods (mentioning)

confidence: 99%
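The first step quoted above (listing full names that appear 50 times or more in publications from 2000 to 2019) could be sketched roughly as follows. This is an illustrative sketch only: the `(year, names)` record layout and the function name `frequent_full_names` are assumptions, not the authors' code, and the later ORCID matching steps are not shown because the quoted statement is truncated.

```python
from collections import Counter


def frequent_full_names(records, min_count=50, first_year=2000, last_year=2019):
    """Return full names seen at least min_count times within the year range.

    records is an iterable of (publication_year, [author full names]);
    this layout is a hypothetical stand-in for MEDLINE records.
    """
    counts = Counter(
        name
        for year, names in records
        if first_year <= year <= last_year
        for name in names
    )
    return sorted(name for name, n in counts.items() if n >= min_count)


# Toy data: one name reaches the threshold, the other falls one short in range.
records = (
    [(2005, ["Brown, Michael"])] * 50
    + [(2010, ["Lee, Ann"])] * 49
    + [(1999, ["Lee, Ann"])] * 10  # outside 2000-2019, so not counted
)

selected = frequent_full_names(records)
```

Occurrences outside the year window are ignored before counting, so a name that is common only in earlier publications is not listed.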

Ethnicity‐based name partitioning for author name disambiguation using supervised machine learning

Kim

Owen‐Smith

2021

Asso for Info Science & Tech

Self Cite


In several author name disambiguation studies, some ethnic name groups such as East Asian names are reported to be more difficult to disambiguate than others. This implies that disambiguation approaches might be improved if ethnic name groups are distinguished before disambiguation. We explore the potential of ethnic name partitioning by comparing performance of four machine learning algorithms trained and tested on the entire data or specifically on individual name groups. Results show that ethnicity‐based name partitioning can substantially improve disambiguation performance because the individual models are better suited for their respective name group. The improvements occur across all ethnic name groups with different magnitudes. Performance gains in predicting matched name pairs outweigh losses in predicting nonmatched pairs. Feature (e.g., coauthor name) similarities of name pairs vary across ethnic name groups. Such differences may enable the development of ethnicity‐specific feature weights to improve prediction for specific ethnic name categories. These findings are observed for three labeled data with a natural distribution of problem sizes as well as one in which all ethnic name groups are controlled for the same sizes of ambiguous names. This study is expected to motivate scholars to group author names based on ethnicity prior to disambiguation.


“…Finally, editors who mandate ORCID from submitting, corresponding, or all authors as a prerequisite for submission to their journal need to be conscientious of the fact that this "integrity" tool for author identification and authentication is imperfect. Even though Kim and Owen-Smith (2021) heaped praise on ORCID, it is unclear how their large-scale meta-analysis of the ORCID database was unable to detect the accounts of fake authors, fraudulent authors, or false positives (i.e., identities that claim to be authors, but which are something or someone else, e.g., 35 cases reported in Teixeira da Silva (2021c). To fortify the concerns, a search for "fruit", including for specific fruits, yielded a fruitful set of results (Table 1).…”

Section: Action Is Needed Now (mentioning)

confidence: 99%

A dangerous triangularization of conflicting values in academic publishing: ORCID, fake authors, and risks with the lack of criminalization of the creators of fake elements

Silva

2021

EML


The mainstream publishing establishment is under attack from multiple known and unknown forces. This is neither hyperbole nor fantasy. Many academics may believe that the main threat lies with “predatory” journals or publishers, but this is not necessarily the case because such entities are not always easy to distinguish clearly from veritable scholarly journals or publishers. Moreover, there is a gray zone that may involve both predatory and exploitative qualities. Current submission systems are not fail-safe because they allow unscholarly or fraudulent elements to register and abuse them, for example for submitting fake research or falsified peer reports, while author identification tools like ORCID are imperfect and provide a platform for similar-minded individuals to “validate” themselves. This toxic mix of tools aimed at fortifying integrity, while allowing fake authors to breed, currently without many, or any, ethical or legal repercussions will rapidly erode the entire publishing landscape if serious legal action is not taken. The creation of fake papers by fake authors will eventually trickle down into valid literature, by virtue of the fact that cited literature cannot be thoroughly vetted, even in peer review. The integrity of valid scholarly venues is thus at high risk unless suitable, strict and ethically and legally enforceable preventative measures are implemented.
