The impact of imbalanced training data on machine learning for author name disambiguation

Kim, Jinseok; Kim, Jenna

doi:10.1007/s11192-018-2865-9

Cited by 37 publications

(25 citation statements)

References 37 publications

“…The impact of forename string on author name disambiguation is measured on four labeled data sets that have been widely used in many studies, separately or jointly (for example, Cota et al, ; Ferreira et al, ; Kim & Kim, ; Momeni & Mayr, ; Müller et al, ; Pereira et al, ; Santana, Gonçalves, Laender, & Ferreira, ; Shin et al, ; Wu, Li, Pei, & He, ; Zhu et al, ).…”

Section: Methods (mentioning)

confidence: 99%

“…PENN: Labeled for Han et al () by researchers at the Pennsylvania State University, this data set was originally comprised of 8,453 name instances with their coauthorship, article title, and venue information. As its original version contained duplication and labeling errors, several studies modified the data set before use (for example, Cota et al, ; Kim & Kim, ; Santana et al, ; Shin et al, ). This study reuses one of recent revisions by Kim () in which 5,018 name instances and their associated metadata are linked to DBLP records after deduplication and verification of correctness…”

Section: Methods (mentioning)

confidence: 99%

“…Two name instances in a block are compared for similarity over these four features as follows. Each name string is lower‐cased, converted into ASCII format, and segmented into an array of 2–4‐gram, following several studies (Han, Xu, Zha, & Giles, ; Kim & Kim, ; Kim, Kim, & Owen‐Smith, ; Louppe et al, ; Treeratpituk & Giles, ). For example, “Mark” is converted into a list of “ma,” “ar,” “rk,” “mar,” “ark,” and “mark.” Then a cosine similarity of the term frequency (TF) between the 2–4‐gram lists of two name instances is calculated as a forename similarity score for the instance pair.…”

Section: Methods (mentioning)

confidence: 99%
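
The forename comparison described in the statement above can be sketched in Python as follows. This is a minimal illustration, not the cited study's code: the function names `char_ngrams` and `forename_similarity` are hypothetical, and "converted into ASCII format" is assumed here to mean stripping diacritics through Unicode decomposition.

```python
import math
import unicodedata
from collections import Counter


def char_ngrams(name, n_min=2, n_max=4):
    """Lower-case a name, reduce it to ASCII, and list its 2-4 character n-grams."""
    # Assumption: ASCII conversion drops diacritics (e.g. "Müller" -> "muller").
    text = unicodedata.normalize("NFKD", name.lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


def forename_similarity(name_a, name_b):
    """Cosine similarity between the term-frequency vectors of two n-gram lists."""
    tf_a = Counter(char_ngrams(name_a))
    tf_b = Counter(char_ngrams(name_b))
    dot = sum(tf_a[g] * tf_b[g] for g in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        # A bare initial such as "M" yields no 2-4-grams, so no overlap is measurable.
        return 0.0
    return dot / (norm_a * norm_b)


print(char_ngrams("Mark"))
print(forename_similarity("Mark", "Marc"))
```

Under these assumptions, a single initial produces an empty n-gram list and therefore a similarity of zero against any full forename, which is one way to see why initialized forenames carry little signal for this feature.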

“…This prediction score is used as a similarity distance between the pair to be fed into a hierarchical agglomerative clustering algorithm, which groups name instances into a cluster if their distances are above a certain threshold. Following previous studies (Kim & Kim, ; Levin et al, ; Liu et al, ; Louppe et al, ; Torvik & Smalheiser, ), this threshold is decided by trying various distance values between 0 and 1 in each block and choosing one that produces the best clustering result (measured by B‐Cubed F1 explained below) for the block. Meanwhile, to evaluate how a heuristic performs in comparison with algorithmic disambiguation, name instances in test data that match on all available name (surname + forename) strings are assumed to refer to the same author.…”

Section: Methods (mentioning)

confidence: 99%
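
The clustering and per-block threshold search described above might look roughly like the Python sketch below. Several points are assumptions rather than details from the quoted text: the prediction score is treated as a similarity (pairs scoring at or above the threshold are merged), single linkage is used because the statement does not name a linkage criterion, and the names `cluster_by_threshold` and `best_threshold` are hypothetical. The clustering-quality measure is passed in as a function so that any metric, such as B-Cubed F1, can be plugged in.

```python
from itertools import combinations


def cluster_by_threshold(items, similarity, threshold):
    """Single-linkage agglomerative clustering cut at a similarity threshold.

    Cutting a single-linkage hierarchy at a threshold is equivalent to taking
    connected components of the graph whose edges are pairs scoring >= threshold,
    so a union-find structure is enough for this sketch.
    """
    parent = {item: item for item in items}

    def find(item):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    for a, b in combinations(items, 2):
        if similarity(a, b) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for item in items:
        clusters.setdefault(find(item), set()).add(item)
    return list(clusters.values())


def best_threshold(items, similarity, quality, candidates):
    """Try each candidate threshold in one block and keep the best-scoring one."""
    return max(
        candidates,
        key=lambda th: quality(cluster_by_threshold(items, similarity, th)),
    )


# Toy block: pairwise scores a classifier might output (made-up values).
scores = {
    frozenset(("a1", "a2")): 0.9,
    frozenset(("a1", "b1")): 0.2,
    frozenset(("a2", "b1")): 0.1,
}
block = ["a1", "a2", "b1"]
pair_score = lambda x, y: scores[frozenset((x, y))]
print(cluster_by_threshold(block, pair_score, 0.5))
```

Because the threshold is tuned against the labels of the same block it is evaluated on, the resulting scores are best read as an upper bound for a fixed classifier rather than as out-of-sample performance.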

“…These clusters predicted by a disambiguation method from a test data set are compared to truth clusters that are generated from labels of name instances in the same test data set. Disambiguation performance is evaluated by B‐Cubed (B³), following previous studies (Delgado, Martínez, Montalvo, & Fresno, ; Kim & Kim, ; Levin et al, ; Louppe et al, ; Momeni & Mayr, ; Müller et al, ; Qian, Zheng, Sakai, Ye, & Liu, ). This measure consists of three parts: B³ Recall (R), B³ Precision (P), and B³ F (F), defined as follows (Levin et al, ):

R = \frac{1}{N} \sum_{t \in T} \frac{|P(t) \cap T(t)|}{|T(t)|}

P = \frac{1}{N} \sum_{t \in T} \frac{|P(t) \cap T(t)|}{|P(t)|}

F = \frac{2 \times R \times P}{R + P}

Here, t is a name instance in truth clusters T.…”

Section: Methods (mentioning)

confidence: 99%
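
A small Python sketch of the three B-Cubed formulas quoted above follows. The quoted statement is truncated before it defines every symbol, so the sketch assumes the usual reading: N is the total number of name instances, T(t) is the truth cluster containing instance t, and P(t) is the predicted cluster containing t. The function name `b_cubed` is hypothetical.

```python
def b_cubed(predicted, truth):
    """Return (recall, precision, f) for two clusterings of the same instances.

    Each argument is an iterable of clusters, and each cluster is an iterable
    of hashable instance identifiers.
    """
    # Map every instance to the cluster that contains it.
    pred_of = {t: c for c in map(frozenset, predicted) for t in c}
    true_of = {t: c for c in map(frozenset, truth) for t in c}
    if pred_of.keys() != true_of.keys():
        raise ValueError("both clusterings must cover the same instances")

    n = len(true_of)  # assumed meaning of N: number of name instances
    recall = sum(len(pred_of[t] & true_of[t]) / len(true_of[t]) for t in true_of) / n
    precision = sum(len(pred_of[t] & true_of[t]) / len(pred_of[t]) for t in true_of) / n
    f = 2 * recall * precision / (recall + precision)
    return recall, precision, f


# Toy example: one truth cluster is split and part of it is merged with another author.
truth_clusters = [{1, 2, 3}, {4}]
predicted_clusters = [{1, 2}, {3, 4}]
print(b_cubed(predicted_clusters, truth_clusters))
```

In the toy example, splitting instance 3 away from its true cluster lowers recall, while merging it with instance 4 lowers precision, so both kinds of error are penalized per instance rather than per cluster.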

Effect of forename string on author name disambiguation

Kim

2019

Asso for Info Science & Tech

Self Cite

In author name disambiguation, author forenames are used to decide which name instances are disambiguated together and how likely they are to refer to the same author. Despite this crucial role, the effect of forenames on the performance of heuristic (string matching) and algorithmic disambiguation is not well understood. This study assesses the contributions of forenames in author name disambiguation using multiple labeled data sets under varying ratios and lengths of full forenames, reflecting real-world scenarios in which an author is represented by forename variants (synonyms) and some authors share the same forenames (homonyms). The results show that increasing the ratios of full forenames substantially improves both heuristic and machine-learning-based disambiguation. Performance gains by algorithmic disambiguation are pronounced when many forenames are initialized or homonyms are prevalent. As the ratios of full forenames increase, however, these gains become marginal compared to those by string matching. Using only a small portion of each forename string does not much reduce the performance of either heuristic or algorithmic disambiguation compared to using full-length strings. These findings provide practical suggestions, such as restoring initialized forenames into a full-string format via record linkage for improved disambiguation performance.

Ethnicity‐based name partitioning for author name disambiguation using supervised machine learning

Kim

Owen‐Smith

2021

Asso for Info Science & Tech

Self Cite

In several author name disambiguation studies, some ethnic name groups such as East Asian names are reported to be more difficult to disambiguate than others. This implies that disambiguation approaches might be improved if ethnic name groups are distinguished before disambiguation. We explore the potential of ethnic name partitioning by comparing the performance of four machine learning algorithms trained and tested on the entire data or specifically on individual name groups. Results show that ethnicity‐based name partitioning can substantially improve disambiguation performance because the individual models are better suited for their respective name group. The improvements occur across all ethnic name groups with different magnitudes. Performance gains in predicting matched name pairs outweigh losses in predicting nonmatched pairs. Feature (e.g., coauthor name) similarities of name pairs vary across ethnic name groups. Such differences may enable the development of ethnicity‐specific feature weights to improve prediction for specific ethnic name categories. These findings are observed for three labeled data sets with a natural distribution of problem sizes as well as one in which all ethnic name groups are controlled for the same sizes of ambiguous names. This study is expected to motivate scholars to group author names based on ethnicity prior to disambiguation.