Proceedings of the 2013 KDD Cup 2013 Workshop, 2013
DOI: 10.1145/2517288.2517293

Contextual rule-based feature engineering for author-paper identification

Abstract: We present the ideas and methodologies that we used to address the KDD Cup 2013 challenge on author-paper identification. We first formulate the problem as a personalized ranking task and then propose to solve it through a supervised learning framework. The key point is to eliminate the papers incorrectly assigned to a given author based on existing records. We choose Gradient Boosted Tree as our main classifier. Through our exploration we conclude that the most critical factor to achieve our results…
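The pipeline the abstract outlines (score each author-paper pair with a boosted-tree classifier, then rank each author's candidate papers by that score) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the two features (shared coauthor count, matching affiliation), the toy data, and the helper names `fit_stump`, `fit_gbt`, and `rank_papers` are hypothetical, and a small pure-Python booster over depth-1 trees stands in for whatever Gradient Boosted Tree implementation the authors used.

```python
import math

def fit_stump(X, residuals):
    """Least-squares depth-1 tree: best (feature, threshold, left_mean, right_mean)."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue  # a split must put samples on both sides
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, j, t, lm, rm)
    return best[1:] if best else None

def stump_predict(stump, row):
    j, t, lm, rm = stump
    return lm if row[j] <= t else rm

def fit_gbt(X, y, n_rounds=20, lr=0.5):
    """Gradient boosting on log-loss: each stump fits the residual y - sigmoid(score)."""
    scores = [0.0] * len(X)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - 1.0 / (1.0 + math.exp(-s)) for yi, s in zip(y, scores)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        stumps.append(stump)
        scores = [s + lr * stump_predict(stump, row) for s, row in zip(scores, X)]
    return stumps, lr

def rank_papers(model, candidates):
    """Map each author to their candidate paper ids, highest score first."""
    stumps, lr = model
    ranked = {}
    for author, papers in candidates.items():
        scored = [(sum(lr * stump_predict(s, f) for s in stumps), pid) for pid, f in papers]
        ranked[author] = [pid for _, pid in sorted(scored, key=lambda sp: -sp[0])]
    return ranked

# Toy training pairs (invented): [shared_coauthors, same_affiliation] -> 1 if truly authored.
X_train = [[3, 1], [2, 1], [0, 0], [1, 0], [4, 0], [0, 1]]
y_train = [1, 1, 0, 0, 1, 0]

model = fit_gbt(X_train, y_train)

# Candidate papers currently assigned to a hypothetical author "a1".
candidates = {"a1": [("p1", [3, 0]), ("p2", [0, 1]), ("p3", [2, 1])]}
ranking = rank_papers(model, candidates)
print(ranking)
```

Papers that land at the bottom of an author's ranking are the candidates for removal as incorrectly assigned. In practice a library implementation of gradient boosting with deeper trees and many more features would replace the stump learner shown here.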

Cited by 2 publications (5 citation statements)
References 10 publications
“…Being able to identify the real-world person based on the name appearing on a publication is a highly desirable feature but also a technically challenging problem. Author name disambiguation has drawn intensive research interests (Cock et al, 2013; Kanani, McCallum, & Pal, 2007; Li et al, 2015; Liu, Lei, Liu, Wang, & Han, 2013; Roy et al, 2013; Wick, Kobren, & McCallum, 2013; Zhang, Xinhua, Huang, & Yang, 2019; Zhang, Zhang, Yao, & Tang, 2018; Zhong et al, 2013), yet the state-of-the-art techniques, using only information in the publication data such as coauthorship, affiliations, and topics, typically do not yield high enough accuracy, especially for Asian or popular Western names. The reward of using these machine learning techniques is not high enough, so most systems have just used a simple name key (e.g., the author's last name prepended with first or middle initials, as in Google Scholar) to associate author names with publication clusters.…”
Section: Authors (mentioning)
confidence: 99%
“…• Supervised feature-based baselines. As widely used in similar author identification/disambiguation problems [12,13,34,9,33], this thread of methods first extract features for each pair of training data, and then applies supervised learning algorithm to learn some ranking/classification functions. Following them, we extract 20+ related features for each pair of paper and author in the training set (details can be found in appendix).…”
Section: Baselines and Experimental Settings (mentioning)
confidence: 99%
“…The problem of author identification has been briefly studied before [11]. And we also notice KDD Cup 2013 has similar author identification/disambiguation problem [12,13,34,9,33], where participants are asked to predict which paper is truly written by some author. However, different from the KDD Cup, our setting is different from them in the sense that (1) existing authors are unknown in our double-blind setting, and (2) we consider the reference of the paper, which is one of the most important sources of information.…”
Section: Related Work (mentioning)
confidence: 99%