“…They use a mix of techniques. While some use similarity functions [2,7,12,18,21,27,30], others use learning techniques [1,14,16,28,32,35], heuristics [17,19,20,24], classifiers [9,10,34] and clustering methods [11,31].…”
Section: Background and Related Work (citation type: mentioning)
confidence: 99%
“…In several cases, we cannot locate a given publication in the coauthor's CV due to differences in the title spelling (lines 1-3 of Algorithm 2). In this case, our heuristic attempts to retrieve it by using a set of attributes: the ID of the coauthor's CV, year of publication, volume, and number of first and last pages (lines 5-10). If the publication is not located, we use the most stable attributes, which are the ID of the coauthor's CV and the publication year (lines 12-14).…”
One way to measure the scientific progress of a country is to evaluate the curricula vitae (CVs) of its researchers. Brazil is no exception. The Lattes Platform is an information system whose primary objective is to provide a single repository for the CVs of Brazilian researchers. This system is becoming increasingly important as the main source of information regarding the Brazilian community of researchers, students, managers, and other actors in the national system of science, technology, and innovation. However, the integrity of this important tool for gauging the national bibliographic production may be compromised by ambiguities or referential inconsistencies in coauthoring citations. A first step towards solving this problem lies in identifying such inconsistencies. For that, we propose a heuristic-based approach that uses similarity search to match papers across the CVs of coauthors. We then use this technique to analyze over 2000 curricula of researchers from a given institution recovered from the Lattes Platform. The results indicate that 18.98% of the analyzed publications present referential inconsistencies, which is a significant amount for a dataset that is supposed to be correct and trustworthy.
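The three-stage fallback described in the quoted passage can be sketched roughly as follows. This is an illustration only, not the authors' Algorithm 2: the `Publication` record, the `locate` function, and the returned stage labels are hypothetical names, and exact comparison of normalized titles stands in for the similarity search that the paper uses.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Publication:
    """Hypothetical record for one publication entry in a CV."""
    cv_id: str
    title: str
    year: int
    volume: Optional[str] = None
    first_page: Optional[str] = None
    last_page: Optional[str] = None


def normalize(title: str) -> str:
    # Assumption: lowercase and collapse whitespace. The paper relies on
    # similarity search instead of exact comparison.
    return " ".join(title.lower().split())


def locate(pub: Publication, coauthor_cv_id: str,
           coauthor_pubs: List[Publication]) -> Tuple[Optional[Publication], str]:
    """Look for `pub` in the coauthor's CV; return (match, stage used)."""
    candidates = [p for p in coauthor_pubs if p.cv_id == coauthor_cv_id]

    # Stage 1: match on the title.
    for p in candidates:
        if normalize(p.title) == normalize(pub.title):
            return p, "title"

    # Stage 2: match on year, volume, and first and last pages.
    # Skipped when any of these attributes is missing in `pub`.
    key = (pub.year, pub.volume, pub.first_page, pub.last_page)
    if None not in key:
        for p in candidates:
            if (p.year, p.volume, p.first_page, p.last_page) == key:
                return p, "attributes"

    # Stage 3: fall back to the most stable attributes (CV ID and year).
    for p in candidates:
        if p.year == pub.year:
            return p, "year"

    return None, "not_found"
```

Returning the stage alongside the match makes it possible to treat weaker matches differently, for example by flagging a year-only match for manual review, since several publications in the same CV can share a year.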
“…A group of techniques use machine learning algorithms (Veloso et al 2012; D'Angelo et al 2011; Cota et al 2010; Treeratpituk and Giles 2009; Kang 2008). Levin et al (2012, 1031) (Ferreira et al 2010; Dai and Storkey 2009; Kang et al 2009b; Masada et al 2007), ontology-based method using properties (Kim et al 2011; Kim and Park 2009), and author profiling (Ferreira et al 2012b). Ferreira et al (2012a, 18-19) 2.1 Author disambiguation using unsupervised algorithm As a result, for a group of 5,332 authors with same names, they found 9,133 'real' individual authors.…”
Section: Review Of Author Name Disambiguation Techniques (citation type: mentioning)
In citation analysis, author names are often used as the unit of analysis, and some authors are indexed under the same name in the bibliographic databases from which citation counts are obtained. There are many techniques for author name disambiguation, using supervised, unsupervised, or semi-supervised learning algorithms. Unsupervised approaches use machine learning algorithms to extract the necessary bibliographic information from large-scale databases and digital libraries, while supervised approaches use manually built training datasets to cluster author groups and combine them with learning algorithms for author name disambiguation. The study examines various techniques for author name disambiguation in the hope of finding an aid to improve the precision of citation counts in citation analysis, as well as to obtain better results in information retrieval.
“…HHC disambiguates a set of citation records by successively fusing clusters of citation records with similar author names based on a real-world heuristic applied to their citation attributes. Then, we present SAND (Self-training Associative Name Disambiguator) [9,8]. SAND is a three-step self-training method for author name disambiguation that requires no manual labeling and no parameterization (in real world scenarios).…”
Name ambiguity in the context of bibliographic citation records is a hard problem that affects the quality of services and content in digital libraries and similar systems. This problem occurs when an author publishes works under distinct names or distinct authors publish works under similar names. The challenges of dealing with author name ambiguity have led to a myriad of name disambiguation methods. In this tutorial, we characterize such methods by means of a proposed taxonomy, present an overview of some of the most representative ones and discuss open challenges.