2020
DOI: 10.1108/dta-09-2019-0163

Entity deduplication in big data graphs for scholarly communication

Abstract: Purpose: Several online services offer functionalities to access information from “big research graphs” (e.g. Google Scholar, OpenAIRE, Microsoft Academic Graph), which correlate scholarly/scientific communication entities such as publications, authors, datasets, organizations, projects, funders, etc. Depending on the target users, access can vary from searching and browsing content to consuming statistics for monitoring and provision of feedback. Such graphs are populated over time as aggregations of multip…

Cited by 7 publications
(4 citation statements)
References 23 publications
“…In this context, ORCID is integrated as an "inverted list" of bibliographic records, each bearing ORCID iDs for the authors that claimed that record via their ORCID profile. All such records are harmonised and deduplicated (Manghi et al, 2020b) so as to build one bibliographic record out of the many describing the same research product but collected from different sources. Of particular interest to this investigation, is the fact that such "richer" metadata records, feature author/creator metadata which may bear a list of ORCID iDs, as collected from ORCID records and as collected from data sources (ORCID referrals).…”
Section: A Report On Orcid Misapplications (mentioning)
confidence: 99%
“…Many prominent clustering algorithms, such as k-means and k-median, use the number of clusters to output as an input. In de-duplication applications, this information is unknown [7] [8].…”
Section: Introduction (mentioning)
confidence: 99%
“…Clustering makes sure potentially equivalent records are grouped into clusters (aka “blocks”), within which the pair-wise similarity match will be quadratically applied. Blocking reduces the number of matches but, most importantly, allows for the parallel execution of the process across different blocks (Manghi et al, 2020). The sliding window technique further optimizes the number of matches within the individual blocks by sorting the records in such a way that similar records are likely kept close to each other and then matching each record with the “k” following records (“K-length window”).…”
Section: Introduction (mentioning)
confidence: 99%
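
The blocking and sliding-window steps described in the statement above can be illustrated with a minimal Python sketch. It is not the implementation used by Manghi et al. or by OpenAIRE: the clustering key (`blocking_key`, the first three alphanumeric characters of a title), the similarity function (`similar`, built on the standard library's `difflib.SequenceMatcher`), and the `window` and `threshold` defaults are all hypothetical choices made only for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def blocking_key(title: str) -> str:
    """Hypothetical clustering key: first three alphanumeric characters, lowercased."""
    return "".join(ch for ch in title.lower() if ch.isalnum())[:3]


def similar(a: str, b: str, threshold: float) -> bool:
    """Hypothetical pair-wise match: character-level similarity ratio against a threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def candidate_matches(titles, window=2, threshold=0.9):
    """Return pairs of titles judged equivalent, using blocking plus a sliding window."""
    # Blocking: group records that share a clustering key.
    blocks = defaultdict(list)
    for title in titles:
        blocks[blocking_key(title)].append(title)

    matches = []
    # Blocks are independent of each other, so this loop could run in parallel.
    for block in blocks.values():
        # Sort so that similar records are likely to end up near each other.
        ordered = sorted(block, key=str.lower)
        for i, record in enumerate(ordered):
            # Sliding window: compare each record only with the next `window` records.
            for other in ordered[i + 1 : i + 1 + window]:
                if similar(record, other, threshold):
                    matches.append((record, other))
    return matches


if __name__ == "__main__":
    sample = [
        "Entity deduplication in big data graphs",
        "Entity deduplication in big-data graphs",
        "Entropy in graphs",
        "Graph databases",
    ]
    for pair in candidate_matches(sample):
        print(pair)
```

With a window of length k, each block of n records needs at most n·k comparisons instead of the n·(n−1)/2 required by a full pair-wise match, at the cost of missing equivalent records that the sort order places more than k positions apart.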