Tagging Scientific Publications Using Wikipedia and Natural Language Processing Tools

Łopuszyński, Michał; Bolikowski, Łukasz

doi:10.1007/978-3-319-08425-1_3

Cited by 3 publications

(1 citation statement)

References 15 publications

“…This paper contains an extended version of the material presented on the Linking and Contextualizing Publications and Datasets Workshop, during the conference Theory and Practice of Digital Libraries 2013 [18].…”

Section: Introduction (mentioning)

confidence: 99%

Towards robust tags for scientific publications from natural language processing tools and Wikipedia

Łopuszyński, Bolikowski

2014

Int J Digit Libr

Self Cite

In this work, two simple methods of tagging scientific publications with labels reflecting their content are presented and compared. As a first source of labels, Wikipedia is employed. A second label set is constructed from the noun phrases occurring in the analyzed corpus. The corpus itself consists of abstracts from 0.7 million scientific documents deposited in the ArXiv preprint collection. We present a comparison of both approaches, which shows that the discussed methods are to a large extent complementary. Moreover, the results give interesting insights into the completeness of Wikipedia knowledge in various scientific domains. As a next step, we examine the statistical properties of the obtained tags. It turns out that both methods show a qualitatively similar rank-frequency dependence, which is best approximated by a stretched exponential curve. The number of distinct tags per document also follows the same distribution for both methods and is well described by the negative binomial distribution. The developed tags are meant for use as features in various text mining tasks. Therefore, as a final step, we show preliminary results on their application to topic modeling.
