Keyphrases associated with research papers provide an effective way to find useful information in the large and growing scholarly digital collections. However, keyphrases are not always provided with the papers, but they need to be extracted from their content. In this paper, we explore keyphrase extraction formulated as sequence labeling and utilize the power of Conditional Random Fields in capturing label dependencies through a transition parameter matrix consisting of the transition probabilities from one label to the neighboring label. We aim at identifying the features that, by themselves or in combination with others, perform well in extracting the descriptive keyphrases for a paper. Specifically, we explore word embeddings as features along with traditional, document-specific features for keyphrase extraction. Our results on five datasets of research papers show that the word embeddings combined with document specific features achieve high performance and outperform strong baselines for this task.
Scholarly digital libraries provide access to scientific publications and comprise useful resources for researchers. CiteSeerX is one such digital library search engine that provides access to more than 10 million academic documents. We propose a novel search-driven approach to build and maintain a large collection of homepages that can be used as seed URLs in any digital library including CiteSeerX to crawl scientific documents. Precisely, we integrate Web search and classification in a unified approach to discover new homepages: first, we use publicly-available author names and research paper titles as queries to a Web search engine to find relevant content, and then we identify the correct homepages from the search results using a powerful deep learning classifier based on Convolutional Neural Networks. Moreover, we use Self-Training in order to reduce the labeling effort and to utilize the unlabeled data to train the efficient researcher homepage classifier. Our experiments on a large scale dataset highlight the effectiveness of our approach, and position Web search as an effective method for acquiring authors' homepages. We show the development and deployment of the proposed approach in CiteSeerX and the maintenance requirements.
Despite the advancements in search engine features, ranking methods, technologies, and the availability of programmable APIs, currentday open-access digital libraries still rely on crawl-based approaches for acquiring their underlying document collections. In this paper, we propose a novel search-driven framework for acquiring documents for such scientific portals. Within our framework, publiclyavailable research paper titles and author names are used as queries to a Web search engine. We were able to obtain ≈ 267, 000 unique research papers through our fullyautomated framework using ≈ 76, 000 queries, resulting in almost 200, 000 more papers than the number of queries. Moreover, through a combination of title and author name search, we were able to recover 78% of the original searched titles.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.