This article describes an unsupervised approach for automatic classification of scientific literature archived in digital libraries and repositories according to a standard library classification scheme. The method is based on identifying all the references cited in the document to be classified and, using the subject classification metadata of the extracted references as catalogued in existing conventional libraries, inferring the most probable class for the document itself with the help of a weighting mechanism. We have demonstrated the application of the proposed method and assessed its performance by developing a prototype software system for automatic classification of scientific documents according to the Dewey Decimal Classification (DDC) scheme. A dataset of one thousand research articles, papers, and reports from a well-known scientific digital library, CiteSeer, was used to evaluate the classification performance of the system. Detailed results of this experiment are presented and discussed.
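The inference step described above can be illustrated with a minimal sketch. The abstract does not specify the weighting mechanism, so the per-reference weights below are an assumption made purely for illustration, as are the function name `infer_ddc_class` and the convention of representing an uncatalogued reference as `None`.

```python
from collections import defaultdict


def infer_ddc_class(reference_classes, weights=None):
    """Pick the DDC class with the highest total weight among cited references.

    reference_classes: one DDC class string per cited reference, or None when
        the reference was not found in a library catalogue (assumed convention).
    weights: optional per-reference weights (hypothetical; the abstract does
        not say how weights are derived). Defaults to 1.0 for every reference.
    """
    scores = defaultdict(float)
    for i, ddc in enumerate(reference_classes):
        if ddc is None:
            continue  # uncatalogued references cast no vote
        weight = 1.0 if weights is None else weights[i]
        scores[ddc] += weight
    if not scores:
        return None  # no reference carried classification metadata
    # Break ties deterministically by class string.
    return max(scores.items(), key=lambda kv: (kv[1], kv[0]))[0]


if __name__ == "__main__":
    cited = ["006.3", "004.6", "006.3", None]
    print(infer_ddc_class(cited))
    print(infer_ddc_class(cited, weights=[1.0, 3.0, 1.0, 1.0]))
```

With uniform weights this reduces to a majority vote over the classes of the cited references; a non-uniform weighting lets some references count for more than others.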
Topical indexing of documents with keyphrases is a common method used for revealing the subject of scientific and research documents to both human readers and information retrieval tools, such as search engines. However, scientific documents that are manually indexed with keyphrases are still in the minority. This article describes a new unsupervised method for automatic keyphrase extraction from scientific documents which yields a performance on a par with human indexers. The method is based on identifying the references cited in the document to be indexed and using the keyphrases assigned to those references to generate a set of high-likelihood keyphrases for the document. We have evaluated the performance of the proposed method by using it to automatically index a third-party test set of research documents. Reported experimental results show that the performance of our method, measured in terms of consistency with human indexers, is competitive with that achieved by state-of-the-art supervised methods.
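The idea of deriving keyphrases from cited references can be sketched as a simple vote. The abstract does not describe how likelihood is computed, so the scoring used here (the number of references that carry a phrase) is an assumption for illustration only, as are the name `candidate_keyphrases` and the `top_k` parameter.

```python
from collections import Counter


def candidate_keyphrases(reference_keyphrases, top_k=5):
    """Rank keyphrases by how many cited references they are assigned to.

    reference_keyphrases: one list of keyphrase strings per cited reference.
    top_k: number of phrases to return (hypothetical parameter).
    """
    counts = Counter()
    for phrases in reference_keyphrases:
        # Normalise case and whitespace, and count each phrase once per reference.
        normalised = {p.strip().lower() for p in phrases}
        for phrase in normalised:
            if phrase:
                counts[phrase] += 1
    # Sort by descending count, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [phrase for phrase, _ in ranked[:top_k]]


if __name__ == "__main__":
    refs = [
        ["Text Mining", "classification"],
        ["classification", "digital libraries"],
        ["classification", "text mining"],
    ]
    print(candidate_keyphrases(refs, top_k=2))
```

Counting each phrase once per reference keeps a single reference with repeated keyphrases from dominating the ranking.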
Topical annotation of documents with keyphrases is a proven method for revealing the subject of scientific and research documents to both human readers and information retrieval systems. This article describes a machine learning-based keyphrase annotation method for scientific documents which utilizes Wikipedia as a thesaurus for candidate selection from the documents' content. We have devised a set of twenty statistical, positional, and semantic features for candidate phrases to capture and reflect various properties of those candidates which have the highest keyphraseness probability. We first introduce a simple unsupervised method for ranking and filtering the most probable keyphrases, and then evolve it into a novel supervised method using genetic algorithms. We have evaluated the performance of both methods on a third-party dataset of research papers. Reported experimental results show that the performance of our proposed methods, measured in terms of consistency with human annotators, is on a par with that achieved by humans and outperforms rival supervised and unsupervised methods.