As the availability of large, digital text corpora increases, so does the need for automatic methods to analyze them and to extract significant information from them. A number of algorithms have been developed for these applications, with topic modeling-based algorithms such as latent Dirichlet allocation (LDA) enjoying much recent popularity. In this paper, we focus on a specific but important problem in text analysis: identifying coherent lexical combinations that represent "chunks of thought" within the larger discourse. We term these salient semantic chunks (SSCs), and present two complementary approaches for their extraction. Both approaches derive from a cognitive rather than purely statistical perspective on the generation of texts. We apply the two algorithms to a corpus of abstracts from IJCNN 2009, and show that both find meaningful chunks that elucidate the semantic structure of the corpus in complementary ways.
The availability of unstructured text as a source of data has increased by orders of magnitude in the last few years, triggering extensive research in the automated processing and analysis of electronic texts. An especially important and difficult problem is the identification of salient words in a corpus, so that further processing can focus on these words without distraction by uninformative words. Standard lists of stop words are used to remove common words such as articles, pronouns and prepositions, but many other words that should be removed are much harder to identify because word salience is highly context-dependent. In this paper, we describe a neurodynamical approach for the context-dependent identification of salient words in large text corpora. The method, termed the Attractor Network-based Salient Word Extraction Rule (ANSWER), is modeled as a cognitive mechanism that identifies salient words based on their participation in coherent multi-word ideas. These ideas are, in turn, extracted via attractor dynamics in a recurrent neural network modeling the associative semantic graph of the corpus. The corpus used in this paper comprises the abstracts of all papers published in the proceedings of IJCNN 2009, 2011 and 2013. The list of salient words that the system generates is compared with those generated by other standard metrics, and is found to outperform all of them in almost all cases.
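To make the attractor-dynamics idea concrete, the following is a minimal sketch, not the authors' actual ANSWER implementation: a Hopfield-style recurrent network whose weights encode word co-occurrence in a toy corpus, where settling from a cue word into a fixed point recovers a coherent multi-word "idea". All data, thresholds, and function names here are illustrative assumptions.

```python
import numpy as np

# Toy corpus: each "document" is a short list of content words.
# (Illustrative data, not the IJCNN abstracts used in the paper.)
docs = [
    ["neural", "network", "learning"],
    ["neural", "network", "training"],
    ["support", "vector", "machine"],
    ["support", "vector", "classification"],
]

vocab = sorted({w for d in docs for w in d})
idx = {w: i for i, w in enumerate(vocab)}
n = len(vocab)

# Symmetric co-occurrence weights (no self-connections), playing the role
# of the associative semantic graph of the corpus.
W = np.zeros((n, n))
for d in docs:
    for a in d:
        for b in d:
            if a != b:
                W[idx[a], idx[b]] += 1.0

def settle(cue_words, steps=20, theta=0.5):
    """Run synchronous threshold dynamics from a cue until a fixed point,
    and return the set of active words (the recovered chunk)."""
    s = np.zeros(n)
    for w in cue_words:
        s[idx[w]] = 1.0
    for _ in range(steps):
        s_new = (W @ s > theta).astype(float)
        if np.array_equal(s_new, s):
            break
        s = s_new
    return {vocab[i] for i in range(n) if s[i] == 1.0}

chunk = settle(["neural"])
# The cue "neural" settles into the neural-network idea, while the
# unrelated support-vector words stay inactive.
```

Words that repeatedly participate in such stable attractors would then be ranked as salient; this toy version omits the inhibition, normalization, and scoring machinery a full system would need.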