Keyword extraction is a fundamental problem in text data mining and document processing. Many document processing applications depend directly on the quality and speed of keyword extraction algorithms. In this article, a novel approach to rapid change detection in data streams and documents is developed. It is based on ideas from image processing, and especially on the Helmholtz Principle from the Gestalt Theory of human perception. Applied to the problem of keyword extraction, it delivers fast and effective tools for identifying meaningful keywords using parameter-free methods. We also define a level of meaningfulness for keywords, which can be used to adjust the set of keywords to application needs.
HP Laboratories, HPL-2010-133
Keywords: extraction, feature extraction, unusual behavior detection, Helmholtz principle, mining textual and unstructured datasets
We present novel algorithms for feature extraction and change detection in unstructured data, primarily in textual and sequential data. Keyword and feature extraction is a fundamental problem in text data mining and document processing. A majority of document processing applications directly depend on the quality and speed of keyword extraction algorithms.
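The abstract does not state a scoring formula, so the following is only a minimal sketch of how a Helmholtz-style test for meaningful keywords might look. It assumes the document is split into N equal-sized parts and scores a word that occurs m times in one part and K times overall by an expected "number of false alarms", C(K, m) / N^(m-1), under uniformly random placement. The function name `meaningful_keywords`, the parameter `epsilon`, and the scoring formula itself are assumptions made for illustration, not the authors' confirmed method.

```python
import math
from collections import Counter


def meaningful_keywords(parts, epsilon=1.0):
    """Return {word: [(part_index, nfa), ...]} for words that look non-random.

    parts   -- list of token lists, assumed to be of roughly equal size
    epsilon -- assumed threshold on the number of false alarms; smaller
               values keep only the more strongly concentrated words
    """
    n = len(parts)
    # Occurrences of each word in the whole document (K in the lead-in).
    total = Counter(tok for part in parts for tok in part)
    result = {}
    for i, part in enumerate(parts):
        # Occurrences of each word inside this part (m in the lead-in).
        for word, m in Counter(part).items():
            k = total[word]
            # Expected number of m-fold coincidences in a single part if the
            # k occurrences were scattered uniformly over the n parts.
            nfa = math.comb(k, m) / n ** (m - 1)
            if nfa < epsilon:
                result.setdefault(word, []).append((i, nfa))
    return result


if __name__ == "__main__":
    doc = [
        ["a", "b", "a", "a"],
        ["c", "d", "e", "f"],
        ["g", "h", "i", "b"],
        ["j", "k", "l", "m"],
    ]
    print(meaningful_keywords(doc))
```

With this scoring, a word that occurs only once in a part can never fall below `epsilon = 1.0`, so only words that cluster unexpectedly are reported. Lowering `epsilon` plays the role of a stricter level of meaningfulness in this sketch.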
In this paper we describe the possibility of constructing the well-known small world topology for an ordinary document, based on the actual document structure. Sentences in such a graph are represented by nodes, which are connected if and only if the corresponding sentences are neighbors or share at least one common keyword. This graph is built using a carefully selected one-parameter set of keywords. By varying this parameter (the level of meaningfulness) we transition the document-representing graph from a trivial path graph into a large random graph. During such a conversion, as the parameter is varied over its range, the graph becomes a small world. This in turn opens the possibility of applying many well-established ranking algorithms to the problem of ranking sentences and paragraphs in text documents. These rankings are, in turn, crucial for document understanding, summarization and information extraction. These graphs can also serve as a source of interesting small world graphs for the theory of complex networks.
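The graph construction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: sentences are already tokenized, the keyword set has been chosen elsewhere for some level of meaningfulness, and "neighbors" means consecutive sentences. The function name `sentence_graph` and the edge-set representation are hypothetical choices for the example.

```python
from itertools import combinations


def sentence_graph(sentences, keywords):
    """Build the undirected sentence graph as a set of edges (i, j), i < j.

    sentences -- list of token lists, one per sentence, in document order
    keywords  -- set of keywords selected for a given level of meaningfulness
    """
    # Keywords actually present in each sentence.
    kw_sets = [set(tokens) & keywords for tokens in sentences]
    edges = set()
    for i, j in combinations(range(len(sentences)), 2):
        adjacent = j - i == 1                      # neighboring sentences
        shared = bool(kw_sets[i] & kw_sets[j])     # at least one common keyword
        if adjacent or shared:
            edges.add((i, j))
    return edges


if __name__ == "__main__":
    doc = [
        ["graph", "theory"],
        ["small", "world"],
        ["random", "walk"],
        ["graph", "rank"],
    ]
    print(sorted(sentence_graph(doc, set())))        # path graph only
    print(sorted(sentence_graph(doc, {"graph"})))    # adds a long-range edge
```

With an empty keyword set the result is the path graph over consecutive sentences. As the keyword set grows, long-range edges appear between sentences that share keywords, which matches the transition from a path graph toward a denser graph described in the abstract.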