Finding and extracting topic-specific information from free-text sources is an important task for classifying and distinguishing the content of information systems. Such compression of information, in which non-relevant parts of the text can be ignored, also benefits the further machine processing and evaluation of topic-specific documents. State-of-the-art approaches normally use well-trained modern Natural Language Processing (NLP) methods to solve such tasks. However, use cases can arise in which no suitable training data sets are available to adequately prepare or fine-tune the NLP methods used. In this paper, we detail a model-driven approach that applies an XML data model to an application-specific scenario and combines different NLP methods into a dynamic, automated NLP pipeline. The goal of this pipeline is the automatic extraction of specific information (related to certain domains or topics) from text documents, allowing structured further processing of this information. Specifically, we consider a scenario in which such information has to be aligned to a given information model that defines, e.g., the terms relevant for further processing. The approaches described here address a scenario in which information clusters on a specific topic can be obtained from a given data set, even without domain-specific model training. The basis is a dynamic (i.e., using different NLP methods and models) and fully automatic (i.e., using different topics at the same time) pipeline architecture combined with an XML data model. The presented approach details and extends our earlier work and provides new qualitative and first quantitative results.