We present an architecture for natural language processing that parses an input sentence incrementally and merges information about its structure with a representation of visual input, thereby changing the results of parsing. At each step of incremental processing, the elements in the context representation are evaluated for how well they match the content of the sentence fragment up to that step. The information contained in the best-matching subset then influences the result of parsing the sentence fragment. As processing progresses and the sentence is extended with new words, the context is searched for new information that agrees with the expanded language input. This incremental approach to information fusion is highly adaptable with regard to the integration of dynamic knowledge extracted from a constantly changing environment.

I. MOTIVATION

Any agent, natural or artificial, that gains information from sensory perception of its surroundings has to fuse modality-specific information. This is especially relevant whenever we address such an agent through a natural language interface and refer to things that are perceived by visual sensors such as cameras. The system introduced in this paper merges information represented by the analyses of an incremental parser for German sentences with knowledge from a representation of visual context. Information integration of this kind is realized as a fusion of data in an abstract, non-metrical space based on the structural properties of the input from both modalities. This integration of external information can lead to interpretations of a sentence fragment that differ from those of an analysis that depends solely on a language model. Any system that processes natural language instructions referring to processes in a real-life environment has to cope with highly dynamic, evolving and ambiguous input from several modalities.
To do this, several requirements have to be fulfilled:

First, the language processing component should produce a structural representation of its interpretation. This result is necessary to link the content of an utterance with any non-linguistic context information. As purely syntactic properties of language are difficult to link to the content of a visual scene, any analysis of this kind needs to include a semantic interpretation of the linguistic information given.

Second, a system adequate for human-like interaction needs to parse its input in a human-like fashion. This means that processing does not wait until the whole sentence has been received but proceeds incrementally, starting as soon as the first word becomes available. Every incremental step (i.e. every time additional language input is received) should produce a partial structural output that can immediately be used to link linguistic and visual information.

Third, in order to integrate visual input, the system has to provide interfaces to external information sources that contribute cues to be fused with its language interpretations. An interface of this kin...
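The incremental matching step described in the abstract can be illustrated with a minimal Python sketch. It is not the system presented in the paper: the names `ContextElement`, `match_score`, `best_matching_subset` and `process_incrementally`, the concept-overlap score, the 0.5 threshold and the toy lexicon are all assumptions made for illustration. The sketch covers only the re-evaluation of context elements after each new word; it does not implement a parser, and English toy words stand in for the German input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextElement:
    """One entity or event from a hypothetical visual context model."""
    name: str
    concepts: frozenset


def match_score(element, fragment_concepts):
    """Fraction of the element's concepts mentioned in the fragment so far."""
    if not element.concepts:
        return 0.0
    return len(element.concepts & fragment_concepts) / len(element.concepts)


def best_matching_subset(context, fragment_concepts, threshold=0.5):
    """Context elements whose score reaches the (assumed) threshold."""
    return [e for e in context
            if match_score(e, fragment_concepts) >= threshold]


def process_incrementally(words, lexicon, context):
    """Yield (sentence prefix, matched element names) after each new word."""
    fragment_concepts = set()
    prefix = []
    for word in words:
        prefix.append(word)
        # Words missing from the toy lexicon contribute no concepts.
        fragment_concepts |= lexicon.get(word.lower(), set())
        matched = best_matching_subset(context, fragment_concepts)
        yield list(prefix), [e.name for e in matched]


if __name__ == "__main__":
    # Toy data, invented for this example.
    lexicon = {"man": {"MAN"}, "buys": {"BUY"}, "book": {"BOOK"}}
    context = [
        ContextElement("man_1", frozenset({"MAN"})),
        ContextElement("buy_event", frozenset({"MAN", "BUY", "BOOK"})),
        ContextElement("woman_1", frozenset({"WOMAN"})),
    ]
    words = ["the", "man", "buys", "a", "book"]
    for prefix, matched in process_incrementally(words, lexicon, context):
        print(" ".join(prefix), "->", matched)
```

In a full system, the matched subset would be passed to the parser as additional evidence at every step, which is how context could change the analysis of a sentence fragment. Here it is only reported.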