Question-answering (QA) on video contents is a significant challenge for achieving human-level intelligence as it involves both vision and language in real-world settings. Here we demonstrate the possibility of an AI agent performing video story QA by learning from a large amount of cartoon videos. We develop a video-story learning model, i.e. Deep Embedded Memory Networks (DEMN), to reconstruct stories from a joint scene-dialogue video stream using a latent embedding space of observed data. The video stories are stored in a long-term memory component. For a given question, an LSTM-based attention model uses the long-term memory to recall the best question-story-answer triplet by focusing on specific words containing key information. We trained the DEMN on a novel QA dataset of children's cartoon video series, Pororo. The dataset contains 16,066 scene-dialogue pairs of 20.5-hour videos, 27,328 fine-grained sentences for scene description, and 8,913 story-related QA pairs. Our experimental results show that the DEMN outperforms other QA models. This is mainly due to 1) the reconstruction of video stories in a scene-dialogue combined form that utilize the latent embedding and 2) attention. DEMN also achieved state-of-the-art results on the MovieQA benchmark.
Abstract-The use of synthetic DNA molecules for computing provides various insights to evolutionary computation. A molecular computing algorithm to evolve DNA-encoded genetic patterns has been previously reported in [1], [2]. Here we improve on the previous work by studying the convergence behavior of the molecular evolutionary algorithm in the context of text classification problems. In particular, we study the error reduction behavior of the evolutionary learning algorithm, both theoretically and experimentally. The individuals represent decision lists of variable length and the whole population takes part in making probabilistic decisions. The evolutionary process is to change each individual towards correct classification of training data, which is based on an error minimization strategy. The evolved molecular classifiers show a performance competitive to the standard algorithms such as naïve Bayes and neural network classifiers on the data set we studied. The possibility of molecular implementation by use of DNA-encoded individuals combined with simple molecular operations on a very big population distinguishes this approach from other existing evolutionary algorithms.
Abstract. Humans can associate vision and language modalities and thus generate mental imagery, i.e. visual images, from linguistic input in an environment of unlimited inflowing information. Inspired by human memory, we separate a text-to-image retrieval task into two steps: 1) text-to-image conversion (generating visual queries for the 2 step) and 2) image-to-image retrieval task. This separation is advantageous for inner representation visualization, learning incremental dataset, using the results of content-based image retrieval. Here, we propose a visual query expansion method that simulates the capability of human associative memory. We use a hyperenetwork model (HN) that combines visual words and linguistic words. HNs learn the higher-order cross-modal associative relationships incrementally on a set of image-text pairs in sequence. An incremental HN generates images by assembling visual words based on linguistic cues. And we retrieve similar images with the generated visual query. The method is evaluated on 26 video clips of 'Thomas and Friends'. Experiments show the performance of successive image retrieval rate up to 98.1% with a single text cue. It shows the additional potential to generate the visual query with several text cues simultaneously.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2025 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.