In this paper, we present a method to automatically identify diseases from videos of gastrointestinal (GI) tract examinations using a Deep Convolutional Neural Network (DCNN) that processes images from digital endoscopes. Our goal is to aid domain experts by automatically detecting abnormalities and generating a report that summarizes the main findings. We have implemented a model that uses two different DCNN architectures to generate our predictions, both of which are also capable of running on a mobile device. Using this architecture, we are able to predict findings on individual images. Combined with class activation maps (CAMs), we can also automatically generate a textual report describing a video in detail while giving hints about the spatial location of findings and anatomical landmarks. Our work shows one way to use a multi-disease detection pipeline to also generate video reports that summarize key findings.
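As a minimal sketch of how class activation maps can localize a finding, the following function assumes a classifier with a global-average-pooling head; the array names and shapes are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """Compute a CAM for one class from the final convolutional features.

    feature_maps: (K, H, W) activations of the last conv layer
    fc_weights:   (num_classes, K) weights of the GAP classifier head
    class_idx:    index of the predicted finding
    """
    # Weighted sum of the K feature maps with the class-specific weights.
    cam = np.tensordot(fc_weights[class_idx], feature_maps, axes=1)  # (H, W)
    cam = np.maximum(cam, 0)            # keep positive evidence only
    cam = cam / (cam.max() + 1e-8)      # normalize to [0, 1]
    # Upsample to the input image size to highlight where the finding lies.
    return cam
```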
Automatically captioning images with natural language sentences is an important research topic. State-of-the-art models are able to produce human-like sentences. These models typically describe the depicted scene as a whole and do not target specific objects of interest or emotional relationships between these objects in the image. However, marketing companies need descriptions of exactly these attributes of a given scene. In our case, objects of interest are consumer goods, which are usually identifiable by a product logo and are associated with certain brands. From a marketing point of view, it is desirable to also evaluate the emotional context of a trademarked product, i.e., whether it appears in a positive or a negative connotation. We address the problem of finding brands in images and deriving corresponding captions by introducing a modified image captioning network. We also add a third output modality, which simultaneously produces real-valued image ratings. Our network is trained using a classification-aware loss function in order to stimulate the generation of sentences with an emphasis on words identifying the brand of a product. We evaluate our model on a dataset of images depicting interactions between humans and branded products. The introduced network improves mean class accuracy by 24.5 percent. Thanks to the third output modality, it also considerably improves the quality of generated captions for images depicting branded products.
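The sketch below shows one plausible way to combine a caption loss with a brand-classification term and the real-valued rating output; the weighting factors, head names, and padding index are assumptions for illustration and not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def multitask_loss(caption_logits, caption_targets,
                   brand_logits, brand_targets,
                   rating_pred, rating_targets,
                   lambda_brand=1.0, lambda_rating=0.5):
    """Caption cross-entropy plus brand-classification and rating terms.

    caption_logits: (B, T, V) per-token vocabulary scores
    brand_logits:   (B, num_brands) scores of the brand classification head
    rating_pred:    (B,) real-valued image rating (third output modality)
    """
    caption_loss = F.cross_entropy(
        caption_logits.reshape(-1, caption_logits.size(-1)),
        caption_targets.reshape(-1),
        ignore_index=0,  # assumed padding index
    )
    # Classification-aware term: rewards captions whose encoder also
    # identifies the correct brand.
    brand_loss = F.cross_entropy(brand_logits, brand_targets)
    rating_loss = F.mse_loss(rating_pred, rating_targets)
    return caption_loss + lambda_brand * brand_loss + lambda_rating * rating_loss
```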
Most human pose estimation datasets have a fixed set of keypoints. Hence, trained models are only capable of detecting these defined points. Adding new points to the dataset requires a full retraining of the model. We present a model based on the Vision Transformer architecture that can detect these fixed points and arbitrary intermediate points without any computational overhead during inference time. Furthermore, independently detected intermediate keypoints can improve analyses derived from the keypoints, such as the calculation of body angles. Our approach is based on TokenPose [9] and replaces the fixed keypoint tokens with an embedding that maps human-readable keypoint vectors to keypoint tokens. For ski jumpers, who benefit from intermediate detections especially on their skis, this model achieves the same performance as TokenPose on the fixed keypoints and can detect any intermediate keypoint directly.
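To illustrate the idea of deriving keypoint tokens from human-readable keypoint vectors instead of a fixed learned token set, the sketch below uses a small MLP; the module name, vector encoding, and layer sizes are hypothetical and may differ from the paper's implementation.

```python
import torch
import torch.nn as nn

class KeypointVectorEmbedding(nn.Module):
    """Map a human-readable keypoint description vector to a token.

    A fixed keypoint (e.g. 'left knee') or an intermediate point
    (e.g. 'midpoint of the left ski') is described by a small
    real-valued vector; the same module produces tokens for both,
    so new intermediate points need no retraining of the token set.
    """
    def __init__(self, vec_dim: int, token_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vec_dim, token_dim),
            nn.GELU(),
            nn.Linear(token_dim, token_dim),
        )

    def forward(self, keypoint_vectors: torch.Tensor) -> torch.Tensor:
        # keypoint_vectors: (num_keypoints, vec_dim)
        # returns keypoint tokens: (num_keypoints, token_dim)
        return self.mlp(keypoint_vectors)
```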
Video-to-text (VTT) is the task of automatically generating descriptions for short audio-visual video clips. It can help visually impaired people to understand scenes shown in a YouTube video, for example. Transformer architectures have shown great performance in both machine translation and image captioning. In this work, we transfer promising approaches from image captioning and video processing to VTT and develop a straightforward Transformer architecture. We then extend this Transformer with a novel way of synchronizing audio and video features in Transformers, which we call Fractional Positional Encoding (FPE). We run multiple experiments on the VATEX dataset, improve the CIDEr and BLEU-4 scores by 21.72 and 8.38 points compared to a vanilla Transformer network, and achieve state-of-the-art results on the MSR-VTT and MSVD datasets. Our novel FPE also helps increase the CIDEr score by a relative 8.6 %.
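The following sketch shows one plausible reading of Fractional Positional Encoding: a standard sinusoidal encoding evaluated at real-valued positions derived from timestamps, so audio and video tokens sampled at different rates share a common time axis. The exact definition in the paper may differ; this is an illustrative assumption.

```python
import torch

def fractional_positional_encoding(timestamps: torch.Tensor, d_model: int) -> torch.Tensor:
    """Sinusoidal encoding at real-valued (fractional) positions.

    timestamps: (N,) token positions in seconds for audio or video tokens;
                using time instead of integer indices keeps the two
                modalities aligned despite different sampling rates.
    returns:    (N, d_model) positional encodings (d_model assumed even)
    """
    positions = timestamps.unsqueeze(1)  # (N, 1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32)
        * (-torch.log(torch.tensor(10000.0)) / d_model)
    )  # (d_model / 2,)
    pe = torch.zeros(timestamps.size(0), d_model)
    pe[:, 0::2] = torch.sin(positions * div_term)
    pe[:, 1::2] = torch.cos(positions * div_term)
    return pe
```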