Adversarial Inference for Multi-Sentence Video Description

Park, Jae Sung; Rohrbach, Marcus; Darrell, Trevor; Rohrbach, Anna

doi:10.1109/cvpr.2019.00676

Cited by 89 publications

(83 citation statements)

References 91 publications


“…If they are used in DL network directly, the training time will become extremely long, and much computational resources will be occupied due to the large number of layers. So, commonly, researchers apply the pre-trained ResNet on ImageNet dataset to extract visual features from images, and 3D ResNext on Kinetics dataset to extract spatio-temporal features from videos [11]. Then these features are fed to DL network as part of inputs.…”

Section: B Framework Architecture and Methods (mentioning)

confidence: 99%

“…It can be used to replace the 2D LSTM network. Many CV pieces of research have shown that if these techniques can be jointly applied to make full use of the visual data, better results can be obtained [9], [11]. So, a single proper CV technique or an adequate combination of several CV techniques are required to deal with a specific problem in wireless systems.…”

Section: B The Selection of CV Techniques (mentioning)

confidence: 99%


Applying Deep-Learning-Based Computer Vision to Wireless Communications: Methodologies, Opportunities, and Challenges

Tian, Pan, Alouini

2020

Preprint


Deep learning (DL) has obtained great success in computer vision (CV) field, and the related techniques have been widely used in security, healthcare, remote sensing, etc. On the other hand, visual data is universal in our daily life, which is easily generated by prevailing but low-cost cameras. Therefore, DL-based CV can be explored to obtain and forecast some useful information about the objects, e.g., the number, locations, distribution, motion, etc. Intuitively, DL-based CV can facilitate and improve the designs of wireless communications, especially in dynamic network scenarios. However, so far, it is rare to see such kind of works in the existing literature. Then, the primary purpose of this article is to introduce ideas of applying DL-based CV in wireless communications to bring some novel degrees of freedom for both theoretical researches and engineering applications. To illustrate how DL-based CV can be applied in wireless communications, an example of using DL-based CV to millimeter wave (mmWave) system is given to realize optimal mmWave multiple-input and multiple-output (MIMO) beamforming in mobile scenarios. In this example, we proposed a framework to predict the future beam indices from the previously-observed beam indices and images of street views by using ResNet, 3-dimensional ResNext, and long short term memory network. Experimental results show that our frameworks can achieve much higher accuracy than the baseline method, and visual data can help significantly improve the performance of MIMO beamforming system. Finally, we discuss the opportunities and challenges of applying DL-based CV in wireless communications.


“…The prevailing video captioning techniques often incorporate the encoder-decoder pipeline inspired by the first successful sequence-to-sequence model S2VT [30]. Benefitting from the rapid development of deep learning, video captioning models have achieved remarkable advances using attention mechanism [27,39,45], memory networks [3,14,21,31], reinforcement learning [13,20,33] and generative adversarial networks [19,42]. Although these encoder-decoder-based methods have reached impressive performance on automatic metrics, they often neglect how well the generated caption words (e.g., objects) are grounded in the video, making models less explainable and trustworthy.…”

Section: Related Work (mentioning)

confidence: 99%

“…Recently, video captioning [10], the task of automatically generating a sequence of natural-language words to describe a video, has drawn increasing attention [18,19,33,51]. However, these models are known to have poor grounding performance, which leads to objects hallucination [23].…”

Section: Introduction (mentioning)

confidence: 99%

Relational Graph Learning for Grounded Video Description Generation

Zhang, Wang, Tang, et al. 2020

Proceedings of the 28th ACM International Conference on Multimedia


Grounded video description (GVD) encourages captioning models to attend to appropriate video regions (e.g., objects) dynamically and generate a description. Such a setting can help explain the decisions of captioning models and prevents the model from hallucinating object words in its description. However, such design mainly focuses on object word generation and thus may ignore fine-grained information and suffer from missing visual concepts. Moreover, relational words (e.g., "jump left or right") are usual spatio-temporal inference results, i.e., these words cannot be grounded on certain spatial regions. To tackle the above limitations, we design a novel relational graph learning framework for GVD, in which a language-refined scene graph representation is designed to explore fine-grained visual concepts. Furthermore, the refined graph can be regarded as relational inductive knowledge to assist captioning models in selecting the relevant information it needs to generate correct words. We validate the effectiveness of our model through automatic metrics and human evaluation, and the results indicate that our approach can generate more fine-grained and accurate description, and it solves the problem of object hallucination to some extent.


“…It can replace the 2D LSTM network. Much CV research has shown that if these techniques are jointly applied to make full use of the visual data, better results can be obtained [9], [11].…”

Section: B Selecting CV Techniques (mentioning)

confidence: 99%

Applying Deep-Learning-Based Computer Vision to Wireless Communications: Methodologies, Opportunities, and Challenges

Tian, Pan, Alouini

2020

Preprint


Deep learning (DL) has seen great success in the computer vision (CV) field, and related techniques have been used in security, healthcare, remote sensing, and many other fields. As a parallel development, visual data has become universal in daily life, easily generated by ubiquitous low-cost cameras. Therefore, exploring DL-based CV may yield useful information about objects, such as their number, locations, distribution, motion, etc. Intuitively, DL-based CV can also facilitate and improve the designs of wireless communications, especially in dynamic network scenarios. However, so far, such work is rare in the literature. The primary purpose of this article, then, is to introduce ideas about applying DL-based CV in wireless communications to bring some novel degrees of freedom to both theoretical research and engineering applications. To illustrate how DL-based CV can be applied in wireless communications, an example of using a DL-based CV with a millimeter-wave (mmWave) system is given to realize optimal mmWave multiple-input and multiple-output (MIMO) beamforming in mobile scenarios. In this example, we propose a framework to predict future beam indices from previously observed beam indices and images of street views using ResNet, 3-dimensional ResNext, and a long short-term memory network. The experimental results show that our frameworks achieve much higher accuracy than the baseline method, and that visual data can significantly improve the performance of the MIMO beamforming system. Finally, we discuss the opportunities and challenges of applying DL-based CV in wireless communications.
