CoDraw: Collaborative Drawing as a Testbed for Grounded Goal-driven Communication

Kim, Jin-Hwa; Kitaev, Nikita; Chen, Xinlei; Rohrbach, Marcus; Zhang, Byoung-Tak; Tian, Yuandong; Batra, Dhruv; Parikh, Devi

doi:10.18653/v1/p19-1651

Cited by 46 publications

(61 citation statements)

References 45 publications

“…It is shown with minor modifications, Text2Scene can generate cartoon like, semantic layout, and real image like scenes. Dialogue based interaction is studied to control image synthesis, in order to improve complex scene generation progressively [219]-[223]. Meanwhile, text-to-image synthesis is extended to multiple images or videos, where visual consistency is required among the generated images [224]-[226].…”

Section: Other Topics (mentioning)

confidence: 99%

Multimodal Intelligence: Representation Learning, Information Fusion, and Applications

Zhang, Yang, et al. 2020

IEEE J. Sel. Top. Signal Process.

284

Deep learning has revolutionized speech recognition, image recognition, and natural language processing since 2010, each involving a single modality in the input signal. However, many applications in artificial intelligence involve more than one modality. It is therefore of broad interest to study the more difficult and complex problem of modeling and learning across multiple modalities. In this paper, a technical review of the models and learning methods for multimodal intelligence is provided. The main focus is the combination of vision and natural language, which has become an important area in both computer vision and natural language processing research communities. This review provides a comprehensive analysis of recent work on multimodal deep learning from three new angles: learning multimodal representations, the fusion of multimodal signals at various levels, and multimodal applications. On multimodal representation learning, we review the key concept of embedding, which unifies the multimodal signals into the same vector space and thus enables cross-modality signal processing. We also review the properties of the many types of embedding constructed and learned for general downstream tasks. On multimodal fusion, this review focuses on special architectures for the integration of the representation of unimodal signals for a particular task. On applications, selected areas of a broad interest in current literature are covered, including caption generation, text-to-image generation, and visual question answering. We believe this review can facilitate future studies in the emerging field of multimodal intelligence for the community.

“…For this task, we use the synthetic Collaborative Drawing (CoDraw) dataset [8], which is composed of sequences of images along with associated dialogue of instructions and linguistic feedback (Figure 2). Also, we introduce the Iterative CLEVR (i-CLEVR) dataset (Figure 4), a modified version of the Compositional Language and Elementary Visual Reasoning (CLEVR) [9] dataset, for incremental construction of CLEVR scenes based on linguistic instructions.…”

Section: GeNeVA Task and Datasets (mentioning)

confidence: 99%

“…The most similar task to GeNeVA is the task proposed by the CoDraw [8] authors. They require a model to build a scene by placing the clip art images of the individual objects in their correct positions.…”

Section: GeNeVA Task and Datasets (mentioning)

confidence: 99%

Tell, Draw, and Repeat: Generating and Modifying Images Based on Continual Linguistic Instruction

El-Nouby, Sharma, Schulz, et al. 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

108

Conditional text-to-image generation is an active area of research, with many possible applications. Existing research has primarily focused on generating a single image from available conditioning information in one step. One practical extension beyond one-step generation is a system that generates an image iteratively, conditioned on ongoing linguistic input or feedback. This is significantly more challenging than one-step generation tasks, as such a system must understand the contents of its generated images with respect to the feedback history, the current feedback, as well as the interactions among concepts present in the feedback history. In this work, we present a recurrent image generation model which takes into account both the generated output up to the current step as well as all past instructions for generation. We show that our model is able to generate the background, add new objects, and apply simple transformations to existing objects. We believe our approach is an important step toward interactive generation. Code and data is available at: https://www.microsoft.com/en-us/research/project/generative-neural-visual-artist-geneva/.

Sketch-Based Creativity Support Tools Using Deep Learning

Huang, Schoop, Ha, et al. 2021

Human–Computer Interaction Series
