Direct Speech-to-Image Translation

Li, Jiguo; Zhang, Xinfeng; Jia, Chuanmin; Xu, Jizheng; Zhang, Li; Wang, Yue; Ma, Siwei; Gao, Wen

doi:10.1109/jstsp.2020.2987417

Cited by 35 publications

(34 citation statements)

References 46 publications

(156 reference statements)

“…In order to do so, we reimplemented StackGAN-v2 and replaced the text embedding with our speech embedding. Finally, we compare our results to the recently released speech-based model by [10] and the baseline used in that study, referred to as Classifier-based, in which the speech encoder was trained with the cross-entropy loss without using visual information. Specifically, in the Classifier-based method, a classifier layer is added after the speech encoder.…”

Section: B Results On the Synthesized Speech Databases (mentioning)

confidence: 99%

“…However, all the natural language-to-image generation research mentioned above is based on written language, i.e., text descriptions. The task most related to our work is presented in [10], in which the authors adopted the teacher-student structure to learn speech embeddings and used StackGAN-v2 [7] as the generator to generate images from the content of the speech signal. In our work, we will compare our method with this recently proposed method.…”

Section: A Natural Language-to-image Generation (mentioning)

confidence: 99%

“…In this database, each image has one corresponding spoken caption. The work by [10], which is most closely related to our work, uses the Places database. In order to make a direct comparison with [10] on the task of S2IG, we use the same subset of the Places database as used in [10].…”

Section: Speech-to-image Generation With S2IGAN (mentioning)

confidence: 99%

“…The work by [10], which is most closely related to our work, uses the Places database. In order to make a direct comparison with [10] on the task of S2IG, we use the same subset of the Places database as used in [10]. Specifically, this subset (referred to as the Places-subset hereafter) consists of 7 scene categories: bedroom, dinette, dining room, home office, hotel room, kitchenette, and living room, with a total of 13,803 image-spoken caption pairs for training and 2,870 image-spoken caption pairs for testing.…”

Section: Speech-to-image Generation With S2IGAN (mentioning)

confidence: 99%

“…1 illustrates this new task. This task is similar to the independently and simultaneously proposed task of speech-to-image translation task [10].…”

Section: Introduction (mentioning)

confidence: 99%

Generating Images From Spoken Descriptions

Wang, Qiao, Zhu, et al. 2021

IEEE/ACM Trans. Audio Speech Lang. Process.

Text-based technologies, such as text translation from one language to another, and image captioning, are gaining popularity. However, approximately half of the world's languages are estimated to be lacking a commonly used written form. Consequently, these languages cannot benefit from text-based technologies. This paper presents 1) a new speech technology task, i.e., a speech-to-image generation (S2IG) framework which translates speech descriptions to photo-realistic images 2) without using any text information, thus allowing unwritten languages to potentially benefit from this technology. The proposed speech-to-image framework, referred to as S2IGAN, consists of a speech embedding network and a relation-supervised densely-stacked generative model. The speech embedding network learns speech embeddings with the supervision of corresponding visual information from images. The relation-supervised densely-stacked generative model synthesizes images, conditioned on the speech embeddings produced by the speech embedding network, that are semantically consistent with the corresponding spoken descriptions. Extensive experiments are conducted on four public benchmark databases: two databases that are commonly used in text-to-image generation tasks, i.e., CUB-200 and Oxford-102 for which we created synthesized speech descriptions, and two databases with natural speech descriptions which are often used in the field of cross-modal learning of speech and images, i.e., Flickr8k and Places. Results on these databases demonstrate the effectiveness of the proposed S2IGAN on synthesizing high-quality and semantically-consistent images from the speech signal, yielding a good performance and a solid baseline for the S2IG task.
