2020
DOI: 10.1109/jstsp.2020.2987417

Direct Speech-to-Image Translation

Abstract: Direct speech-to-image translation without text is an interesting and useful topic due to its potential applications in human-computer interaction, art creation, computer-aided design, etc., not to mention that many languages have no written form. However, as far as we know, how to translate speech signals into images directly, and how well they can be translated, has not been well studied. In this paper, we attempt to translate speech signals into image signals without the transcription stage. Spe…

Citations: cited by 35 publications (34 citation statements)
References: 46 publications (156 reference statements)
“…In order to do so, we reimplemented StackGAN-v2 and replaced the text embedding with our speech embedding. Finally, we compare our results to the recently released speech-based model by [10] and the baseline used in that study, referred to as Classifier-based, in which the speech encoder was trained with the cross-entropy loss without using visual information. Specifically, in the Classifier-based method, a classifier layer is added after the speech encoder.…”
Section: B Results On the Synthesized Speech Databases (mentioning)
confidence: 99%
“…However, all the natural language-to-image generation research mentioned above is based on written language, i.e., text descriptions. The task most related to our work is presented in [10], in which the authors adopted the teacher-student structure to learn speech embeddings and used StackGAN-v2 [7] as the generator to generate images from the content of the speech signal. In our work, we will compare our method with this recently proposed method.…”
Section: A Natural Language-to-image Generation (mentioning)
confidence: 99%