Image captioning technology has great potential in many scenarios. However, current text-based image captioning methods cannot be applied to approximately half of the world's languages, as these languages lack a written form. To address this problem, the image-to-speech task was recently proposed, which generates spoken descriptions of images without any text, via an intermediate representation consisting of phonemes (image-to-phoneme). Here, we present a comprehensive study on the image-to-speech task in which 1) several representative image-to-text generation methods are implemented for the image-to-phoneme task, 2) objective metrics are sought to evaluate the image-to-phoneme task, and 3) an end-to-end image-to-speech model is proposed that synthesizes spoken descriptions of images bypassing both text and phonemes. Extensive experiments are conducted on the public benchmark database Flickr8k. The results of our experiments demonstrate that 1) state-of-the-art image-to-text models can perform well on the image-to-phoneme task, and 2) several evaluation metrics, including BLEU3, BLEU4, BLEU5, and ROUGE-L, can be used to evaluate image-to-phoneme performance. Finally, 3) end-to-end image-to-speech bypassing text and phonemes is feasible.