Audio description from image by modal translation network

Ning, Hailong; Zheng, Xiangtao; Yuan, Yuan; Lu, Xiaoqiang

doi:10.1016/j.neucom.2020.10.053

Cited by 18 publications

(3 citation statements)

References 53 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…However, these methods are not suitable for large-scale RS image retrieval since they are based on low-level hand-crafted features. With the rapid development of artificial intelligent technology [22]- [26], a large number of uni-modal RS data retrieval methods based on deep learning have emerged. For instance, Tang et al [27] develop a two-stage re-ranking method to improve the retrieval performance.…”

Section: A Uni-modal Rs Data Retrieval Methodsmentioning

confidence: 99%

Semantics-Consistent Representation Learning for Remote Sensing Image-Voice Retrieval

Ning,

Zhao,

Yuan

2021

Preprint

Self Cite

View full text Add to dashboard Cite

With the development of earth observation technology, massive amounts of remote sensing (RS) images are acquired. To find useful information from these images, cross-modal RS image-voice retrieval provides a new insight. This paper aims to study the task of RS image-voice retrieval so as to search effective information from massive amounts of RS data. Existing methods for RS image-voice retrieval rely primarily on the pairwise relationship to narrow the heterogeneous semantic gap between images and voices. However, apart from the pairwise relationship included in the datasets, the intra-modality and non-paired inter-modality relationships should also be taken into account simultaneously, since the semantic consistency among non-paired representations plays an important role in the RS image-voice retrieval task. Inspired by this, a semantics-consistent representation learning (SCRL) method is proposed for RS imagevoice retrieval. The main novelty is that the proposed method takes the pairwise, intra-modality, and non-paired inter-modality relationships into account simultaneously, thereby improving the semantic consistency of the learned representations for the RS image-voice retrieval. The proposed SCRL method consists of two main steps: 1) semantics encoding and 2) semantics-consistent representation learning. Firstly, an image encoding network is adopted to extract high-level image features with a transfer learning strategy, and a voice encoding network with dilated convolution is devised to obtain high-level voice features. Secondly, a consistent representation space is conducted by modeling the three kinds of relationships to narrow the heterogeneous semantic gap and learn semantics-consistent representations across two modalities. Extensive experimental results on three challenging RS image-voice datasets, including Sydney, UCM and RSICD image-voice datasets, show the effectiveness of the proposed method.

show abstract

Section: A Uni-modal Rs Data Retrieval Methodsmentioning

confidence: 99%

Semantics-Consistent Representation Learning for Remote Sensing Image-Voice Retrieval

Ning,

Zhao,

Yuan

2021

Preprint

Self Cite

View full text Add to dashboard Cite

show abstract

“…In [28], the authors proposed a modal translation network (MT-Net) from visual to auditory senses to generate audio descriptions from images. They demonstrated that the proposed model could generate intelligible audio descriptions from visual images to a good extent.…”

Section: Deep Learning For Visual-auditory Conversionmentioning

confidence: 99%

Deep Learning-Based Optimization of Visual–Auditory Sensory Substitution

et al. 2023

View full text Add to dashboard Cite

Visual-auditory sensory substitution systems can aid blind people in traveling to various places and recognizing their own environments without help from others. Although several such systems have been developed, they are either not widely used or are limited to laboratory-scale research. Among various factors that hinder the widespread use of these systems, one of the most important issues to consider is the optimization of the algorithms for sensory substitution. This study is the first attempt at exploring the possibility of using deep learning for the objective quantification of sensory substitution. To this end, we used generative adversarial networks to investigate the possibility of optimizing the vOICe algorithm, a representative visual-auditory sensory substitution method, by controlling the parameters of the method for converting an image to sound. Furthermore, we explored the effect of the parameters on the conversion scheme for the vOICe system and performed frequency-range and frequency-mappingfunction experiments. The process of sensory substitution in humans was modeled to use generative models to assess the extent of visual perception from the substituted sensory signals. We verified the humanbased experimental results against the modeling results. The results suggested that deep learning could be used for evaluating the efficiency of algorithms for visual-auditory sensory substitutions without laborintensive human behavioral experiments. The introduction of deep learning for optimizing the visualauditory conversion method is expected to facilitate studies on various aspects of sensory substitution, such as generalization and estimation of algorithm efficiency.

show abstract

“…W ITH the high progress of remote sensing (RS) technology, RS images have shown a high-speed growth trend [1], [2]. Unearthing serviceable information from largescale RS images is very critical [3], [4]. Hence, many researchers pay attention to the research of remote sensing image retrieval (RSIR) because RSIR can quickly find effective information from large-scale RS images [5], [6].…”

Section: Introductionmentioning

confidence: 99%

Unsupervised Transformer Balanced Hashing for Multispectral Remote Sensing Image Retrieval

Chen

Wang

Lü

et al. 2023

IEEE J. Sel. Top. Appl. Earth Observations Remote Sensing

View full text Add to dashboard Cite

For remote sensing (RS) image retrieval task, hashing technology have been extensively researched in recent works. Unsupervised hashing approaches have attracted much attention in the RS data processing field because label collection takes a lot of time. Most of which fail to consider the interactions among the multichannel information of multispectral RS images and the disparity between the hash-like codes space and the Hamming space, which lead to the poor performance of multispectral RS image retrieval. In this article, we tackle these dilemmas with a novel unsupervised hashing approach, namely Unsupervised Transformer Balanced Hashing (UTBH), to utilize a convolutional variational autoencoder architecture with a novel RS transformer to perform effective hash codes learning. We first integrate a convolutional variational autoencoder architecture with a novel RS transformer, which can guide the interactions among the multichannel information of multispectral RS images. Meanwhile, a new objective function is proposed to preserve discrimination of hash codes in the hashing learning process and reduce the disparity between the hash-like codes space and the Hamming space effectively. Finally, experimental results on two multispectral RS image datasets indicate that UTBH approach achieves superior performance over other unsupervised image retrieval approaches.

show abstract

Audio description from image by modal translation network

Cited by 18 publications

References 53 publications

Semantics-Consistent Representation Learning for Remote Sensing Image-Voice Retrieval

Semantics-Consistent Representation Learning for Remote Sensing Image-Voice Retrieval

Deep Learning-Based Optimization of Visual–Auditory Sensory Substitution

Unsupervised Transformer Balanced Hashing for Multispectral Remote Sensing Image Retrieval

Contact Info

Product

Resources

About