Image-text retrieval is the task of searching for the textual descriptions that best match a given image and vice versa. One challenge of this task is its vulnerability to corruptions of the input image or text. Such corruptions are often unobserved during training and can degrade the retrieval model's decision quality substantially. In this paper, we propose a novel image-text retrieval technique, referred to as robust visual semantic embedding (RVSE), which consists of novel image-based and text-based augmentation techniques called semantic-preserving augmentation for image (SPAug-I) and text (SPAug-T). Since SPAug-I and SPAug-T modify the original data in a way that preserves its semantic information, they force the feature extractors to generate semantic-aware embedding vectors regardless of the corruption, improving the model's robustness significantly. Through extensive experiments on benchmark datasets, we show that RVSE outperforms conventional retrieval schemes in terms of image-text retrieval performance.

Index Terms—image-text retrieval, data augmentation, robustness, image and text corruption
INTRODUCTION

Recently, image-text retrieval, a task to find the images (sentences) that accurately describe a given sentence (image), has received special attention due to its wide range of applications such as image search, social networking service (SNS) hashtag/post generation, and semantic communication for the Internet of Things (IoT), to name just a few [1,2,3,4,5]. Since it is in general very difficult to compare samples obtained from two different modalities (image and text), a projection of the image and text into a common embedding space (a.k.a. the visual semantic embedding (VSE) space) is required [1,2,6,7,8,9]. To generate the image and text embedding vectors, deep learning (DL)-based image and text feature extractors (e.g., ResNet and BERT) have been widely used [10,11]. By comparing the obtained vectors, we can compute the similarity scores between available image-text pairs and then choose
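To make the embedding-and-similarity pipeline described above concrete, the following is a minimal sketch, not the RVSE implementation itself: it assumes pre-extracted ResNet pooled features and BERT sentence features, and the module and variable names (JointEmbedding, img_proj, txt_proj) are illustrative placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Projects image and text features into a shared VSE space (sketch)."""
    def __init__(self, img_dim=2048, txt_dim=768, embed_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)  # on top of ResNet pooled features
        self.txt_proj = nn.Linear(txt_dim, embed_dim)  # on top of BERT sentence features

    def forward(self, img_feat, txt_feat):
        # L2-normalize so that a dot product equals cosine similarity
        img_emb = F.normalize(self.img_proj(img_feat), dim=-1)
        txt_emb = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return img_emb, txt_emb

# Usage: similarity scores between a batch of images and a batch of captions
model = JointEmbedding()
img_feat = torch.randn(4, 2048)          # placeholder ResNet image features
txt_feat = torch.randn(4, 768)           # placeholder BERT text features
img_emb, txt_emb = model(img_feat, txt_feat)
sim = img_emb @ txt_emb.t()              # (4, 4) image-text similarity matrix
best_caption_per_image = sim.argmax(dim=1)  # highest-scoring caption for each image
```

In this sketch, retrieval reduces to ranking the rows (or columns) of the similarity matrix; the candidates with the highest cosine similarity to the query embedding are returned.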