Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards

Yang, Xuewen; Zhang, Heming; Jin, Di; Liu, Yingru; Tan, Jianchao; Xie, Dong; Wang, Jue; Wang, Xin

doi:10.48550/arxiv.2008.02693

Cited by 4 publications

(15 citation statements)

References 39 publications

(47 reference statements)

“…One striking evidence from the literature is that captioning is formalized in some studies as a multi-label classification problem (Chen et al, 2012;Yamaguchi et al, 2015;Chen et al, 2015;Sun et al, 2016) whereas others address the problem using an encoder-decoder model where a Convolutional Neural Network (CNN) is used to encode images and a Recurrent Neural Network (RNN), such as Long Short-Term Memory (LSTM), is used to decode a description containing desired attributes (Vinyals et al, 2015;Xu et al, 2015;Herdade et al, 2019;Yang et al, 2020). Yang et al (2020) are the first, at our knowledge, applying captioning to fashion images using the encoder-decoder structure. They suggest an improvement of state-of-the-art by introducing an Attribute-Level and Sentence-Level Semantic rewards as metrics to enhance generated descriptions relevance.…”

Section: Related Work

mentioning

confidence: 99%

“…Automatic caption generation from a given image could have several use cases like recommendations in editing applications, virtual assistants, image indexing, assisting visually impaired persons to understand the content of an image (Srinivasan and Sreekanthan, 2018). Image captioning is a challenging task and it recently drew lots of attention from researchers in CV (Vinyals et al, 2015;Wang et al, 2020;Yang et al, 2020).…”

mentioning

confidence: 99%

“…The work by Yang et al (2020) is one of the pioneering work applying image captioning to fashion images. The captions generated by Yang et al (2020) in their paper have both an objective part (description of attributes) and a subjective part aiming to embellish outfits descriptions to attract customers' attention and induce them to buy. The usefulness of this type of descriptions are very marketing oriented.…”

mentioning

confidence: 99%

Neural Fashion Image Captioning : Accounting for Data Diversity

Gilles, Noureini

2021

Preprint

“…A recent attempt to develop a fashion-oriented image captioning architecture was the model proposed by Yang et al [14], which employs an LSTM language model trained with two reward functions, one related to the generation of single attributes and one that covers the semantics of the entire sentence. In this respect, the approach introduced in this manuscript takes a different path and aims at generating unbiased descriptions by retrieving additional information from an external source of textual data.…”

Section: Introduction

mentioning

confidence: 99%

Fashion-Oriented Image Captioning with External Knowledge Retrieval and Fully Attentive Gates

Moratelli, Barraco, Morelli, et al.

2023

Sensors

Research related to fashion and e-commerce domains is gaining attention in computer vision and multimedia communities. Following this trend, this article tackles the task of generating fine-grained and accurate natural language descriptions of fashion items, a recently-proposed and under-explored challenge that is still far from being solved. To overcome the limitations of previous approaches, a transformer-based captioning model was designed with the integration of external textual memory that could be accessed through k-nearest neighbor (kNN) searches. From an architectural point of view, the proposed transformer model can read and retrieve items from the external memory through cross-attention operations, and tune the flow of information coming from the external memory thanks to a novel fully attentive gate. Experimental analyses were carried out on the fashion captioning dataset (FACAD) for fashion image captioning, which contains more than 130k fine-grained descriptions, validating the effectiveness of the proposed approach and the proposed architectural strategies in comparison with carefully designed baselines and state-of-the-art approaches. The presented method constantly outperforms all compared approaches, demonstrating its effectiveness for fashion image captioning.

“…short sleeve, cotton and jersey) rather than only the coarse representation (what, where) in the general domain. In this case, the current general VL models [9,60] are sub-optimal for fashion-based tasks [1,26,67], and could be unfavorable when deploying global features based models to attribute-aware tasks, such as searching for a specific fashion captioning [75] and fashion catalog/object [15], where it is essential to extract fine-grained features or similarities [65] from image and text.…”

mentioning

confidence: 99%