Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards
Preprint, 2020. DOI: 10.48550/arxiv.2008.02693

Abstract: Generating accurate descriptions for online fashion items is important not only for enhancing customers' shopping experiences, but also for increasing online sales. Besides correctly presenting the attributes of items, descriptions written in an enchanting style can better attract customer interest. The goal of this work is to develop a novel learning framework for accurate and expressive fashion captioning. Unlike popular work on image captioning, it is hard to identify and describe the…

Cited by 4 publications (15 citation statements). References 39 publications (47 reference statements).

“…One striking observation from the literature is that captioning is formalized in some studies as a multi-label classification problem (Chen et al, 2012; Yamaguchi et al, 2015; Chen et al, 2015; Sun et al, 2016), whereas others address the problem with an encoder-decoder model, in which a Convolutional Neural Network (CNN) encodes the image and a Recurrent Neural Network (RNN), such as a Long Short-Term Memory (LSTM), decodes a description containing the desired attributes (Vinyals et al, 2015; Xu et al, 2015; Herdade et al, 2019; Yang et al, 2020). Yang et al (2020) are, to our knowledge, the first to apply captioning to fashion images using the encoder-decoder structure. They improve on the state of the art by introducing attribute-level and sentence-level semantic rewards as metrics to enhance the relevance of generated descriptions.…”
Section: Related Work
confidence: 99%
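The CNN-LSTM encoder-decoder pattern this statement describes can be sketched in a few lines. The model choice (ResNet-50), dimensions, and the image-as-first-token conditioning below are illustrative assumptions, not the cited papers' exact architectures.

```python
# Minimal CNN encoder + LSTM decoder captioner (illustrative sketch).
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # CNN encoder: a pretrained ResNet with its classifier head removed.
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.encoder = nn.Sequential(*list(resnet.children())[:-1])
        self.img_proj = nn.Linear(resnet.fc.in_features, embed_dim)
        # LSTM decoder: predicts the next token given the image and the prefix.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        with torch.no_grad():                        # keep the CNN frozen
            feats = self.encoder(images).flatten(1)  # (B, 2048)
        img_emb = self.img_proj(feats).unsqueeze(1)  # (B, 1, E)
        tok_emb = self.embed(captions)               # (B, T, E)
        # Feed the image embedding as the first "token" of the sequence.
        seq = torch.cat([img_emb, tok_emb], dim=1)
        hidden, _ = self.lstm(seq)
        return self.out(hidden)                      # (B, T+1, vocab)
```

Trained with cross-entropy on next-token prediction, this is the standard supervised baseline that reward-based methods such as Yang et al (2020) then refine.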
“…Automatic caption generation from a given image has several use cases, such as recommendations in editing applications, virtual assistants, image indexing, and helping visually impaired persons understand the content of an image (Srinivasan and Sreekanthan, 2018). Image captioning is a challenging task and has recently drawn considerable attention from researchers in computer vision (Vinyals et al, 2015; Wang et al, 2020; Yang et al, 2020).…”
confidence: 99%
“…A recent attempt to develop a fashion-oriented image captioning architecture was the model proposed by Yang et al [14], which employs an LSTM language model trained with two reward functions, one related to the generation of single attributes and one that covers the semantics of the entire sentence. In this respect, the approach introduced in this manuscript takes a different path and aims at generating unbiased descriptions by retrieving additional information from an external source of textual data.…”
Section: Introduction
confidence: 99%
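The two-reward training scheme described here can be pictured as a policy-gradient update whose reward mixes an attribute-level term with a sentence-level term. The sketch below is a generic REINFORCE-style formulation under assumed names: the weighting `alpha`, the reward functions, and the stand-in `sentence_score` are illustrative, not the paper's exact formulation.

```python
# Policy-gradient update with a mixed semantic reward (illustrative sketch).
import torch

def attribute_reward(caption, gold_attributes):
    # Attribute-level term: fraction of ground-truth attributes the caption mentions.
    hits = sum(attr in caption for attr in gold_attributes)
    return hits / max(len(gold_attributes), 1)

def mixed_reward(caption, gold_attributes, sentence_score, alpha=0.5):
    # sentence_score stands in for any sentence-level semantic scorer;
    # alpha is a hypothetical weighting between the two terms.
    return (alpha * attribute_reward(caption, gold_attributes)
            + (1 - alpha) * sentence_score)

def reinforce_loss(log_probs, reward, baseline=0.0):
    # log_probs: (T,) log-probabilities of the sampled caption tokens.
    # Subtracting a baseline reduces gradient variance (REINFORCE).
    return -(reward - baseline) * log_probs.sum()

# Usage: sample a caption from the decoder, score it, backprop the loss.
log_probs = torch.log(torch.tensor([0.4, 0.6, 0.5]))  # stand-in sample
caption = "short sleeve cotton tee"
reward = mixed_reward(caption, {"short sleeve", "cotton"}, sentence_score=0.7)
loss = reinforce_loss(log_probs, reward)
```

A caption that names more ground-truth attributes earns a higher reward, so its tokens' log-probabilities are reinforced more strongly.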
“…short sleeve, cotton and jersey) rather than only the coarse representation (what, where) used in the general domain. In this case, the current general VL models [9,60] are sub-optimal for fashion-based tasks [1,26,67], and deploying global-feature-based models can be unfavorable for attribute-aware tasks, such as fashion captioning [75] and fashion catalog/object search [15], where it is essential to extract fine-grained features or similarities [65] from image and text.…”
confidence: 99%
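The global vs. fine-grained distinction this statement draws can be made concrete: a global score compares one pooled vector per modality, while a fine-grained score matches each text token against individual image patches. The late-interaction scorer below is an illustrative sketch of that idea, not the cited models' actual implementation.

```python
# Global vs. fine-grained image-text similarity (illustrative sketch).
import torch
import torch.nn.functional as F

def global_similarity(img_vec, txt_vec):
    # One pooled embedding per modality: a single cosine score.
    return F.cosine_similarity(img_vec, txt_vec, dim=-1)

def fine_grained_similarity(patch_embs, token_embs):
    # patch_embs: (P, D) image-patch embeddings; token_embs: (T, D) text tokens.
    patch_embs = F.normalize(patch_embs, dim=-1)
    token_embs = F.normalize(token_embs, dim=-1)
    sim = token_embs @ patch_embs.T        # (T, P) token-patch cosines
    # Each token (e.g. "sleeve", "cotton") attends to its best-matching patch.
    return sim.max(dim=1).values.mean()
```

The fine-grained score rewards an image only if every attribute word finds a matching region, which is what attribute-aware fashion tasks require of their features.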