Fine-grained information extraction from fashion imagery is a challenging task due to the inherent diversity and complexity of fashion categories and attributes. Additionally, fashion imagery often depicts multiple items, and fashion items tend to follow hierarchical relations among object types, categories and attributes. In this study, we address both issues with a 2-step hierarchical deep learning pipeline consisting of (1) a low-granularity object type detection module (upper-body, lower-body, full-body, footwear) and (2) two classification modules for garment categories and attributes based on the outcome of the first step. For the category- and attribute-level classification stages we examine a hierarchical label sharing (HLS) technique in two settings: (1) single-task learning (STL w/ HLS) and (2) multi-task learning with RNN and visual attention (MTL w/ RNN+VA). Our approach progressively focuses on appropriately detailed features, automatically learning the hierarchical relations of fashion and enabling predictions on images depicting complete outfits. Empirically, STL w/ HLS reached 93.99% top-3 accuracy while MTL w/ RNN+VA reached 97.57% top-5 accuracy for category classification.
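A minimal PyTorch sketch of the two-step hierarchical pipeline described above is shown below. The backbone choices, feature dimensions, class counts and module names are illustrative assumptions, not the authors' exact implementation; the point is the coarse-to-fine information flow of hierarchical label sharing, where the attribute head also receives the object-type and category predictions.

```python
# Sketch only: dimensions, backbones and class counts are assumed, not from the paper.
import torch
import torch.nn as nn
import torchvision

NUM_OBJECT_TYPES = 4      # upper-body, lower-body, full-body, footwear
NUM_CATEGORIES = 50       # assumed number of garment categories
NUM_ATTRIBUTES = 1000     # assumed number of attributes

class HierarchicalFashionPipeline(nn.Module):
    def __init__(self):
        super().__init__()
        # Step 1: low-granularity object-type detector (Faster R-CNN as a stand-in).
        self.detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(
            weights=None, num_classes=NUM_OBJECT_TYPES + 1)  # +1 for background
        # Step 2: shared CNN backbone applied to the detected crops.
        backbone = torchvision.models.resnet50(weights=None)
        backbone.fc = nn.Identity()
        self.backbone = backbone
        self.category_head = nn.Linear(2048, NUM_CATEGORIES)
        # Hierarchical label sharing: the attribute head also sees the object-type
        # probabilities and the category distribution (coarse -> fine).
        self.attribute_head = nn.Linear(
            2048 + NUM_OBJECT_TYPES + NUM_CATEGORIES, NUM_ATTRIBUTES)

    def forward(self, crops, object_type_probs):
        feats = self.backbone(crops)                       # (B, 2048)
        cat_logits = self.category_head(feats)             # (B, NUM_CATEGORIES)
        shared = torch.cat(
            [feats, object_type_probs, cat_logits.softmax(dim=-1)], dim=-1)
        attr_logits = self.attribute_head(shared)          # (B, NUM_ATTRIBUTES)
        return cat_logits, attr_logits
```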
Estimating the preferences of consumers is of utmost importance for the fashion industry, as appropriately leveraging this information can be beneficial in terms of profit. Trend detection in fashion is a challenging task due to the fast pace of change in the fashion industry. Moreover, forecasting the visual popularity of new garment designs is even more demanding due to the lack of historical data. To this end, we propose MuQAR, a Multimodal Quasi-AutoRegressive deep learning architecture that combines two modules: (1) a multimodal multilayer perceptron processing categorical, visual and textual features of the product and (2) a Quasi-AutoRegressive neural network modelling the “target” time series of the product’s attributes along with the “exogenous” time series of all other attributes. We utilize computer vision, image classification and image captioning, for automatically extracting visual features and textual descriptions from the images of new products. Product design in fashion is initially expressed visually, and these features capture a product’s unique characteristics without interfering with the designers’ creative process by requiring additional inputs (e.g. manually written texts). We employ the time series of the product’s target attributes as a proxy of temporal popularity patterns, mitigating the lack of historical data, while exogenous time series help capture trends among interrelated attributes. We perform an extensive ablation analysis on two large-scale fashion image datasets, Mallzee-P and SHIFT15m, to assess the adequacy of MuQAR, and also use the Amazon Reviews: Home and Kitchen dataset to assess generalization to other domains. A comparative study on the VISUELLE dataset shows that MuQAR is capable of competing with and surpassing the domain’s current state of the art by 4.65% and 4.8% in terms of WAPE and MAE, respectively.
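The following is a minimal PyTorch sketch of the MuQAR architecture as described above: a multimodal MLP over visual, textual and categorical product features, fused with a quasi-autoregressive module over the target attribute time series and the exogenous time series of all other attributes. The feature dimensions, the use of an LSTM for the quasi-autoregressive part, and the late-fusion forecasting head are assumptions for illustration, not the published configuration.

```python
# Sketch only: dimensions, the LSTM choice and the fusion scheme are assumptions.
import torch
import torch.nn as nn

class MuQARSketch(nn.Module):
    def __init__(self, visual_dim=2048, text_dim=768, cat_dim=64,
                 num_attributes=100, hidden=256, horizon=12):
        super().__init__()
        # (1) Multimodal MLP over visual, textual and categorical product features.
        self.mm_mlp = nn.Sequential(
            nn.Linear(visual_dim + text_dim + cat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        # (2) Quasi-autoregressive module: target attribute series concatenated
        #     with the exogenous series of all other attributes at each time step.
        self.qar = nn.LSTM(input_size=1 + num_attributes, hidden_size=hidden,
                           batch_first=True)
        # Fusion and forecasting head predicting popularity over the horizon.
        self.head = nn.Linear(2 * hidden, horizon)

    def forward(self, visual, text, categorical, target_series, exogenous_series):
        # target_series: (B, T, 1), exogenous_series: (B, T, num_attributes)
        mm = self.mm_mlp(torch.cat([visual, text, categorical], dim=-1))  # (B, hidden)
        _, (h, _) = self.qar(torch.cat([target_series, exogenous_series], dim=-1))
        fused = torch.cat([mm, h[-1]], dim=-1)                            # (B, 2*hidden)
        return self.head(fused)                                           # (B, horizon)
```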