As demand for understanding the visual style of interior scenes grows, estimating style compatibility is becoming increasingly challenging. Furniture styles in particular are difficult to define because they depend on many elements, such as color and shape, which makes furniture style an inherently ambiguous concept. To reduce this ambiguity, Siamese networks have frequently been used to estimate style compatibility by incorporating additional features that represent style. Even with such supplementary image features, however, it remains difficult to represent the style of a piece of furniture accurately. In this paper, we propose a new Siamese model that learns from several furniture images simultaneously. Specifically, we propose a one-to-many ratio input method that maintains high performance even when the inputs are ambiguous. We also propose a new metric for evaluating Siamese networks: the conventional metric, the area under the ROC curve (AUC), does not reveal the actual distance between styles, so the proposed metric quantitatively evaluates that distance using the distances between the embeddings of the furniture images. Experiments show that the proposed model improves the AUC from 0.672 to 0.721 and outperforms the conventional Siamese model on the proposed metric.
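The abstract does not specify the architecture, but a minimal sketch of the one-to-many comparison idea might look like the following; the CNN encoder, mean-pooling of the reference embeddings, and cosine distance are illustrative assumptions, not the paper's published configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleEncoder(nn.Module):
    # Small CNN mapping a furniture image to a unit-length style embedding (illustrative only).
    def __init__(self, embed_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, embed_dim)

    def forward(self, x):
        h = self.features(x).flatten(1)
        return F.normalize(self.fc(h), dim=1)

def one_to_many_distance(encoder, anchor, references):
    # anchor: (B, 3, H, W); references: (B, K, 3, H, W), i.e. K furniture images per anchor.
    b, k = references.shape[:2]
    z_a = encoder(anchor)                                   # (B, D) anchor embeddings
    z_r = encoder(references.flatten(0, 1)).view(b, k, -1)  # (B, K, D) reference embeddings
    z_r = F.normalize(z_r.mean(dim=1), dim=1)               # pool the K reference embeddings
    return 1.0 - (z_a * z_r).sum(dim=1)                     # cosine distance; smaller = more compatible styles

In the same spirit, the proposed evaluation metric could aggregate such embedding distances within and across style classes, although the exact formula is given only in the paper itself.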
Text-based search engines can retrieve various types of information when a user enters an appropriate query, but text-based search often fails at image retrieval when genuine image understanding is required. Deep learning (DL) is widely used for image tasks, and various DL methods successfully extract visual features. However, because human perception differs between individuals, large datasets of images evaluated by human subjects are rarely available, whereas DL requires a considerable amount of data to estimate space ambiance; moreover, the resulting DL models are difficult to interpret. In addition, texture has been reported to be closely related to space ambiance. Therefore, this study uses the bag of visual words (BoVW). By applying a hierarchical representation to BoVW, we propose a new interior style detection method using multi-scale features and boosting. The multi-scale features combine global features from BoVW with local features obtained through object detection. Experiments on an image understanding task were conducted on a dataset of room images spanning multiple styles. The results show that the proposed method improves accuracy by 0.128 over the conventional method and by 0.021 over a residual network. The proposed method therefore detects interior style more effectively by using multi-scale features.
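As an illustration of the kind of pipeline the abstract describes, the sketch below concatenates a global BoVW histogram with histograms computed inside detected-object regions and feeds the result to a boosted classifier. The SIFT descriptors, codebook size, mean-pooling of the local histograms, and gradient boosting classifier are assumptions made for the sketch, not the paper's exact configuration.

import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.ensemble import GradientBoostingClassifier

VOCAB_SIZE = 256
sift = cv2.SIFT_create()

def build_codebook(images):
    # Cluster SIFT descriptors from training images into a visual vocabulary.
    descs = [sift.detectAndCompute(img, None)[1] for img in images]
    descs = np.vstack([d for d in descs if d is not None])
    return MiniBatchKMeans(n_clusters=VOCAB_SIZE, random_state=0).fit(descs)

def bovw_histogram(codebook, image):
    # L1-normalised visual-word histogram for one image or crop (SIFT expects grayscale).
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) if image.ndim == 3 else image
    _, desc = sift.detectAndCompute(gray, None)
    hist = np.zeros(VOCAB_SIZE)
    if desc is not None:
        words = codebook.predict(desc)
        hist = np.bincount(words, minlength=VOCAB_SIZE).astype(float)
    return hist / max(hist.sum(), 1.0)

def multi_scale_features(codebook, image, object_boxes):
    # Global histogram over the whole room image + mean histogram over detected-object crops.
    global_hist = bovw_histogram(codebook, image)
    crops = [image[y:y + h, x:x + w] for (x, y, w, h) in object_boxes]
    local_hists = [bovw_histogram(codebook, c) for c in crops] or [np.zeros(VOCAB_SIZE)]
    return np.concatenate([global_hist, np.mean(local_hists, axis=0)])

# Training: X stacks the multi-scale feature vectors, y holds the style labels.
# clf = GradientBoostingClassifier(n_estimators=200, random_state=0).fit(X, y)

The object boxes would come from whatever detector is used for the local features; any off-the-shelf detector could stand in here, since the abstract does not name one.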