2019 IEEE International Conference on Multimedia & Expo Workshops (ICMEW)
DOI: 10.1109/icmew.2019.00134
Multi-modal Representation Learning for Short Video Understanding and Recommendation

Abstract: Video advertisement content structuring aims to segment a given video advertisement and label each segment on various dimensions, such as presentation form, scene, and style. Unlike real-life videos, video advertisements contain rich and useful multi-modal content, such as captions and speech, which provides crucial video semantics and can enhance the structuring process. In this paper, we propose a multi-modal encoder to learn multi-modal representation from video advertisements by interacting betw…
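The abstract describes a multi-modal encoder that learns representations by letting modalities interact. The paper's exact architecture is truncated here, so the following is only a minimal illustrative sketch of one common interaction mechanism (cross-modal attention between frame features and caption/speech token features); the function names and dimensions are assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(video_feats, text_feats):
    """Hypothetical interaction step: each video frame attends over
    caption/speech tokens, yielding a text-conditioned frame feature."""
    # video_feats: (num_frames, d), text_feats: (num_tokens, d)
    scores = video_feats @ text_feats.T / np.sqrt(video_feats.shape[1])
    weights = softmax(scores, axis=-1)        # (num_frames, num_tokens)
    attended = weights @ text_feats           # (num_frames, d)
    # Fuse by concatenating each visual feature with its text context
    return np.concatenate([video_feats, attended], axis=-1)

rng = np.random.default_rng(0)
fused = cross_modal_attention(rng.normal(size=(8, 64)),
                              rng.normal(size=(12, 64)))
print(fused.shape)  # (8, 128)
```

A real encoder would learn projection matrices for queries, keys, and values; this sketch omits them to keep the interaction pattern visible.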

Cited by 12 publications (10 citation statements)
References 22 publications
“…Also, when working with social networks, user data was often used as additional information when creating the embedding. User history was often used to cluster items that possess similar user bases [19,46].…”
Section: What Resources Were Used In the Creation Of The Embeddings?
confidence: 99%
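The statement above notes that user history can cluster items with similar user bases. A minimal sketch of the underlying idea, under the assumption that items are represented by binary user-interaction vectors (the data and variable names here are invented for illustration):

```python
import numpy as np

# Rows: items, columns: users; 1 = the user interacted with the item.
# Items that share a user base have similar rows, hence high cosine similarity.
interactions = np.array([
    [1, 1, 1, 0, 0],   # item 0
    [1, 1, 0, 0, 0],   # item 1: audience overlaps heavily with item 0
    [0, 0, 0, 1, 1],   # item 2: disjoint audience
], dtype=float)

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

sim_01 = cosine_sim(interactions[0], interactions[1])
sim_02 = cosine_sim(interactions[0], interactions[2])
print(sim_01 > sim_02)  # True: shared users pull items 0 and 1 together
```

Any standard clustering algorithm applied to these similarity scores would then group items 0 and 1 while separating item 2.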
“…The advent of deep learning delivered highly powerful hashing algorithms. Convolutional neural networks (CNNs) were largely used to encode spatial content [13], [14], [15], [16]. Gou et al [13] used ResNet [17] to encode video and even audio features. Xu et al [14] proposed to aggregate CNN frame features with VLAD encoding for event detection, sacrificing temporal coherence as a whole.…”
Section: Related Work
confidence: 99%
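The statement mentions aggregating CNN frame features with VLAD encoding, which discards temporal order. A minimal sketch of VLAD aggregation (the centroids would normally come from k-means over training descriptors; here they are random placeholders):

```python
import numpy as np

def vlad_encode(frame_feats, centroids):
    """Aggregate per-frame descriptors into one VLAD vector:
    accumulate residuals to each descriptor's nearest centroid,
    flatten, and L2-normalize. Frame order plays no role."""
    # frame_feats: (num_frames, d), centroids: (k, d)
    d2 = ((frame_feats[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)                    # (num_frames,)
    k, d = centroids.shape
    vlad = np.zeros((k, d))
    for i, c in enumerate(nearest):
        vlad[c] += frame_feats[i] - centroids[c]   # residual accumulation
    vlad = vlad.ravel()
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad

rng = np.random.default_rng(1)
enc = vlad_encode(rng.normal(size=(16, 32)), rng.normal(size=(4, 32)))
print(enc.shape)  # (128,)
```

Because residuals are summed over all frames regardless of position, shuffling the frames yields the same code, which is exactly the loss of temporal coherence the statement points out.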
“…Multiple types of deep neural networks were utilized to analyze additional data sources including social networks, geospatial locations, and textual reviews. Guo et al [47] presented a multimodal representation learning method to predict user preferences based on multimodal content, including visual features, text features, audio features, and user interactive history in short video understanding and recommendation. Chen et al [48] proposed a novel neural architecture for fashion recommendation based on both image region-level features and user review information.…”
Section: Deep Learning For Recommender Systems
confidence: 99%
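The statement describes predicting user preferences from visual, text, and audio features combined with user interaction history. A minimal late-fusion sketch under assumed fixed-size embeddings per modality (the scorer, weights, and dimensions are all hypothetical, not the method of [47]):

```python
import numpy as np

def preference_score(visual, text, audio, user_hist, w):
    """Hypothetical late-fusion scorer: concatenate modality embeddings
    with a user-history embedding and apply a linear head + sigmoid."""
    x = np.concatenate([visual, text, audio, user_hist])
    return 1.0 / (1.0 + np.exp(-(w @ x)))

rng = np.random.default_rng(2)
v, t, a, u = (rng.normal(size=16) for _ in range(4))  # placeholder embeddings
w = rng.normal(size=64)                               # placeholder linear head
score = preference_score(v, t, a, u, w)
print(0.0 < score < 1.0)  # True: sigmoid maps the logit into (0, 1)
```

A trained system would learn `w` (or a deeper fusion network) from click or watch-time labels; the concatenate-then-score structure is the part this sketch illustrates.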