Automatic music soundtrack generation for outdoor videos from contextual sensor information

Yu, Yi; Shen, Zhijie; Zimmermann, Roger

doi:10.1145/2393347.2396493

Cited by 34 publications

(12 citation statements)

References 2 publications


“…We built an offline database of music consisting of songs with the mood tags listed in Figure 2, using the method described by Yu et al [14]. In Last.fm, songs are labeled with social tags based on the feedback of different users.…”

Section: Offline Music Dataset (mentioning)

confidence: 99%

“…Given a sequence of ranked mood tags G recognized from the trained SVM hmm model, the most frequent mood tags are selected and used as keys to find relevant songs from the hash table. We use a modified mean reciprocal rank (MMRR) method described in Yu et al [14] to retrieve songs sorted in decreasing order of their MMRR metric and the top-N songs are returned.…”

Section: Music Track Recommendation (mentioning)

confidence: 99%

User preference-aware music video generation based on modeling scene moods

Shah, Zimmermann, 2014. Proceedings of the 5th ACM Multimedia Systems Conference

Self Cite

Due to technical advances in mobile devices (e.g., smartphones, tablets) and wireless communications, people now can easily capture user-generated videos (UGVs) anywhere, anytime and instantly share their real-life experiences via social web sites. Enjoying videos has become very popular entertainment. One challenge is that many mobile videos do not have very appealing audio that was captured with the video. In this demonstration, to overcome this issue we propose a music video generation/creation system (Android app and backend system) that aims to make UGVs more attractive by generating scene-adaptive and user-preference aware music tracks. In our system, we take geographic categories, visual content and user listening history into account. In particular, the sequences of geographic categories and visual features are integrated into a SVM hmm model to predict video scene moods. The music genre, as a user preference is also exploited to personalize the recommended songs. We believe this is the first work that predicts scene moods from a real-world video dataset collected by users' daily outdoor recordings to facilitate user-preference aware music video generation. Our experiments confirm that our system can effectively combine objective scene moods and individual music tastes to recommend appealing soundtracks for videos. Our Android app only sends recorded sensor data and a few keyframes of a UGV to a cloud service (backend system) to retrieve recommended music tracks, therefore it is bandwidth efficient since the transmission of video data is not required for analysis.

“…There exist a few approaches [6,27,29] to recognize emotions from videos but the field of video soundtrack recommendation for UGVs [24,34] is largely unexplored. Hanjalic et al [6] proposed a computational framework for affective video content representation and modeling based on the dimensional approach to affect.…”

Section: Related Work (mentioning)

confidence: 99%

Advisor

Shah, Zimmermann, 2014. Proceedings of the 22nd ACM International Conference on Multimedia

Self Cite

Capturing videos anytime and anywhere, and then instantly sharing them online, has become a very popular activity. However, many outdoor user-generated videos (UGVs) lack a certain appeal because their soundtracks consist mostly of ambient background noise. Aimed at making UGVs more attractive, we introduce ADVISOR, a personalized video soundtrack recommendation system. We propose a fast and effective heuristic ranking approach based on heterogeneous late fusion by jointly considering three aspects: venue categories, visual scene, and user listening history. Specifically, we combine confidence scores, produced by SVM hmm models constructed from geographic, visual, and audio features, to obtain different types of video characteristics. Our contributions are threefold. First, we predict scene moods from a real-world video dataset that was collected from users' daily outdoor activities. Second, we perform heuristic rankings to fuse the predicted confidence scores of multiple models, and third we customize the video soundtrack recommendation functionality to make it compatible with mobile devices. A series of extensive experiments confirm that our approach performs well and recommends appealing soundtracks for UGVs to enhance the viewing experience.

“…However, the connection between them is not well explored so for. Effective matching techniques between music and image have various applications in cross-modal retrieval, music exploration [1], [2], and automatic music video generation [3], [4]. For example, music only may be tedious, but appears with image or video clips will bring more acoustic-visual enjoyment.…”

Section: Introduction (mentioning)

confidence: 99%

“…But customizing the cover for every single music still remains an problem since an album always contain more than one song. Music generation for photo show and video have been studied in [3], [4], where emotion and contextual sensor information are utilized to help connecting music and video. Given the variety of user needs, in this work, we concentrate on the matching of music and image, one of the multimedia crossmodal matching tasks.…”

Section: Introduction (mentioning)

confidence: 99%

Bridging Music and Image via Cross-Modal Ranking Analysis

Qiao, Wang, et al., 2016. IEEE Trans. Multimedia

Human perceptions of music and image are closely related to each other, since both can inspire similar human sensations, such as emotion, motion, and power. This paper aims to explore whether and how music and image can be automatically matched by machines. The main contributions are three aspects. First, we construct a benchmark dataset composed of more than 45,000 music-image pairs. Human labelers are recruited to annotate whether these pairs are well-matched or not. The results show that they generally agree with each other on the matching degree of music-image pairs. Secondly, we investigate suitable semantic representations of music and image for this cross-modal matching task. In particular, we adopt lyric as a middle-media to connect music and image, and design a set of lyric-based attributes for image representation. Thirdly, we propose cross-modal ranking analysis (CMRA) to learn the semantic similarity between music and image with ranking labeling information. CMRA aims to find the optimal embedding spaces for both music and image in the sense of maximizing the ordinal margin between music-image pairs. The proposed method is able to learn the non-linear relationship between music and image, and to integrate heterogeneous ranking data from different modalities into a unified space. Experimental results demonstrate that the proposed method outperforms state-of-the-art cross-modal methods in the music-image matching task, and achieves a consistency rate of 91.5% with human labelers.
