Cross-covariance-based features for speech classification in film audio

Benatan, Matt; Ng, Kia

doi:10.1016/j.jvlc.2015.10.011

Cited by 5 publications

(1 citation statement)

References 14 publications


“…Speech audio is automatically detected and background noise muted prior to further audio analysis. Although prior work on this exists for radio broadcasts [26], news broadcasts [16] and feature films [3], speech detection in TV shows is a very different domain. Comedy shows exhibit canned/audience laughter more so than in films.…”

Section: Character Data Collection (mentioning)

confidence: 99%

Computer Vision – ECCV 2016 Workshops

2016

Lecture Notes in Computer Science


Abstract. The objective of this work is to build virtual talking avatars of characters fully automatically from TV shows. From this unconstrained data, we show how to capture a character's style of speech, visual appearance and language in an effort to construct an interactive avatar of the person and effectively immortalize them in a computational model. We make three contributions (i) a complete framework for producing a generative model of the audiovisual and language of characters from TV shows; (ii) a novel method for aligning transcripts to video using the audio; and (iii) a fast audio segmentation system for silencing non-spoken audio from TV shows. Our framework is demonstrated using all 236 episodes from the TV series Friends [34] (≈ 97 h of video) and shown to generate novel sentences as well as character specific speech and video.



Virtual Immortality: Reanimating Characters from TV Shows

Charles; Magee; Hogg

2016

Lecture Notes in Computer Science


