2013 IEEE International Conference on Acoustics, Speech and Signal Processing 2013
DOI: 10.1109/icassp.2013.6638338

Speaker trait characterization in web videos: Uniting speech, language, and facial features

Abstract: We present a multi-modal approach to speaker characterization using acoustic, visual and linguistic features. Full realism is provided by evaluation on a database of real-life web videos and automatic feature extraction including face and eye detection, and automatic speech recognition. Different segmentations are evaluated for the audio and video streams, and the statistical relevance of Linguistic Inquiry and Word Count (LIWC) features is confirmed. In the result, late multimodal fusion delivers 73, 92 and 7…

Cited by 4 publications (3 citation statements)
References 23 publications

“…Let us now turn to multimedia feature extraction by considering speaker characterization in web videos as in [10]. An audio-visual feature extraction scheme similar to the one described in [10] using a proprietary implementation of video feature extraction can now be realized by using openSMILE exclusively.…”
Section: Speaker Characterization In Web Videos (mentioning)
confidence: 99%
“…An audio-visual feature extraction scheme similar to the one described in [10] using a proprietary implementation of video feature extraction can now be realized by using openSMILE exclusively. The underlying idea is close to the 'toy' example from Figure 1, yet extracting over 1.5 k LLD-functional combinations from the audio (INTERSPEECH 2010 set, IS10, delivered as configuration file with the openSMILE distribution) and using sliding window lengths of 4 s. The configuration file for synchronized audio and video feature extraction is also delivered with the current release candidate.…”
Section: Speaker Characterization In Web Videos (mentioning)
confidence: 99%
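
The citation statement above mentions applying functionals to low-level descriptors (LLDs) over sliding windows of 4 s. As a rough illustration of that idea only, the sketch below computes four simple functionals over one LLD contour. It is not openSMILE code and does not reproduce the IS10 feature set: the function name `sliding_functionals`, the choice of functionals, and the 1 s hop size are assumptions made for this example; only the 4 s window length comes from the quoted text.

```python
from statistics import mean, pstdev


def sliding_functionals(lld, frame_rate_hz, window_s=4.0, hop_s=1.0):
    """Summarize one LLD contour with simple functionals per sliding window.

    lld           -- sequence of per-frame values (e.g. an energy contour)
    frame_rate_hz -- number of LLD frames per second
    window_s      -- window length in seconds (4 s, as in the quoted text)
    hop_s         -- step between windows in seconds (assumed, not from the source)
    """
    win = int(round(window_s * frame_rate_hz))
    hop = int(round(hop_s * frame_rate_hz))
    if win <= 0 or hop <= 0:
        raise ValueError("window and hop must each cover at least one frame")

    features = []
    # Only full windows are kept; a trailing partial window is dropped.
    for start in range(0, len(lld) - win + 1, hop):
        chunk = lld[start:start + win]
        features.append({
            "start_s": start / frame_rate_hz,
            "mean": mean(chunk),
            "std": pstdev(chunk),  # population standard deviation
            "min": min(chunk),
            "max": max(chunk),
        })
    return features
```

Calling `sliding_functionals(contour, frame_rate_hz=100)` on a contour sampled at 100 frames per second would yield one feature dictionary per second of audio once the first 4 s are available. A real extraction pipeline would apply many more functionals to many LLDs at once.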