Proceedings of the 18th ACM International Conference on Multimedia 2010
DOI: 10.1145/1873951.1874094

Automatic role recognition based on conversational and prosodic behaviour

Abstract: This paper proposes an approach for the automatic recognition of roles in settings like news and talk-shows, where roles correspond to specific functions like Anchorman, Guest or Interview Participant. The approach is based on purely nonverbal vocal behavioral cues, including who talks when and how much (turn-taking behavior), and statistical properties of pitch, formants, energy and speaking rate (prosodic behavior). The experiments have been performed over a corpus of around 50 hours of broadcast material an…
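
The abstract names two feature families: turn-taking statistics and prosodic statistics. As a rough illustration only (the paper's actual feature-extraction pipeline is not reproduced here), the sketch below computes comparable quantities with the `librosa` library. The segment format, the frequency bounds, and the restriction to pitch and energy (omitting the formants and speaking rate the paper also uses) are assumptions for the sketch, not the authors' method.

```python
# Hedged sketch: illustrative only, not the paper's implementation.
import numpy as np
import librosa

def turn_taking_stats(turns):
    """turns: list of (speaker, start_s, end_s) tuples, assumed to come
    from some upstream speaker-segmentation step.
    Returns per-speaker turn count, total speaking time, mean turn length."""
    stats = {}
    for spk, start, end in turns:
        s = stats.setdefault(spk, {"n_turns": 0, "total_s": 0.0})
        s["n_turns"] += 1
        s["total_s"] += end - start
    for s in stats.values():
        s["mean_turn_s"] = s["total_s"] / s["n_turns"]
    return stats

def prosodic_stats(wav_path, sr=16000):
    """Statistical properties of pitch and energy for one speaker's audio.
    Frequency bounds (65-400 Hz) are an assumed range for adult speech."""
    y, sr = librosa.load(wav_path, sr=sr)
    f0, voiced, _ = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr)  # pitch track
    rms = librosa.feature.rms(y=y)[0]                              # frame energy
    return {
        "pitch_mean": float(np.nanmean(f0)),   # NaN frames are unvoiced
        "pitch_std": float(np.nanstd(f0)),
        "energy_mean": float(rms.mean()),
        "energy_std": float(rms.std()),
    }
```

A role classifier would then map such per-speaker feature vectors to labels like Anchorman or Guest; the paper evaluates this kind of pipeline over roughly 50 hours of broadcast data.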

Cited by 9 publications (5 citation statements)
References 12 publications

Citation statements:
“…Using the same data and targeting the same roles as above, Salamin et al. [23] exploited specific acoustic features, such as who talks when and how much (turn-taking), and statistical properties of pitch, formant, energy, and speaking rate, reporting an accuracy of 89%.…”
Section: B. Social Computing Approaches (mentioning; confidence: 99%)
“…We also have compared our results to prior similar studies on the social dynamics of small groups using different technologies. These prior studies commonly collected much richer data (e.g., speaking turn and prosodic cues [100,101], head and body activity [101]) in both the visual and audio channels, whereas our tags collect only body distances and orientation. The most relevant work is [117], where Zancanaro et al. used cameras and microphones to analyze the roles played by team members in relation to the tasks the group has to face ("Task Area") and in relation to the functioning of the group ("Socio-Emotional Area").…”
Section: Results (mentioning; confidence: 99%)
“…In [54], group cohesion is studied using hours of audio-visual group meeting data. [100] uses prosodic and turn-taking behaviors to identify participants' speaking roles. [55] estimates group formations in crowded environments using a graph clustering algorithm.…”
Section: Related Work (mentioning; confidence: 99%)
“…The method is based on a minimal set of objective speech tags that does not depend on annotators' agreement (speakers, acoustic silences, and overlaps) and requires no time-consuming transcriptions. Methods such as the one we suggest can further contribute to automatic role recognition [28], [29], and [30], to client-therapist automatic assessment tools such as the Motivational Interviewing Skills Code (MISC) [31], and to human-computer interfaces [32].…”
Section: Results (mentioning; confidence: 99%)