Automatic Detection of Laughter and Fillers in Spontaneous Mobile Phone Conversations

Salamin, Hugues; Polychroniou, Anna; Vinciarelli, Alessandro

doi:10.1109/smc.2013.730

Cited by 25 publications

(20 citation statements)

References 17 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…We use the SSPNet Vocalization corpus (SVC) (Salamin et al, 2013) for the experiments in this paper. This data was used as the benchmark during the Interspeech challenge and provides a platform for comparison of various algorithmic methods (Kaya et al, 2013; Pammi and Chetouani, 2013; Krikke and Truong, 2013; Brueckner and Schulter, 2014; An et al, 2013) The dataset consists of 2763 audio clips, each 11 seconds long.…”

Section: Databasementioning

confidence: 99%

“…We list the statistics for laughter and filler events over the entire database in Table 1. For more details on the dataset please refer to (Salamin et al, 2013; Schuller et al, 2013). …”

Section: Databasementioning

confidence: 99%

“…In particular, the Interspeech 2013 Social Signals Sub-challenge (Schuller et al, 2013) led to several investigations (Kaya et al, 2013; Pammi and Chetouani, 2013; Krikke and Truong, 2013; Brueckner and Schulter, 2014; An et al, 2013) on frame-wise detection of two specific non-verbal events: laughters and fillers. Building upon on our efforts (Gupta et al, 2013) on the same challenge dataset (Salamin et al, 2013) (that was the winning entry in the challenge), in this paper we perform further analysis and experiments. Previous works in this research field have primarily focused on local characteristics and our approach investigates the benefits of considering context during the frame-wise prediction.…”

Section: Introductionmentioning

confidence: 99%

See 2 more Smart Citations

Detecting paralinguistic events in audio stream using context in features and probabilistic decisions

Gupta

Audhkhasi

Lee

et al. 2016

Computer Speech & Language

View full text Add to dashboard Cite

Non-verbal communication involves encoding, transmission and decoding of non-lexical cues and is realized using vocal (e.g. prosody) or visual (e.g. gaze, body language) channels during conversation. These cues perform the function of maintaining conversational flow, expressing emotions, and marking personality and interpersonal attitude. In particular, non-verbal cues in speech such as paralanguage and non-verbal vocal events (e.g. laughters, sighs, cries) are used to nuance meaning and convey emotions, mood and attitude. For instance, laughters are associated with affective expressions while fillers (e.g. um, ah, um) are used to hold floor during a conversation. In this paper we present an automatic non-verbal vocal events detection system focusing on the detect of laughter and fillers. We extend our system presented during Interspeech 2013 Social Signals Sub-challenge (that was the winning entry in the challenge) for frame-wise event detection and test several schemes for incorporating local context during detection. Specifically, we incorporate context at two separate levels in our system: (i) the raw frame-wise features and, (ii) the output decisions. Furthermore, our system processes the output probabilities based on a few heuristic rules in order to reduce erroneous frame-based predictions. Our overall system achieves an Area Under the Receiver Operating Characteristics curve of 95.3% for detecting laughters and 90.4% for fillers on the test set drawn from the data specifications of the Interspeech 2013 Social Signals Sub-challenge. We perform further analysis to understand the interrelation between the features and obtained results. Specifically, we conduct a feature sensitivity analysis and correlate it with each feature's stand alone performance. The observations suggest that the trained system is more sensitive to a feature carrying higher discriminability with implications towards a better system design.

show abstract

Section: Databasementioning

confidence: 99%

“…We list the statistics for laughter and filler events over the entire database in Table 1. For more details on the dataset please refer to (Salamin et al, 2013; Schuller et al, 2013). …”

Section: Databasementioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

See 1 more Smart Citation

Detecting paralinguistic events in audio stream using context in features and probabilistic decisions

Gupta

Audhkhasi

Lee

et al. 2016

Computer Speech & Language

View full text Add to dashboard Cite

show abstract

“…[10,11,12]) is to extract a huge variety of audio-based features, and then perform classification at the utterance level. Notice that no machine learning is done at the frame level; however, in ASR (and in similar tasks such as laughter detection [13,14]) fine-tuned solutions exist on how frames should be classified. Unfortunately, these are usually ignored in computational paralinguistics, and in the notable exceptions when they are not (e.g.…”

Section: Introductionmentioning

confidence: 99%

DNN-Based Feature Extraction and Classifier Combination for Child-Directed Speech, Cold and Snoring Identification

et al. 2017

View full text Add to dashboard Cite

In this study we deal with the three sub-challenges of the Interspeech ComParE Challenge 2017, where the goal is to identify child-directed speech, speakers having a cold, and different types of snoring sounds. For the first two sub-challenges we propose a simple, two-step feature extraction and classification scheme: first we perform frame-level classification via Deep Neural Networks (DNNs), and then we extract utterancelevel features from the DNN outputs. By utilizing these features for classification, we were able to match the performance of the standard paralinguistic approach (which involves extracting thousands of features, many of them being completely irrelevant to the actual task). As for the Snoring Sub-Challenge, we divided the recordings into segments, and averaged out some frame-level features segment-wise, which were then used for utterance-level classification. When combining the predictions of the proposed approaches with those got by the standard paralinguistic approach, we managed to outperform the baseline values of the Cold and Snoring sub-challenges on the hidden test sets.

show abstract

“…The Social Signals Sub-Challenge of the Interspeech 2013 Computational Paralinguistics Challenge (ComParE) [5] further kindled research activities on laughter and filler detection [6,7,8,9] by providing a baseline database to compare research efforts.…”

Section: Introductionmentioning

confidence: 99%

Spotting Social Signals in Conversational Speech over IP: A Deep Learning Perspective

et al. 2017

View full text Add to dashboard Cite

The automatic detection and classification of social signals is an important task, given the fundamental role nonverbal behavioral cues play in human communication. We present the first cross-lingual study on the detection of laughter and fillers in conversational and spontaneous speech collected 'in the wild' over IP (internet protocol). Further, this is the first comparison of LSTM and GRU networks to shed light on their performance differences. We report frame-based results in terms of the unweighted-average area-under-the-curve (UAAUC) measure and will shortly discuss its suitability for this task. In the mono-lingual setup our best deep BLSTM system achieves 87.0 % and 86.3 % UAAUC for English and German, respectively. Interestingly, the cross-lingual results are only slightly lower, yielding 83.7 % for a system trained on English, but tested on German, and 85.0 % in the opposite case. We show that LSTM and GRU architectures are valid alternatives for e. g., on-line and compute-sensitive applications, since their application incurs a relative UAAUC decrease of only approximately 5% with respect to our best systems. Finally, we apply additional smoothing to correct for erroneous spikes and drops in the posterior trajectories to obtain an additional gain in all setups.

show abstract

Automatic Detection of Laughter and Fillers in Spontaneous Mobile Phone Conversations

Cited by 25 publications

References 17 publications

Detecting paralinguistic events in audio stream using context in features and probabilistic decisions

Detecting paralinguistic events in audio stream using context in features and probabilistic decisions

DNN-Based Feature Extraction and Classifier Combination for Child-Directed Speech, Cold and Snoring Identification

Spotting Social Signals in Conversational Speech over IP: A Deep Learning Perspective

Contact Info

Product

Resources

About