2008
DOI: 10.1007/978-3-540-85980-2_18

Automatic Classification and Transcription of Telephone Speech in Radio Broadcast Data

Abstract: Automatic transcription of telephone speech involves additional challenges compared to wideband data processing, mainly due to channel limitations and to particular characteristics of conversational telephone speech. While in TV speech recognition applications, such as automatic transcription of broadcast news, the presence of telephone data is nearly insignificant (less than 1%), in most radio broadcast stations the presence of telephone speech grows significantly. Thus, transcription of telephone …

Cited by 7 publications
(2 citation statements)
References 12 publications
“…The version of AUDIMUS integrated in VITHEA uses an acoustic model trained with 57 h of downsampled Broadcast News data and 58 h of mixed fixed-telephone and mobile-telephone data in European Portuguese (Abad and Neto, 2008). The number of context input frames is 13 for the PLP and RASTA networks and 15 for the MSG network.…”
Section: The Baseline Speech Recognizer (mentioning)
confidence: 99%
“…The baseline system combines three MLP-based acoustic models trained with Perceptual Linear Prediction features (PLP, 13 static + first derivative), log-RelAtive SpecTrAl features (RASTA, 13 static + first derivative) and Modulation SpectroGram features (MSG, 28 static). These model networks were trained with 57 hours of downsampled Broadcast News data and 58 hours of mixed fixed-telephone and mobile-telephone data in European Portuguese [19]. The number of context input frames is 13 for the PLP and RASTA networks and 15 for the MSG network.…”
Section: ASR/KWS System (mentioning)
confidence: 99%
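
The feature and context sizes quoted in the statements above imply a rough input-layer size for each MLP. The sketch below is only an illustration under two assumptions that the quoted text does not confirm: that "13 static + first derivative" means 13 static coefficients plus 13 delta coefficients per frame, and that each network's input is the per-frame feature vector stacked over its context frames. The names `Stream`, `STREAMS`, and `input_dims` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stream:
    """One feature stream feeding one MLP acoustic model."""
    name: str
    static: int          # static coefficients per frame (from the quoted text)
    with_delta: bool     # whether first derivatives are appended
    context_frames: int  # context input frames (from the quoted text)

    def per_frame_dim(self) -> int:
        # Assumption: first derivatives add one delta per static coefficient.
        return self.static * (2 if self.with_delta else 1)

    def input_dim(self) -> int:
        # Assumption: the MLP input is the per-frame vector stacked over the context.
        return self.per_frame_dim() * self.context_frames


# Values taken from the citation statements; the layout is illustrative.
STREAMS = [
    Stream("PLP", static=13, with_delta=True, context_frames=13),
    Stream("RASTA", static=13, with_delta=True, context_frames=13),
    Stream("MSG", static=28, with_delta=False, context_frames=15),
]


def input_dims(streams):
    """Map each stream name to its assumed MLP input dimension."""
    return {s.name: s.input_dim() for s in streams}


if __name__ == "__main__":
    for name, dim in input_dims(STREAMS).items():
        print(f"{name}: {dim} inputs")
```

Under these assumptions the PLP and RASTA networks would each take 26 × 13 inputs and the MSG network 28 × 15; a different reading of "first derivative" or of the context stacking would change these figures.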