Sensor Array and Multichannel Signal Processing Workshop Proceedings, 2002
DOI: 10.1109/sam.2002.1191001
Audio-visual speech enhancement with AVCDCN (audio-visual codebook dependent cepstral normalization)

Cited by 22 publications (25 citation statements)
References 5 publications
“…Progress in addressing some or all of these questions can also benefit other areas where joint audio and visual speech processing is suitable [139], such as speaker identification and verification [49], [66], [109], [136], [140]–[142], visual text-to-speech [143]–[145], speech event detection [146], video indexing and retrieval [147], speech enhancement [102], [104], coding [148], signal separation [149], [150], and speaker localization [151]–[153]. Improvements in these areas will result in more robust and natural human-computer interaction.…”
Section: Summary and Discussion
confidence: 99%
“…Audio feature enhancement on the basis of either visual input [14], [101], or concatenated audio-visual features [102]–[104]… (4), while seeking the best discrimination among the speech classes of interest. In [99], LDA is followed by an MLLT rotation of the feature vector to improve statistical data modeling by means of Gaussian mixture emission probability densities with diagonal covariances, as in (1).…”
Section: A Feature Fusion
confidence: 99%
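The excerpt above describes projecting concatenated audio-visual features with LDA to sharpen class discrimination. A minimal NumPy sketch of that step, on synthetic data (dimensions and class structure are illustrative assumptions, not taken from the cited work):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical concatenated audio-visual feature vectors (sizes are
# illustrative only): 3 speech classes, 200 samples each, 101 dims.
n_per_class, dim, n_classes = 200, 101, 3
X = rng.normal(size=(n_classes * n_per_class, dim))
y = np.repeat(np.arange(n_classes), n_per_class)
# Shift class means so there is structure for LDA to find.
for c in range(n_classes):
    X[y == c] += c * 0.5

# Within-class (Sw) and between-class (Sb) scatter matrices.
mu = X.mean(axis=0)
Sw = np.zeros((dim, dim))
Sb = np.zeros((dim, dim))
for c in range(n_classes):
    Xc = X[y == c]
    mc = Xc.mean(axis=0)
    Sw += (Xc - mc).T @ (Xc - mc)
    d = (mc - mu)[:, None]
    Sb += len(Xc) * (d @ d.T)

# Solve the generalized eigenproblem Sb v = lambda Sw v and keep the
# top C-1 discriminant directions.
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
order = np.argsort(eigvals.real)[::-1][: n_classes - 1]
W = eigvecs[:, order].real  # projection matrix, dim x (C-1)

Z = X @ W                   # discriminant audio-visual features
print(Z.shape)              # (600, 2)
```

In the survey the excerpt quotes, this projection is followed by an MLLT rotation so diagonal-covariance Gaussian mixtures model the projected features better; that rotation is omitted here for brevity.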
“…The MSE criterion between the true and estimated outputs V(k) and V̂(k) is used as the performance measure in training. This is the same criterion as used in the contrast function (8). Networks were trained using the Levenberg-Marquardt algorithm [38], with early stopping based on a validation subset to avoid over-fitting.…”
Section: Building Audio-visual Models
confidence: 99%
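The training recipe quoted above (minimize MSE between true and estimated outputs, stop early on a validation subset) can be sketched as follows. This is a toy linear model trained with plain gradient descent on synthetic data; the cited work uses a neural network and the Levenberg-Marquardt optimizer, which are swapped out here to keep the sketch short, and all names and sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical audio-visual mapping data (illustrative, not the
# paper's): predict a 2-dim target V(k) from an 8-dim feature vector.
X = rng.normal(size=(300, 8))
true_w = rng.normal(size=(8, 2))
Y = X @ true_w + 0.1 * rng.normal(size=(300, 2))

# Split into training and validation subsets for early stopping.
X_tr, Y_tr = X[:200], Y[:200]
X_va, Y_va = X[200:], Y[200:]

def mse(W, Xs, Ys):
    """MSE criterion between estimated outputs Xs @ W and targets Ys."""
    return np.mean((Xs @ W - Ys) ** 2)

W = np.zeros((8, 2))
best_W, best_val, patience, bad = W.copy(), np.inf, 10, 0
for step in range(2000):
    # Gradient of the MSE criterion on the training subset.
    grad = 2.0 * X_tr.T @ (X_tr @ W - Y_tr) / len(X_tr)
    W -= 0.01 * grad
    val = mse(W, X_va, Y_va)
    if val < best_val - 1e-6:
        best_val, best_W, bad = val, W.copy(), 0
    else:
        bad += 1
        if bad >= patience:  # early stopping: validation MSE plateaued
            break

print(round(mse(best_W, X_va, Y_va), 4))
```

Early stopping keeps the parameters from the iteration with the lowest validation MSE, which is the over-fitting safeguard the excerpt describes.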
“…The complementary information of AV data is exploited in audio-visual speech recognition (AVSR) through early- (feature), middle- (model), or late- (decoding) stage fusion schemes to enhance robustness against acoustic distortions. In recent years (since 2001 [7]), researchers have proposed methods based on exploiting the coherent component of AV processes for applicable tasks like speech enhancement [7]–[9], acoustic feature enhancement [10], visual voice activity detection (VVAD) [11], and AV source separation (AVSS) [11]–[24].…”
Section: Introduction
confidence: 99%
“…In recent years, the visual modality has also been exploited for speech enhancement in (background) noise (Girin et al., 2001; Deligne et al., 2002; Potamianos et al., 2003b), and more generally for speech source separation, i.e., for the extraction of a speech signal from complex mixtures using several microphones, for both linear instantaneous mixtures (Sodoyer et al., 2002, 2004) and convolutive mixtures (Wang et al., 2005; Rivet et al., 2007).…”
Section: A Context: Audio-visual Speech Processing
confidence: 99%