2014
DOI: 10.1177/0278364914548050

Vision-guided robot hearing

Abstract: Natural human-robot interaction in complex and unpredictable environments is one of the main research lines in robotics. In typical real-world scenarios, humans are at some distance from the robot and the acquired signals are strongly impaired by noise, reverberations and other interfering sources. In this context, the detection and localisation of speakers plays a key role, since it is the pillar on which several tasks (e.g. speech recognition and speaker tracking) rely. We address the problem of how to detec…

Cited by 30 publications (7 citation statements)
References 49 publications
“…Setting p(z|x, y, m) = p(z|x, m) is equivalent to saying that Z and Y are conditionally independent given x, which can be expressed equivalently. We now further calculate this expression for the J-GMM model defined in (7). Injecting (7) into (9) leads to:…”
Section: Discussion (mentioning, confidence: 99%)
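For readers parsing the quoted derivation, the conditional-independence assumption it refers to is the standard factorisation below. This is a minimal restatement using the symbols of the quote (Z, Y observations, X the conditioning variable, m the mixture component), not the exact notation or equation numbering of either cited paper.

% Conditional independence of Z and Y given (X, m):
p(z \mid x, y, m) = p(z \mid x, m)
\quad\Longleftrightarrow\quad
p(z, y \mid x, m) = p(z \mid x, m)\, p(y \mid x, m)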
“…In contrast, we provided a detailed derivation in [18], where (19) is shown to result in two equivalent forms of a GMR expression, (25) and (26). Also, to be fully precise, (7) in [21] corresponds to (26) in [18] up to two differences that we interpret as typos: First, the term Σ…”
Section: A. E-step (mentioning, confidence: 99%)
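The quoted passage compares equivalent algebraic forms of a Gaussian mixture regression (GMR) expression. As background only, here is a minimal numpy sketch of the standard GMR conditional mean; the function name, argument layout and dimensions are illustrative assumptions, not code from either cited paper.

import numpy as np
from scipy.stats import multivariate_normal

def gmr_conditional_mean(x, weights, means, covs, dx):
    """Conditional mean E[y | x] under a joint GMM over (x, y).

    weights: (M,) mixture weights
    means:   (M, dx+dy) joint means
    covs:    (M, dx+dy, dx+dy) joint covariances
    dx:      dimension of the input part x
    """
    M = len(weights)
    resp = np.empty(M)
    cond_means = []
    for m in range(M):
        mu_x, mu_y = means[m][:dx], means[m][dx:]
        S_xx = covs[m][:dx, :dx]
        S_yx = covs[m][dx:, :dx]
        # responsibility of component m for the observed x
        resp[m] = weights[m] * multivariate_normal.pdf(x, mean=mu_x, cov=S_xx)
        # component-wise conditional mean of y given x
        cond_means.append(mu_y + S_yx @ np.linalg.solve(S_xx, x - mu_x))
    resp /= resp.sum()
    return sum(r * cm for r, cm in zip(resp, cond_means))

The two "equivalent forms" discussed in the quote are different ways of writing this same quantity; the sketch uses the responsibility-weighted component-wise form.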
“…In this paper we propose a novel multi-speaker tracking method inspired by previous research on "instantaneous" audio-visual fusion [11,12]. A dynamic Bayesian model is investigated to smoothly fuse acoustic and visual information over time from their feature spaces.…”
Section: Introduction (mentioning, confidence: 99%)
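To make the idea of fusing the two modalities over time concrete, the sketch below shows a generic discrete Bayes filter over a grid of candidate speaker directions. It is not the cited paper's model; the grid size, smoothing width and the per-frame likelihood inputs are hypothetical placeholders.

import numpy as np

N_DIRECTIONS = 72                      # assumed 5-degree azimuth bins
belief = np.full(N_DIRECTIONS, 1.0 / N_DIRECTIONS)

def predict(belief, spread=1):
    """Diffuse the belief to allow for speaker motion between frames."""
    kernel = np.ones(2 * spread + 1)
    kernel /= kernel.sum()
    # circular convolution over the azimuth grid
    padded = np.concatenate([belief[-spread:], belief, belief[:spread]])
    return np.convolve(padded, kernel, mode="valid")

def update(belief, audio_lik, visual_lik):
    """Fuse the modalities, assuming they are conditionally independent given the direction."""
    posterior = belief * audio_lik * visual_lik
    return posterior / posterior.sum()

# one filtering step with made-up per-direction likelihoods
audio_lik = np.random.rand(N_DIRECTIONS)
visual_lik = np.random.rand(N_DIRECTIONS)
belief = update(predict(belief), audio_lik, visual_lik)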
“…In this paper we propose to enforce audio-visual spatial coincidence, e.g., [1,8,10], rather than temporal coincidence, e.g., correlation [9,16], in diarization. We consider a setup consisting of people who are engaged in a multiparty conversation while they are free to move and to turn their attention away from the cameras.…”
Section: Introduction (mentioning, confidence: 99%)