2020
DOI: 10.1109/taslp.2020.2980974

Audiovisual Speaker Tracking Using Nonlinear Dynamical Systems With Dynamic Stream Weights

Abstract: Data fusion plays an important role in many technical applications that require efficient processing of multimodal sensory observations. A prominent example is audiovisual signal processing, which has gained increasing attention in automatic speech recognition, speaker localization and related tasks. If appropriately combined with acoustic information, additional visual cues can help to improve the performance in these applications, especially under adverse acoustic conditions. A dynamic weighting of acoustic …

Cited by 7 publications
(7 citation statements)
References 65 publications
“…(1) is based on assuming statistical independence between acoustic and visual observations, as well as imposing an uninformative prior on ζi,j ∀ i, j, cf. [16]. However, when working with audiovisual input data, modalities may not be equally informative in all spatial regions due to, e.g., challenging lighting conditions or a directional noise source.…”
Section: Data Fusion
Citation type: mentioning
confidence: 99%
“…Extending this concept to dynamic stream weights (DSWs) allows us to weight the contributions of each input based on their instantaneous reliability. This concept was originally proposed in the context of audiovisual automatic speech recognition (ASR) [14], and has proven valuable for speaker identification [15], but was also recently adopted for speaker localization and tracking [16,17].…”
Section: Introduction
Citation type: mentioning
confidence: 99%
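The stream-weighting idea in the statement above can be illustrated with a minimal sketch. It assumes the common log-linear form, in which a weight λ in [0, 1] scales the acoustic log-likelihood and 1 − λ scales the visual one. The function names, the three candidate positions, and the likelihood values are hypothetical and chosen only for illustration; this is not the estimator from the cited paper.

```python
import math


def fuse_log_likelihoods(log_lik_audio, log_lik_video, stream_weight):
    """Log-linear fusion: lam * log p_audio + (1 - lam) * log p_video.

    stream_weight (lam) is clipped to [0, 1]; lam = 1 trusts audio only,
    lam = 0 trusts video only. This form is an assumption of the sketch.
    """
    lam = min(max(stream_weight, 0.0), 1.0)
    return [lam * a + (1.0 - lam) * v
            for a, v in zip(log_lik_audio, log_lik_video)]


def normalize(log_scores):
    """Turn unnormalized log scores into probabilities that sum to one."""
    peak = max(log_scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in log_scores]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical log-likelihoods for three candidate speaker positions.
log_lik_audio = [-4.0, -1.0, -3.0]
log_lik_video = [-1.0, -2.0, -5.0]

# A dynamic stream weight would change per frame; here it is swept by hand.
for lam in (0.0, 0.5, 1.0):
    fused = fuse_log_likelihoods(log_lik_audio, log_lik_video, lam)
    print(lam, [round(p, 3) for p in normalize(fused)])
```

In a tracking setting the weight would be re-estimated at every time step from some reliability cue, so the fused score leans on whichever modality is currently more trustworthy.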
“…However, these methods prefer to use the detection results of the single modality to assist the other modality to obtain more accurate observations, while neglecting to fully utilize the complementarity and redundancy of audio-visual information. In addition, most of the existing audio-visual trackers use generation algorithms (Ban et al 2019; Schymura and Kolossa 2020; Qian et al 2017), which are difficult to adapt to random and diverse changes of target appearance. Furthermore, the likelihood calculation based on the color histogram or Euclidean distance is susceptible to interference from observation noise, which limits the performance of the fusion likelihood.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…The stGCF-based audio cues are mapped to the localization space consistent with the visual cues. The integrated audio-visual cues combined with perception weights evaluated by the multi-modal perception attention network generate a fusion map that guides update step of the PF-based multi-modal tracker. … weight flow (Schymura and Kolossa 2020). Probability Hypothesis Density (PHD) filter is introduced for tracking an unknown and variable number of speakers with the theory of Random Finite Sets (RFSs).…”
Citation type: mentioning
confidence: 99%