Audio–visual speech recognition systems suffer from severe information redundancy, complex inter-modal interactions, and difficult multimodal fusion when processing complex multimodal inputs. To address these problems, this paper proposes an adaptive fusion Transformer algorithm based on a sparse attention mechanism (AFT-SAM). The algorithm applies sparse attention during feature encoding to suppress excessive attention to unimportant regions, and dynamically adjusts the attention weights through adaptive fusion, allowing multimodal information to be captured and integrated more effectively while reducing the impact of redundant information on model performance. Experiments on the audio–visual speech recognition dataset LRS2 show that, compared with other algorithms, the proposed method achieves significantly lower word error rates (WERs) in the audio-only, visual-only, and audio–visual bimodal settings.
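
As a rough illustration of the two mechanisms named above, the following PyTorch sketch pairs a top-k sparse self-attention layer with a learned gated fusion of audio and visual features. The module names (`SparseSelfAttention`, `AdaptiveFusion`), the top-k sparsification rule, and the sigmoid gating are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch (assumed formulation): (1) sparse attention that keeps only the
# top-k scores per query so unimportant positions get zero weight, and
# (2) an adaptive gate that learns per-dimension weights for fusing audio and
# visual features.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseSelfAttention(nn.Module):
    """Single-head self-attention that zeroes all but the top-k scores per query."""

    def __init__(self, dim: int, top_k: int = 8):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)
        self.scale = dim ** -0.5
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = torch.matmul(q, k.transpose(-2, -1)) * self.scale  # (B, T, T)

        # Keep only the top-k scores per query; mask the rest to -inf so that
        # softmax assigns them exactly zero attention weight.
        k_eff = min(self.top_k, scores.size(-1))
        kth = scores.topk(k_eff, dim=-1).values[..., -1:]  # k-th largest score per query
        scores = scores.masked_fill(scores < kth, float("-inf"))

        attn = F.softmax(scores, dim=-1)
        return self.out(torch.matmul(attn, v))


class AdaptiveFusion(nn.Module):
    """Gated fusion: a learned sigmoid gate weights audio vs. visual features."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([audio, visual], dim=-1)))
        return g * audio + (1.0 - g) * visual  # (B, T, dim)


if __name__ == "__main__":
    B, T, D = 2, 50, 256
    audio_feats, visual_feats = torch.randn(B, T, D), torch.randn(B, T, D)

    attn = SparseSelfAttention(D, top_k=8)
    fusion = AdaptiveFusion(D)

    fused = fusion(attn(audio_feats), attn(visual_feats))
    print(fused.shape)  # torch.Size([2, 50, 256])
```

In this sketch the gate is computed per time step and per feature dimension, so the fusion can lean on the audio stream when it is reliable and on the visual stream otherwise; the actual AFT-SAM fusion and sparsity scheme should be taken from the paper itself.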