2021
DOI: 10.1109/tmm.2020.2976493
Re-Synchronization Using the Hand Preceding Model for Multi-Modal Fusion in Automatic Continuous Cued Speech Recognition

Abstract: Cued Speech (CS) is an augmented lip-reading system complemented by hand coding, and it is very helpful to deaf people. Automatic CS recognition can facilitate communication between deaf people and others. Due to the asynchronous nature of lip and hand movements, fusing them in automatic CS recognition is a challenging problem. In this work, we propose a novel re-synchronization procedure for multi-modal fusion, which aligns the hand features with the lip features. It is realized by delaying the hand position a…
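The abstract describes aligning hand features with lip features by delaying the hand stream, since the hand tends to precede the lips in Cued Speech. A minimal sketch of that idea, assuming frame-synchronous feature matrices and a fixed, hypothetical delay (the paper itself derives the delay from a hand preceding model rather than fixing it):

```python
import numpy as np

def resynchronize(lip_feats, hand_feats, delay_frames):
    """Delay the hand feature stream by `delay_frames` so it lines up
    with the lip stream, then fuse the two at the feature level.

    lip_feats:  (T, D_lip)  frame-level lip features.
    hand_feats: (T, D_hand) frame-level hand features.
    delay_frames: hypothetical number of frames by which the hand
    is assumed to lead the lips (illustrative, not the paper's value).
    """
    T = lip_feats.shape[0]
    # Shift the hand stream later in time, padding the start by
    # repeating the first hand frame.
    pad = np.repeat(hand_feats[:1], delay_frames, axis=0)
    shifted_hand = np.concatenate([pad, hand_feats], axis=0)[:T]
    # Feature-level fusion: concatenate the re-aligned streams per frame.
    return np.concatenate([lip_feats, shifted_hand], axis=1)
```

With a 10-frame clip, 3-dim lip features, and 2-dim hand features, a 2-frame delay yields a fused `(10, 5)` matrix.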

Cited by 39 publications (20 citation statements) · References 29 publications
“…As compared to single-modal data, multi-modal data contains more information about a single target. Multi-modal fusion is divided into three levels [2]: pixel-level fusion, feature-level fusion, and decision-level fusion. Of the three, pixel-level fusion [3] operates at the lowest level and can retain the original information to the greatest extent.…”
Section: Introduction (mentioning, confidence: 99%)
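The excerpt above distinguishes feature-level from decision-level fusion. A minimal sketch of the two, assuming hypothetical per-frame feature matrices and per-class posterior outputs from two single-modality classifiers (the mixing weight `w` is illustrative):

```python
import numpy as np

def feature_level_fusion(feat_a, feat_b):
    """Concatenate per-frame features from two modalities before
    passing them to a single classifier."""
    return np.concatenate([feat_a, feat_b], axis=-1)

def decision_level_fusion(probs_a, probs_b, w=0.5):
    """Combine per-class posteriors from two single-modality
    classifiers by weighted averaging (w is a hypothetical weight),
    then renormalize so each row sums to one."""
    fused = w * probs_a + (1 - w) * probs_b
    return fused / fused.sum(axis=-1, keepdims=True)
```

Feature-level fusion lets the classifier model cross-modal correlations; decision-level fusion keeps the modality-specific models independent and only merges their outputs.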
“…A feature extractor based on a multi-stream convolutional neural network (CNN) processing the raw regions of interest of the hand and lips was combined with an HMM-GMM phonetic decoder. One of the major challenges in continuous ACSR is dealing with the asynchrony between the hand and the lips [7], i.e., the hand configuration pertaining to a certain phoneme can precede (or follow) the lip shape for the same phoneme by a variable delay ranging from a few milliseconds to several hundred milliseconds [8]. In [6], this issue was addressed with a simple heuristic: considering the hand configuration observed at the beginning of the previous phoneme.…”
Section: Introduction (mentioning, confidence: 99%)
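The heuristic quoted above — using the hand configuration observed at the onset of the previous phoneme, since the hand precedes the lips — can be sketched as follows, assuming frame-level hand features and a hypothetical list of phoneme onset frames:

```python
def hand_feature_for_phoneme(hand_feats, phoneme_onsets, i):
    """Heuristic as described in the excerpt: for phoneme i, look up
    the hand features at the onset of the previous phoneme.

    hand_feats: indexable sequence of per-frame hand features.
    phoneme_onsets: frame index where each phoneme starts.
    i: index of the current phoneme (the first phoneme falls back
    to its own onset, an illustrative boundary choice).
    """
    prev_onset = phoneme_onsets[max(i - 1, 0)]
    return hand_feats[prev_onset]
```

This is the simple baseline the re-synchronization procedure in the present paper improves on, by modeling the hand-lip delay explicitly instead of snapping to the previous phoneme boundary.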
“…Acoustic signal processing [1,2,3], and especially speaker verification [4,5,6], has been widely and successfully adopted in daily life. Speaker verification aims to determine whether a given utterance belongs to a specific speaker.…”
Section: Introduction (mentioning, confidence: 99%)