HIT-AVDB-II: A New Multi-view and Extreme Feature Cases Contained Audio-Visual Database for Biometrics

Lin, Xiaoxin; Yao, Hongxun; Hong, Xiaopeng; Wang, Qian

doi:10.2991/jcis.2008.61

Cited by 4 publications

(4 citation statements)

References 13 publications


“…Lip-reading datasets with people pronouncing sentences in other languages have also been created too. Examples include AV@CAR [15] and VLRF for Spanish, AVAS [16] and AVSD [20] for Arabic, BL [23] and IV2 [37] for French, UWB-05-HSAVC [57], and UWB-07-ICAV [58] for Czech, the German NDUTAVSC [49] dataset, the Russian HAVRUS [32] corpus and the HIT-AVDB-II [33] database that covers Chinese and English.…”

Section: B Word and Sentence Recognition (mentioning)

confidence: 99%

Deep Learning-Based Automated Lip-Reading: A Survey

Fenghour, Chen, Guo et al. 2021

IEEE Access


A survey on automated lip-reading approaches is presented in this paper with the main focus being on deep learning related methodologies which have proven to be more fruitful for both feature extraction and classification. This survey also provides comparisons of all the different components that make up automated lip-reading systems including the audio-visual databases, feature extraction, classification networks and classification schemas. The main contributions and unique insights of this survey are: 1) A comparison of Convolutional Neural Networks with other neural network architectures for feature extraction; 2) A critical review on the advantages of Attention-Transformers and Temporal Convolutional Networks to Recurrent Neural Networks for classification; 3) A comparison of different classification schemas used for lip-reading including ASCII characters, phonemes and visemes, and 4) A review of the most up-to-date lip-reading systems up until early 2021.


“…During the acquisition process, each speaker read each sentence three times at an even speed. Then HIT-AVDB-II [191] database was collected from Chinese poems, which contained 30 people, each reading 11 Chinese poems. IV2 [192] database was a sentence level database based on French, with 300 people participating in the recording, each speaking 15 French sentences.…”

Section: Word Phrase and Sentence Recognition (mentioning)

confidence: 99%

“…AV Digital [119] database placed the camera at three angles of 0°, 45°, and 90°. The HIT-AVDB-II [191] and LTS5 [170] collected view data at 0°, 30°, 60°, and 90°. LILiR [179] and OuluVS2 [173] collected view data at 0°, 30°, 45°, 60° and 90°.…”

Section: Multi View Databases (mentioning)

confidence: 99%

A Survey of Research on Lipreading Technology

et al. 2020


“…A list of commonly-used English language AVSR databases is given in Table I [7], [11], AND [12]) ( Speech Recognition) medium to large-vocabulary continuous speech recognition: AV-TIMIT, GRID, VidTIMIT, IBM LVCSR and AusTalk. Of these, only GRID and VidTIMIT are currently available: AV-TIMIT and IBM LVCSR have not been released, while AusTalk is not yet available though a release is planned.…”

Section: Introduction (mentioning)

confidence: 99%

TCD-TIMIT: An Audio-Visual Corpus of Continuous Speech

Harte, Gillen 2015

IEEE Trans. Multimedia


Automatic audio-visual speech recognition currently lags behind its audio-only counterpart in terms of major progress. One of the reasons commonly cited by researchers is the scarcity of suitable research corpora. This paper details the creation of a new corpus designed for continuous audio-visual speech recognition research. TCD-TIMIT consists of high-quality audio and video footage of 62 speakers reading a total of 6913 phonetically rich sentences. Three of the speakers are professionally-trained lipspeakers, recorded to test the hypothesis that lipspeakers may have an advantage over regular speakers in automatic visual speech recognition systems. Video footage was recorded from two angles: straight on, and at . The paper outlines the recording of footage, and the required post-processing to yield video and audio clips for each sentence. Audio, visual, and joint audio-visual baseline experiments are reported. Separate experiments were run on the lipspeaker and non-lipspeaker data, and the results compared. Visual and audio-visual baseline results on the non-lipspeakers were low overall. Results on the lipspeakers were found to be significantly higher. It is hoped that as a publicly available database, TCD-TIMIT will now help further state of the art in audio-visual speech recognition research. Index Terms: Audio-visual speech recognition.
