AusKidTalk: An Auditory-Visual Corpus of 3- to 12-Year-Old Australian Children’s Speech

Ahmed, Beena; Ballard, Kirrie J.; Burnham, Denis; Sirojan, Tharmakulasingam; Mehmood, Hadi; Estival, Dominique; Baker, Elise; Cox, Felicity; Arciuli, Joanne; Benders, Titia; Demuth, Katherine; Kelly, Barbara F.; Diskin-Holdaway, Chloé; Shahin, Mostafa; Sethu, Vidhyasaharan; Epps, Julien; Lee, Chwee Beng; Ambikairajah, Eliathamby

doi:10.21437/interspeech.2021-2000

Cited by 6 publications

(4 citation statements)

References 7 publications


“…A brief description of the available datasets of children's audio-visual emotion speech is presented in Table 1 and in more detail below. AusKidTalk (Australian children's speech corpus) [33]-audio and video recordings of game exercises for 750 children aged three to twelve who speak Australian English. The study participants were 700 children with typical development and 50 children with speech disorders-25 children aged 6-12 years have a diagnosis of autism spectrum disorder.…”

Section: Children's Audio-visual Speech Emotion Corpora

mentioning

confidence: 99%

A Neural Network Architecture for Children’s Audio–Visual Emotion Recognition

Matveev, Frolova et al. 2023

Mathematics


Detecting and understanding emotions are critical for our daily activities. As emotion recognition (ER) systems develop, we start looking at more difficult cases than just acted adult audio–visual speech. In this work, we investigate the automatic classification of the audio–visual emotional speech of children, which presents several challenges including the lack of publicly available annotated datasets and the low performance of the state-of-the art audio–visual ER systems. In this paper, we present a new corpus of children’s audio–visual emotional speech that we collected. Then, we propose a neural network solution that improves the utilization of the temporal relationships between audio and video modalities in the cross-modal fusion for children’s audio–visual emotion recognition. We select a state-of-the-art neural network architecture as a baseline and present several modifications focused on a deeper learning of the cross-modal temporal relationships using attention. By conducting experiments with our proposed approach and the selected baseline model, we observe a relative improvement in performance by 2%. Finally, we conclude that focusing more on the cross-modal temporal relationships may be beneficial for building ER systems for child–machine communications and environments where qualified professionals work with children.


“…The age range of the children in those databases is restricted; several are non-English speaking. The diversity of English dialects throughout the world also makes it difficult to combine numerous younger children's speech corpora from various nations or backgrounds (34). Furthermore, all of them, including these three, utilized problem-specific protocols with restricted tasks, and none of them is properly annotated.…”

Section: ECS Transactions 107 (1) 9053-9064 (2022)

mentioning

confidence: 99%

“…The goal is to bring together all of the media information associated with the recording and processing of spoken voice, making the task of speech recognition researchers easier (22). The paucity of studies on automated speech processing tools for children may be due to the difficulty of collecting and analyzing kid speech, particularly that of younger children (34). Data can be collected in two ways; first, researchers can collect by itself; second, it can also hire some speech data collection agencies.…”

Section: ECS Transactions 107 (1) 9053-9064 (2022)

mentioning

confidence: 99%

Challenges for Designing of Children Speech Corpora: A State-of-the-Art Review

Sobti, Kadyan, Guleria 2022

ECS Trans.


Automatic Speech Recognition (ASR) is the use of computer hardware and software-based techniques to identify and process human voices. These systems used data from both male and female speakers. The majority of commercial ASR systems available on adult speech are working efficiently. Speech data collected from both male and female speakers were used in these systems. In recent decades, ASR systems for children have emerged, such as reading tutors, aids for foreign language learning, and computer games. Child ASR systems are essential but poorly understood in the field of computer speech recognition. The child data collection is a very complex task. Child corpus is not available publicly, and variability of children speakers and ASR developed for a particular age group is not suitable for other age groups. These are some of the reasons for less and ineffective child ASR systems. However, the non-availability of child corpus publicly is a primary reason for ineffective child ASR. Designing and developing child corpus is a very tedious task. Therefore, the primary focus of this state-of-the-art review is to discuss various challenges encountered while designing and developing child corpus.


“…For these reasons, it is important to gather and prepare good quality children's speech data to successfully train child-friendly speech-related AI models. However, there are additional challenges in the process of collecting child speech data [43], explaining the limited number of child-speech datasets available for research purposes.…”

mentioning

confidence: 99%