on clean transcriptions, whereas ASR transcriptions contain errors that reduce the overall performance. Although the pipeline approach is widely adopted, there is a rising interest in end-to-end (E2E) SLU, which combines ASR and NLU in one model and avoids the cumulative ASR and NLU errors of the pipeline approach [2], [3]. The main motivation for applying the E2E approach is that word-by-word recognition is not needed to infer intents. On top of that, the phoneme dictionary and language model (LM) of the ASR become optional. However, E2E approaches are highly dependent on large training data sets, which are difficult to acquire; this limits their applicability to new domains where data is scarce, as is the case for smart homes.

The main contributions of this paper are: 1) the first work on E2E SLU for voice commands in a smart home environment; 2) a comparison of a state-of-the-art pipeline approach that predicts intents from the ASR hypothesis and an E2E SLU model; 3) experiments performed with realistic non-English and synthetic data to deal with the paucity of domain-specific data sets. Both approaches are positioned with respect to the state of the art in Section II and are outlined in Section III. We tackle the lack of domain-specific data by using Natural Language Generation (NLG) and text-to-speech (TTS) to generate French voice command training data. An overview of these processes and data sets is given in Sections III and IV. Section V presents the results of experiments on a corpus of real smart home voice commands, followed by a discussion, conclusion and outlook on future work.
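To make the error propagation discussed above concrete, the following minimal sketch applies a toy keyword-based intent classifier to a clean reference transcription and to an ASR hypothesis containing a single recognition error. The command, intent labels and keyword rules are purely illustrative assumptions, not taken from the systems evaluated in this paper:

```python
# Toy keyword-based NLU: maps a transcript to an intent label.
# The intent labels and keyword rules are illustrative only.
def classify_intent(transcript: str) -> str:
    t = transcript.lower()
    if "turn on" in t or "switch on" in t:
        return "light_on"
    if "turn off" in t or "switch off" in t:
        return "light_off"
    return "unknown"

clean = "turn off the kitchen light"   # reference transcription
asr_hyp = "turn of the kitchen light"  # simulated ASR error: "off" -> "of"

print(classify_intent(clean))    # -> light_off
print(classify_intent(asr_hyp))  # -> unknown: the ASR error propagates
```

A single misrecognized word is enough to flip the prediction to "unknown", which is precisely the kind of cascaded failure the E2E approach is designed to avoid.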
II. RELATED WORK

SLU is typically seen as a slot-filling task that predicts, on the one hand, the speaker's intent and, on the other hand, the entities in a spoken utterance (slots and values) [1]. The most common approach is a pipeline of an ASR and an NLU module. The ASR system outputs hypothesis transcriptions of a speech utterance, which are then analyzed by the NLU module to extract the meaning. While the slot-filling task is most often

Abstract—Voice-based interaction in a smart home has become a feature of many industrial products. These systems react to voice commands, whether it is for answering a question, providing music or turning on the lights. To be efficient, these systems must be able to extract the intent of the user from the voice command. Intent recognition from voice is typically performed through automatic speech recognition (ASR) and intent classification from the transcriptions in a pipeline. However, the errors accumulated at the ASR stage might severely impact the intent classifier. In this paper, we propose an End-to-End (E2E) model to perform intent classification directly from the raw speech input. The E2E approach is thus optimized for this specific task and avoids error propagation. Furthermore, prosodic aspects of the speech signal can be exploited by the E2E model for intent classification (e.g., question vs. imperative voice). Experiments on a corpus of voice commands acquired in a real smart home reveal t...
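The slot-filling formulation described in Section II can be illustrated with a minimal, hypothetical sketch. For a French smart-home command, an NLU module would produce an intent label plus slot/value pairs; the command, intent name, slot names and patterns below are assumptions for illustration, not the paper's actual label set:

```python
import re

# Hypothetical slot-filling sketch: extract device and room slots from a
# French smart-home command with simple patterns (illustrative only).
def parse_command(utterance: str) -> dict:
    device = re.search(r"(lumière|radio|volet)", utterance)
    room = re.search(r"(cuisine|salon|chambre)", utterance)
    intent = "set_device_on" if utterance.startswith("allume") else "unknown"
    return {
        "intent": intent,
        "slots": {
            "device": device.group(1) if device else None,
            "room": room.group(1) if room else None,
        },
    }

print(parse_command("allume la lumière de la cuisine"))
# -> {'intent': 'set_device_on', 'slots': {'device': 'lumière', 'room': 'cuisine'}}
```

In a real system the keyword patterns would be replaced by a trained sequence-labeling model, but the output structure — one intent plus a set of slot/value pairs — is the same.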