End-to-end Named Entity Recognition from English Speech

Yadav, Hemant Kumar; Ghosh, Sreyan; Chen, Yu; Shah, Rajiv Ratn

doi:10.48550/arxiv.2005.11184

Cited by 4 publications

(8 citation statements)

References 0 publications

“…On the other hand, E2E models directly optimize the task-specific objective and also have smaller inference time; but such models typically require a large amount of task-specific labeled data to perform well. This can be seen from previous papers on E2E NER (Yadav et al, 2020;, where at least 100 hours of labeled data is typically used.…”

Section: Methods (mentioning)

confidence: 99%

“…While named entity recognition in text has been studied extensively in the NLP community (Mikheev et al, 1999; Florian et al, 2003; Nadeau and Sekine, 2007; Ratinov and Roth, 2009; Ritter et al, 2011; Lample et al, 2016; Chiu and Nichols, 2016; Akbik et al, 2019; Wang et al, 2021b; Yamada et al, 2020), relatively little work has been conducted on extracting named entities from speech (Kim and Woodland, 2000; Sudoh et al, 2006; Parada et al, 2011; Caubrière et al, 2020; Yadav et al, 2020; Shon et al, 2021). Recognizing named entities from speech is a more challenging task which is commonly done through a pipeline approach: combining an automatic speech recognition (ASR) system with a text-based NER model (Sudoh et al, 2006; Raymond, 2013; Jannet et al, 2015).…”

Section: Spoken Named Entity Recognition (mentioning)

confidence: 99%

“…Recognizing named entities from speech is a more challenging task which is commonly done through a pipeline approach: combining an automatic speech recognition (ASR) system with a text-based NER model (Sudoh et al, 2006; Raymond, 2013; Jannet et al, 2015). There is rising interest in end-to-end (E2E) approaches in the speech community and several E2E speech NER models have been introduced Caubrière et al, 2020; Yadav et al, 2020; Shon et al, 2021).…”

Section: Spoken Named Entity Recognition (mentioning)

confidence: 99%

“…This E2E method outperforms their pipeline baseline, and pre-training the model on ASR improves the final NER performance. Yadav et al (2020) introduce an English speech NER dataset and propose an E2E approach based on DeepSpeech2 model and CTC objective (similar to ) combined with language model (LM) fusion. They show that LM fusion significantly improves the performance of the E2E approach, outperforming a pipeline baseline when trained on 150 hrs of labelled audio.…”

Section: Spoken Named Entity Recognition (mentioning)

confidence: 99%

“…While Caubrière et al, 2020; Yadav et al, 2020) have shown that E2E models can outperform pipeline approaches in a fully supervised setting, they do not account for improvements in both speech and NLP from self-supervised pre-training and semi-supervised approaches. Shon et al (2021) have introduced new speech NER annotations for the public VoxPopuli corpus (Wang et al, 2021a) and show that E2E models still do not rival pipeline approaches when state-of-the-art pre-trained models such as DeBERTa (He et al, 2020) and wav2vec 2.0 (Baevski et al, 2020) are used.…”

Section: Spoken Named Entity Recognition (mentioning)

confidence: 99%

On the Use of External Data for Spoken Named Entity Recognition

Pasad, Wu, Shon et al. 2021

Preprint

Spoken language understanding (SLU) tasks involve mapping from speech audio signals to semantic labels. Given the complexity of such tasks, good performance might be expected to require large labeled datasets, which are difficult to collect for each new task and domain. However, recent advances in self-supervised speech representations have made it feasible to consider learning SLU models with limited labeled data. In this work we focus on low-resource spoken named entity recognition (NER) and address the question: Beyond self-supervised pre-training, how can we use external speech and/or text data that are not annotated for the task? We draw on a variety of approaches, including self-training, knowledge distillation, and transfer learning, and consider their applicability to both end-to-end models and pipeline (speech recognition followed by text NER model) approaches. We find that several of these approaches improve performance in resource-constrained settings beyond the benefits from pre-trained representations alone. Compared to prior work, we find improved F1 scores of up to 16%. While the best baseline model is a pipeline approach, the best performance when using external data is ultimately achieved by an end-to-end model. We provide detailed comparisons and analyses, showing for example that end-to-end models are able to focus on the more NER-specific words.
