Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021
DOI: 10.18653/v1/2021.acl-long.218

A Large-Scale Chinese Multimodal NER Dataset with Speech Clues

Abstract: In this paper, we aim to explore an uncharted territory: Chinese multimodal named entity recognition (NER) with both textual and acoustic content. To this end, we construct a large-scale, human-annotated Chinese multimodal NER dataset named CNERTA. Our corpus contains a total of 42,987 annotated sentences accompanied by 71 hours of speech data. Based on this dataset, we propose a family of strong and representative baseline models, which can leverage either textual features or multimodal features. Upon th…
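The dataset described above pairs sentence-level NER annotations with speech. As a rough illustration of what one record might look like, here is a minimal Python sketch assuming a hypothetical JSON-lines layout with character tokens, BIO tags, and an audio path; the actual CNERTA release format may differ.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class CnertaExample:
    """One annotated sentence paired with its speech clip (hypothetical layout)."""
    tokens: List[str]      # Chinese characters of the transcript
    bio_tags: List[str]    # one BIO tag per token, e.g. "B-PER", "I-PER", "O"
    audio_path: str        # path to the corresponding speech segment


def load_examples(jsonl_path: str) -> List[CnertaExample]:
    """Read JSON-lines records of the assumed form
    {"tokens": [...], "bio_tags": [...], "audio_path": "..."}."""
    examples = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            examples.append(
                CnertaExample(rec["tokens"], rec["bio_tags"], rec["audio_path"])
            )
    return examples
```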

Cited by 17 publications (6 citation statements)
References: 45 publications
“…Since people usually pause between words when speaking, the speech modality is used as an auxiliary modality to help the model identify entity boundaries in the text. Sui et al. [24] used Mel filter-bank features down-sampled by a convolutional neural network (CNN) as the speech representation and fused them with the text representation obtained from BERT for entity recognition. (2) Text + font structure.…”
Section: MNER (mentioning)
confidence: 99%
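To make the fusion strategy in this excerpt concrete, the sketch below shows Mel filter-bank frames down-sampled by a small CNN and concatenated with BERT token states before a tagging layer. The module names, dimensions, and pooling/concatenation choices are illustrative assumptions, not the exact architecture of Sui et al. [24].

```python
import torch
import torch.nn as nn
from transformers import BertModel


class FbankDownsampler(nn.Module):
    """Down-sample Mel filter-bank frames with strided 1-D convolutions (illustrative)."""
    def __init__(self, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )

    def forward(self, fbank):                      # fbank: (batch, frames, n_mels)
        x = self.conv(fbank.transpose(1, 2))       # -> (batch, hidden, frames // 4)
        return x.mean(dim=2)                       # pool to one utterance-level vector


class TextSpeechTagger(nn.Module):
    """Concatenate BERT token states with a speech vector, then predict per-token labels."""
    def __init__(self, num_labels: int, bert_name: str = "bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        self.speech = FbankDownsampler()
        self.classifier = nn.Linear(self.bert.config.hidden_size + 256, num_labels)

    def forward(self, input_ids, attention_mask, fbank):
        tokens = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        speech_vec = self.speech(fbank)                                  # (batch, 256)
        speech_tok = speech_vec.unsqueeze(1).expand(-1, tokens.size(1), -1)
        return self.classifier(torch.cat([tokens, speech_tok], dim=-1))  # per-token logits
```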
“…Some multimodal datasets also target unique properties of human language, such as humor [17,18], metaphor [60], and sarcasm [7,9]. Moreover, multimodal datasets have been designed for a range of other NLP tasks, such as dialogue act classification [39,40], named entity recognition [43], comprehension and reasoning [49,53], comment generation [48], fake news detection [33], etc. Nevertheless, there is a lack of multimodal datasets for intent analysis in real-world dialogue scenes.…”
Section: Related Work, 6.1 Multimodal Language Datasets (mentioning)
confidence: 99%
“…However, entity recognition in the field of aviation manufacturing and assembly mainly concerns the recognition of key features such as the algorithms, parts, parameters, materials, functions, and structures involved in web pages, documents, patents, technical reports, etc., which appear in both domestic and foreign sources. Sui et al. [13] proposed a multimodal multitask algorithm based on their own labeled dataset to explore a multimodal named entity recognition (NER) approach for Chinese textual and auditory content by introducing a speech-to-text alignment auxiliary task. Zhang et al. [14] proposed a machine reading comprehension framework that integrates adaptive positive-unlabeled techniques into NER and experimentally demonstrated that the framework is effective for datasets containing a large number of unlabeled entities.…”
Section: Introduction (mentioning)
confidence: 99%
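The speech-to-text alignment assistance task mentioned here is an auxiliary objective trained jointly with NER. Below is a hedged sketch of one way such a multitask loss could be combined, assuming the alignment head predicts a start frame per token and the two losses are mixed with a simple weight; the weight and head design are assumptions, not the published formulation.

```python
import torch.nn as nn


class MultitaskLoss(nn.Module):
    """Combine a main NER tagging loss with an auxiliary speech-text alignment loss."""
    def __init__(self, aux_weight: float = 0.5):
        super().__init__()
        self.ner_loss = nn.CrossEntropyLoss(ignore_index=-100)
        self.align_loss = nn.CrossEntropyLoss(ignore_index=-100)
        self.aux_weight = aux_weight   # assumed trade-off weight between the two tasks

    def forward(self, ner_logits, ner_labels, align_logits, align_targets):
        # ner_logits: (batch, seq_len, num_labels), ner_labels: (batch, seq_len)
        main = self.ner_loss(ner_logits.flatten(0, 1), ner_labels.flatten())
        # align_logits: (batch, seq_len, n_frames); the alignment head predicts,
        # for each token, the index of the speech frame where it starts.
        aux = self.align_loss(align_logits.flatten(0, 1), align_targets.flatten())
        return main + self.aux_weight * aux
```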