CRF Models for Tamil Part of Speech Tagging and Chunking

Pandian, S. Lakshmana; Geetha, T. V.

doi:10.1007/978-3-642-00831-3_2

Cited by 21 publications

(3 citation statements)

References 6 publications


“…ROLE was a frequent scene element in comparison with STATIC/ANIMATED-OBJECT or other scene elements in those stories. The CRF model can handle this imbalanced data and learn the elements with a small number of samples, as in NER (Finkel, Grenager, and Manning 2005) and POS tagging problems (Pandian and Geetha 2009). The increase in the average accuracy of the mapping task in the third attempt (sequential modeling), that is, 85.7% compared with the second attempt (non-sequential modeling), that is, 76.58%, confirmed the ability of CRF to model sequential and imbalanced data.…”

Section: Conditional Random Fields (mentioning)

confidence: 99%

Recognition of visual scene elements from a story text in Persian natural language

2022


Text-to-scene conversion systems map natural language text to formal representations required for visual scenes. The difficulty involved in this mapping is one of the most critical challenges for developing these systems. The current study mapped Persian natural language text as the headmost system to a conceptual scene model. This conceptual scene model is an intermediate semantic representation between natural language and the visual scene and contains descriptions of visual elements of the scene. It will be used to produce meaningful animation based on an input story in this ongoing study. The mapping task was modeled as a sequential labeling problem, and a conditional random field (CRF) model was trained and tested for sequential labeling of scene model elements. To the best of the authors’ knowledge, no dataset for this task exists; thus, the required dataset was collected for this task. The lack of required off-the-shelf natural language processing modules and a significant error rate in the available corpora were important challenges to dataset collection. Some features of the dataset were manually annotated. The results were evaluated using standard text classification metrics, and an average accuracy of 85.7% was obtained, which is satisfactory.


“…However, this is difficult to use on real data due to the complexity of natural languages. Some works are based on linear statistic models, such as Conditional Random Fields (CRF) [13] and Hidden Markov [14]. These statistic models perform relatively well on the corpora tagged with a coarse-grained tagset, but they do not perform as well as the Bi-LSTM on the corpora tagged with a fine-grained tagset [15].…”

Section: POS Tagging (mentioning)

confidence: 99%

Part-of-Speech Tagging with Rule-Based Data Preprocessing and Transformer

Mao

Wang

2021

Electronics


Part-of-Speech (POS) tagging is one of the most important tasks in the field of natural language processing (NLP). POS tagging for a word depends not only on the word itself but also on its position, its surrounding words, and their POS tags. POS tagging can be an upstream task for other NLP tasks, further improving their performance. Therefore, it is important to improve the accuracy of POS tagging. In POS tagging, bidirectional Long Short-Term Memory (Bi-LSTM) is commonly used and achieves good performance. However, Bi-LSTM is not as powerful as Transformer in leveraging contextual information, since Bi-LSTM simply concatenates the contextual information from left-to-right and right-to-left. In this study, we propose a novel approach for POS tagging to improve the accuracy. For each token, all possible POS tags are obtained without considering context, and then rules are applied to prune out these possible POS tags, which we call rule-based data preprocessing. In this way, the number of possible POS tags of most tokens can be reduced to one, and they are considered to be correctly tagged. Finally, POS tags of the remaining tokens are masked, and a model based on Transformer is used to only predict the masked POS tags, which enables it to leverage bidirectional contexts. Our experimental result shows that our approach leads to better performance than other methods using Bi-LSTM.
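The rule-based data preprocessing step described in this abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the cited paper: the toy `LEXICON`, the single rule `no_verb_after_determiner`, the `[MASK]` placeholder string, and the function names are all hypothetical. The sketch only shows the general idea: look up every context-free candidate tag per token, prune candidates with rules, keep tokens that end up with exactly one tag, and mask the rest for a downstream model to predict.

```python
# Hypothetical toy lexicon: every tag a word form can take, ignoring context.
LEXICON = {
    "the": {"DET"},
    "can": {"AUX", "NOUN", "VERB"},
    "rusts": {"NOUN", "VERB"},
}

MASK = "[MASK]"  # placeholder for tags left to a downstream model (assumed name)


def no_verb_after_determiner(tokens, i, candidates):
    """Illustrative rule: a word right after an unambiguous determiner
    is not tagged as an auxiliary or a verb."""
    if i > 0 and LEXICON.get(tokens[i - 1]) == {"DET"}:
        pruned = candidates - {"AUX", "VERB"}
        return pruned or candidates  # never prune down to an empty set
    return candidates


def preprocess(tokens, rules):
    """Return one tag per token: a fixed tag if the rules leave exactly
    one candidate, otherwise MASK (ambiguous or unknown word)."""
    tags = []
    for i, token in enumerate(tokens):
        candidates = set(LEXICON.get(token, set()))
        for rule in rules:
            candidates = rule(tokens, i, candidates)
        tags.append(next(iter(candidates)) if len(candidates) == 1 else MASK)
    return tags


result = preprocess(["the", "can", "rusts"], [no_verb_after_determiner])
print(result)
```

In this toy run, "the" and "can" are resolved by the lexicon and the rule, while "rusts" stays ambiguous and is masked. In the approach the abstract describes, only such masked positions would be passed to the Transformer-based model.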


“…Recently among Asian languages, several supervised learning techniques with acceptable performance have been proposed. For PoS tagging, Pandian and Geetha [1] utilized conditional random fields (CRFs), a probabilistic model, to segment and label sequence data, to tag and chunk PoS in Tamil. Huang et al [2] showed that a bigram PoS tagger using latent annotations could achieve the accuracy of 94.78% when testing on a set of the Penn Chinese Treebank 6.0.…”

Section: Introduction (mentioning)

confidence: 99%

Multi-Stage Automatic NE and PoS Annotation Using Pattern-Based and Statistical-Based Techniques for Thai Corpus Construction

Tongtep

Theeramunkong

2013

IEICE Trans. Inf. & Syst.


Nattapong TONGTEP (Student Member) and Thanaruk THEERAMUNKONG (Member). SUMMARY: Automated or semi-automated annotation is a practical solution for large-scale corpus construction. However, the special characteristics of Thai language, such as lack of word-boundary and sentence-boundary markers, trigger several issues in automatic corpus annotation. This paper presents a multi-stage annotation framework, containing two stages of chunking and three stages of tagging. The two chunking stages are pattern matching-based named entity (NE) extraction and dictionary-based word segmentation while the three succeeding tagging stages are dictionary-, pattern- and statistical-based tagging. Applying heuristics of ambiguity priority, NE extraction is performed first on an original text using a set of patterns, in the order of pattern ambiguity. Next, the remaining text is segmented into words with a dictionary. The obtained chunks are then tagged with types of named entities or parts-of-speech (PoS) using dictionaries, patterns and statistics. Focusing on the reduction of human intervention in corpus construction, our experimental results show that the dictionary-based tagging process can assign unique tags to 64.92% of the words, with the remaining of 24.14% unknown words and 10.94% ambiguously tagged words. Later, the pattern-based tagging can reduce unknown words to only 13.34% while the statistical-based tagging can solve the ambiguously tagged words to only 3.01%.

