Gut, Besser, Chunker – Selecting the Best Models for Text Chunking with Voting

Indig, Balázs; Endrédy, István

doi:10.1007/978-3-319-75477-2_29

Cited by 2 publications

(6 citation statements)

References 8 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…The lexicalization invented by Molina and Pla [4] is thoroughly investigated by Indig and Endrédy [2] and they present a mildly lexicalized variant (see Table 2) of the method that has superior performance.…”

Section: Lexicalizationmentioning

confidence: 99%

“…The latter problem can be overcome by using a proper IOB converter which can also fix well-formedness issues (see Section 2 for details). The main motivation of the approach of Indig and Endrédy [2] was to reduce the number of labels because for agglutinative languages the original method is not feasible due to the high number of tags even without any lexicalization. But we also remark that this problem also exists with low thresholds used during lexicalization.…”

Section: Lexicalizationmentioning

confidence: 99%

“…The CoNLL-2000 dataset [7] is the de facto standard dataset for measuring tagger performance in chunking for English. The current state-of-the-art method Gut, Besser, Chunker [2] achieves the F-score of 95.06% (96.49% for NPs).…”

Section: Introductionmentioning

confidence: 96%

“…Indig and Endrédy questioned the methodology behind SS05 and compared different taggers on different levels of lexicalization with different voting schemes [2]. The authors could not reproduce the previous results due to the many programming errors discovered in the original code.…”

Section: Introductionmentioning

confidence: 99%

“…We evaluate all five previous representations, even though it was shown that IOBES has superior performance to the other four [2].…”

Section: Introductionmentioning

confidence: 99%

See 4 more Smart Citations

Less is More, More or Less... Finding the Optimal Threshold for Lexicalization in Chunking

Indig

2018

CyS

View full text Add to dashboard Cite

Abstract. Lexicalization of the input of sequential taggers has gone a long way since it was invented by Molina and Pla [4]. In this paper we thoroughly investigate the method introduced by Indig and Endrédy [2] to find out the best lexicalization level for chunking and to explore the behavior of different IOB representations. Both tasks are applied to the CoNLL-2000 dataset. Our goal is to introduce a transformation method to accommodate the parameters of the development set to the training set using their frequency distributions which other tasks like POS tagging or NER could benefit too.

show abstract

Section: Lexicalizationmentioning

confidence: 99%

Section: Lexicalizationmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 96%

Section: Introductionmentioning

confidence: 99%

“…We evaluate all five previous representations, even though it was shown that IOBES has superior performance to the other four [2].…”

Section: Introductionmentioning

confidence: 99%

See 3 more Smart Citations

Less is More, More or Less... Finding the Optimal Threshold for Lexicalization in Chunking

Indig

2018

CyS

View full text Add to dashboard Cite

show abstract

Unsupervised acquisition of idiomatic units of symbolic natural language: An n-gram frequency-based approach for the chunking of news articles and tweets

2020

View full text Add to dashboard Cite

Symbolic sequential data are produced in huge quantities in numerous contexts, such as text and speech data, biometrics, genomics, financial market indexes, music sheets, and online social media posts. In this paper, an unsupervised approach for the chunking of idiomatic units of sequential text data is presented. Text chunking refers to the task of splitting a string of textual information into non-overlapping groups of related units. This is a fundamental problem in numerous fields where understanding the relation between raw units of symbolic sequential data is relevant. Existing methods are based primarily on supervised and semi-supervised learning approaches; however, in this study, a novel unsupervised approach is proposed based on the existing concept of n-grams, which requires no labeled text as an input. The proposed methodology is applied to two natural language corpora: a Wall Street Journal corpus and a Twitter corpus. In both cases, the corpus length was increased gradually to measure the accuracy with a different number of unitary elements as inputs. Both corpora reveal improvements in accuracy proportional with increases in the number of tokens. For the Twitter corpus, the increase in accuracy follows a linear trend. The results show that the proposed methodology can achieve a higher accuracy with incremental usage. A future study will aim at designing an iterative system for the proposed methodology.

show abstract

Gut, Besser, Chunker – Selecting the Best Models for Text Chunking with Voting

Cited by 2 publications

References 8 publications

Less is More, More or Less... Finding the Optimal Threshold for Lexicalization in Chunking

Less is More, More or Less... Finding the Optimal Threshold for Lexicalization in Chunking

Unsupervised acquisition of idiomatic units of symbolic natural language: An n-gram frequency-based approach for the chunking of news articles and tweets

Contact Info

Product

Resources

About