Part-of-Speech Tagging using Conditional Random Fields: Exploiting Sub-Label Dependencies for Improved Accuracy

Silfverberg, Miikka; Ruokolainen, Teemu; Lindén, Krister; Kurimo, Mikko

doi:10.3115/v1/p14-2043

Cited by 20 publications

(11 citation statements)

References 9 publications

“…CRF has been used for many different tasks, especially dealing with sequence labeling such as POS tagging (Lafferty et al, 2001a; Silfverberg et al, 2014) and named entity recognition (McCallum and Li, 2003; Settles, 2004). Similar to us, three out of seven participating teams also used CRF for codeswitching detection for the EMNLP 2014 language identification shared task (Solorio et al, 2014).…”

Section: Related Work

mentioning

confidence: 89%

Codeswitching Detection via Lexical Features in Conditional Random Fields

Shrestha

2016

Proceedings of the Second Workshop on Computational Approaches to Code Switching


A description of a system for identifying Verbal Multi-Word Expressions (VMWEs) in running text is presented. The system mainly exploits universal syntactic dependency features through a Conditional Random Fields (CRF) sequence model. The system competed in the Closed Track at the PARSEME VMWE Shared Task 2017, ranking 2nd place in most languages on full VMWE-based evaluation and 1st in three languages on token-based evaluation. In addition, this paper presents an option to re-rank the 10 best CRF-predicted sequences via semantic vectors, boosting its scores above other systems in the competition. We also show that all systems in the competition would struggle to beat a simple lookup baseline system and argue for a more purpose-specific evaluation scheme.

Section: Introduction

mentioning

confidence: 90%

“…Maharjan et al (2015) collected codeswitched tweets for Spanish-English and Nepali-English language pairs. They first figured out some seed users who codeswitched frequently and then followed him/her to collect more codeswitched tweets. They obtained an accuracy of 86% and 87% for Spanish-English and Nepali-English dataset using CRF GE algorithm. CRF has been used for many different tasks, especially dealing with sequence labeling such as POS tagging (Lafferty et al, 2001a; Silfverberg et al, 2014) and named entity recognition (McCallum and Li, 2003; Settles, 2004). Similar to us, three out of seven participating teams also used CRF for codeswitching detection for the EMNLP 2014 language identification shared task.…”

Proceedings of the Second Workshop on Computational Approaches to Code Switching

2016

Introduction

Code-switching (CS) is the phenomenon by which multilingual speakers switch back and forth between their common languages in written or spoken communication. CS is pervasive in informal text communications such as news groups, tweets, blogs, and other social media of multilingual communities. Such genres are increasingly being studied as rich sources of social, commercial and political information. Apart from the informal genre challenge associated with such data within a single language processing scenario, the CS phenomenon adds another significant layer of complexity to the processing of the data. Efficiently and robustly processing CS data presents a new frontier for our NLP algorithms on all levels. The goal of this workshop is to bring together researchers interested in exploring these new frontiers, discussing state of the art research in CS, and identifying the next steps in this fascinating research area.

The workshop program includes exciting papers discussing new approaches for CS data and the development of linguistic resources needed to process and study CS. We received a total of 12 regular workshop submissions, of which we accepted nine for publication, four of them as workshop talks and five as posters. The accepted workshop submissions cover a wide variety of language combinations from languages such as English, Hindi, Swahili, Mandarin, Dialectal Arabic and Modern Standard Arabic. The majority of the papers focus on social media data such as Twitter, and discussion fora.

Another component of the workshop is the Second Shared Task on Language Identification of CS Data. The shared task focused on social media and included two language pairs: Modern Standard Arabic-Dialectal Arabic and English-Spanish. We received a total of 14 system runs from nine different teams. All teams except one submitted a shared task paper describing their system. All shared task systems will be presented during the workshop poster session and two of them will also present a talk.

We would like to thank all authors who submitted their contributions to this workshop and all shared task participants for taking on the challenge of language identification in code switched data. We also thank the program committee members for their help in providing meaningful reviews. Lastly, we thank the EMNLP 2016 organizers for the opportunity to put together this workshop.

Abstract

This paper addresses challenges of Natural Language Processing (NLP) on non-canonical multilingual data in which two or more languages are mixed. It refers to code-switching, which has become more popular in our daily life and therefore obtains an increasing amount of attention from the research community. We report our experience that covers not only core NLP tasks such as normalisation, language identification, language modelling, part-of-speech tagging and dependency parsing but also more downstream ones such as machine translation and automatic speech recognition. We highlight and discuss the key problems for each of the tasks with supporting...

“…A typical example of such a structured morphological label is the label Noun|Sg|Nom, which consists of three sub-units: the main word class Noun, the singular number Sg and the nominative case Nom. FinnPos utilizes the internal structure of complex labels by extracting features for sub-units as well as for the entire labels [19]. This alleviates the data sparsity problem because features relating to sub-units of entire tags are used as fall-back.…”

Section: FinnPos for Morphologically Rich Languages

mentioning

confidence: 99%
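
The sub-label fall-back described in the FinnPos statement above can be illustrated with a small Python sketch. It is not the FinnPos implementation: the function names (`sub_labels`, `label_features`, `count_features`), the feature strings, and the example labels other than Noun|Sg|Nom are hypothetical and chosen only for illustration. The sketch pairs each observation feature with the full label and with every sub-unit of that label, so that a rare compound label still shares evidence with more frequent labels through its sub-units.

```python
from collections import Counter


def sub_labels(label, sep="|"):
    """Split a compound label such as 'Noun|Sg|Nom' into its sub-units."""
    return label.split(sep)


def label_features(word_features, label, sep="|"):
    """Pair every observation feature with the full label and each sub-unit.

    The full label comes first, followed by its sub-units in order.
    A label without a separator yields only the full-label pairing.
    """
    targets = [label] + [s for s in sub_labels(label, sep) if s != label]
    return [(feat, target) for feat in word_features for target in targets]


def count_features(tagged_words):
    """Count (observation feature, label-or-sub-label) pairs over a corpus.

    `tagged_words` is an iterable of (word_features, label) tuples.
    """
    counts = Counter()
    for feats, label in tagged_words:
        counts.update(label_features(feats, label))
    return counts


if __name__ == "__main__":
    # Hypothetical toy data: one observation feature, two compound labels.
    data = [
        (["suffix=t"], "Noun|Pl|Nom"),
        (["suffix=t"], "Noun|Pl|Gen"),
    ]
    for pair, n in sorted(count_features(data).items()):
        print(pair, n)
```

In this toy data each full label is seen only once, but the pairings with the sub-units Noun and Pl are each seen twice. That shared count is the kind of fall-back evidence the quoted passage refers to when it says sub-unit features alleviate data sparsity.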