Liner2 — a Generic Framework for Named Entity Recognition

Marcińczuk, Michał; Kocoń, Jan; Oleksy, Marcin

doi:10.18653/v1/w17-1413

Cited by 13 publications

(10 citation statements)

References 6 publications

“…The baseline models use a set of features used for named entity recognition for Polish Marcińczuk and Kocoń (2013); Marcińczuk et al (2013). We added new features (described in Section 6.1) to a baseline set.…”

Section: Discussion (mentioning)

confidence: 99%

“…Our approach is based on Liner2 tool, which uses CRF++ toolkit. This tool was successfully used in other Natural Language Engineering tasks, mainly in Named Entities Recognition Marcińczuk and Kocoń (2013); Marcińczuk et al (2013). We described our first approach to recognise timexes using this tool Kocoń and Marcińczuk (2015) and this work extends that research.…”

Section: Recognition (mentioning)

confidence: 99%
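
The statement above says that Liner2 relies on the CRF++ toolkit. As a rough illustration of what such a toolkit consumes, the sketch below writes sentences in the column layout CRF++ reads: one token per line, whitespace-separated feature columns, the label in the last column, and a blank line between sentences. The function name `to_crfpp`, the choice of columns and the BIO labels are assumptions made for this example; they are not taken from Liner2.

```python
def to_crfpp(sentences):
    """Serialise sentences into CRF++-style column text.

    `sentences` is a list of sentences; each sentence is a list of
    tuples of strings whose last element is the label.
    (Hypothetical helper, not part of Liner2 or CRF++.)
    """
    blocks = []
    for sentence in sentences:
        # One token per line, columns separated by a tab.
        rows = ["\t".join(columns) for columns in sentence]
        blocks.append("\n".join(rows))
    # Sentences are separated by an empty line.
    return "\n\n".join(blocks) + "\n"


# Illustrative data: (word, grammatical class, label).
example = [
    [("Jan", "subst", "B-PER"), ("Kocoń", "subst", "I-PER")],
    [("Wrocław", "subst", "B-LOC")],
]
print(to_crfpp(example), end="")
```

Every row must have the same number of columns, so a real pipeline would validate that before writing the file.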

“…We used the following features as a baseline. These features were used in Named Entities Recognition task during our previous research Marcińczuk and Kocoń (2013); Marcińczuk et al (2013). Morphosyntactic – lemma, grammatical class, case, number, gender. Orthographic – word, word shape (pattern), prefix, suffix, starts with upper case, starts with lower case, starts with symbol, starts with digit, has upper case, has symbol, has digit. Semantic – word synonym, hypernym. Dictionary – person first name, person last name, country name, city name, road name, person prefix, country prefix, person noun, person suffix, road prefix, specific triggers (country, district, geographic name, organisation name, person name, region, settlement). …”

Section: Recognition (mentioning)

confidence: 99%
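
The orthographic features listed in the statement above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the Liner2 implementation: the function names, the affix length of 3, the collapsed word-shape alphabet (`X`, `x`, `d`, `-`) and the definition of "symbol" as any non-alphanumeric, non-space character are all choices made for this example, since the source does not define them.

```python
def is_symbol(ch):
    # Assumption: a "symbol" is anything that is neither alphanumeric nor whitespace.
    return not ch.isalnum() and not ch.isspace()


def word_shape(token):
    """Collapsed shape pattern: runs of the same character class become one mark."""
    shape = []
    for ch in token:
        if ch.isupper():
            mark = "X"
        elif ch.islower():
            mark = "x"
        elif ch.isdigit():
            mark = "d"
        else:
            mark = "-"
        if not shape or shape[-1] != mark:
            shape.append(mark)
    return "".join(shape)


def orthographic_features(token, affix_len=3):
    """Return the orthographic features named in the quoted list for one token."""
    first = token[:1]  # empty string for an empty token, so the checks below stay safe
    return {
        "word": token,
        "shape": word_shape(token),
        "prefix": token[:affix_len],
        "suffix": token[-affix_len:],
        "starts_upper": first.isupper(),
        "starts_lower": first.islower(),
        "starts_symbol": bool(first) and is_symbol(first),
        "starts_digit": first.isdigit(),
        "has_upper": any(ch.isupper() for ch in token),
        "has_symbol": any(is_symbol(ch) for ch in token),
        "has_digit": any(ch.isdigit() for ch in token),
    }


print(orthographic_features("Wrocław-2016"))
```

The morphosyntactic, semantic and dictionary features in the same list need external resources (a tagger, a wordnet, gazetteers) and are left out of this sketch.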

Supervised approach to recognise Polish temporal expressions and rule-based interpretation of timexes

Kocoń, Marcińczuk

2016

Nat. Lang. Eng.

A key challenge of Information Extraction in Natural Language Processing is the ability to recognise and classify temporal expressions (timexes). They are a crucial source of information about when something happens, how often something occurs or how long something lasts. Timexes extracted automatically from text play a major role in many Information Extraction systems, such as question answering or event recognition. We prepared a broad specification of Polish timexes – PLIMEX. It is based on the state-of-the-art annotation guidelines for English, mainly TIMEX2 and TIMEX3 (a part of TimeML – Markup Language for Temporal and Event Expressions). We have extended our specification with a description of the local meaning of timexes, based on LTIMEX annotation guidelines for English. Temporal description supports further event identification and extends the event description model, focussing on anchoring events in time, event ordering and reasoning about the persistence of events. We prepared the specification, which is designed to address these issues, and we annotated all documents in the Polish Corpus of Wrocław University of Technology (KPWr) using our annotation guidelines. We also adapted our Liner2 machine learning system to recognise Polish timexes and we propose a two-phase method to select a subset of features for the Conditional Random Fields sequence labelling method. This article presents the whole process of corpus annotation, evaluation of inter-annotator agreement, extending the Liner2 system with new features and evaluation of the recognition models before and after feature selection, with an analysis of the statistical significance of differences. Liner2 with the presented models is available as open source software under the GNU General Public License.

“…The task was somewhat different from the 2019 task in that training data was not provided to participants. Approaches submitted to this task included a model based on parallel projection and a model with language-specific features trained on found data (Marcińczuk et al, 2017). There has also been follow-up work on this dataset using cross-lingual embeddings (Sharoff, 2018).…”

Section: Related Work (mentioning)

confidence: 99%

Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing

2019

The paper presents an unsupervised method for quickly extending a Ukrainian lexicon by generating paradigms and morphological feature structures for new proper names and neologisms, which are not covered by existing static morphological resources. This approach addresses a practical problem of modelling paradigms for entities created by the dynamic processes in the lexicon: this problem is especially serious for highly-inflected languages in domains with specialised or quickly changing lexicon. The method uses an unannotated Ukrainian corpus and a small fixed set of inflection tables, which can be found in traditional grammar textbooks. The advantage of the proposed approach is that updating the morphological lexicon does not require training or linguistic annotation, allowing fast knowledge-light extension of an existing static lexicon to improve morphological coverage on a specific corpus. The method is implemented in an open-source package on a GitHub repository. It can be applied to other low-resourced inflectional languages which have internet corpora and linguistic descriptions of their inflection system, following the example of inflection tables for Ukrainian. Evaluation results show consistent improvements in coverage for Ukrainian corpora of different corpus types.

“…The details of tuning Liner2 to tackle the shared task are described in (Marcińczuk et al, 2017). The team (code "pw") attempted only the Polish-language Challenge.…”

Section: Participant Systems (mentioning)

confidence: 99%

Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing

2017

The Workshops have been convening for over a decade, with a clear vision and purpose. On one hand, the languages from the Balto-Slavic group play an important role due to their widespread use and diverse cultural heritage. These languages are spoken by about one third of all speakers of the official languages of the European Union, and by over 400 million speakers worldwide. The political and economic developments in Central and Eastern Europe place societies where Balto-Slavic languages are spoken at the center of rapid technological advancement and the growing European consumer markets.

On the other hand, research on theoretical and applied NLP in some of these languages still lags behind the "major" languages, such as English and other West European languages. In comparison to English, which has dominated the digital world since the advent of the Internet, many of these languages still lack resources, processing tools and applications, especially those with smaller speaker bases.

The Balto-Slavic languages pose a wealth of fascinating scientific challenges. The linguistic phenomena specific to the Balto-Slavic languages, complex morphology and free word order, present non-trivial problems for construction of NLP tools, and require rich morphological and syntactic resources. This view is also reflected in Serge Sharoff's invited talk on "Pan-Slavic NLP." In the talk, he discusses an ambitious project on language adaptation: ways to adapt tools and resources among closely related languages, such as those in the Slavic group.

The BSNLP Workshops aim to bring together academic researchers and industry specialists in NLP for Balto-Slavic languages. We aim to stimulate research and to foster the creation and dissemination of tools and resources. The Workshop serves as a forum for exchange of ideas and experience and for discussing shared problems. One fascinating aspect of this group of languages is their structural similarity, as well as an easily recognizable lexical and inflectional inventory spanning the entire group, which, despite the lack of mutual intelligibility, creates a special environment in which researchers can fully appreciate the shared problems and solutions.

As a result of discussions at the previous BSNLP Workshops, to help catalyze collaboration, this year we have organized the first SIGSLAV Challenge: a shared task on multilingual named entity recognition. We have built a dataset, which allows systems to be evaluated on recognizing mentions of named entities in Web documents, their normalization/lemmatization, and cross-lingual matching. The Challenge initially covers seven Slavic languages, and it is intended as a first version of an evaluation standard to be expanded in the future.

We received 24 regular submissions, 14 of which were accepted for presentation.

The papers cover a wide range of topics. Two papers relate to lexical semantics, four to development of linguistic resources, and four to information filtering, information retrieval, and information extraction. Four papers cover topics related to...

Liner2 — a Generic Framework for Named Entity Recognition

Abstract: In this paper we present an adaptation of the Liner2 framework to the BSNLP 2017 shared task on multilingual named entity recognition. The tool is tuned to recognize and lemmatize named entities for Polish.
