This article describes the first steps towards an open-source dependency treebank for Erzya based on Universal Dependencies (UD) annotation standards. The treebank contains 610 sentences with 6661 tokens and is based on texts from a range of open-source and public-domain original Erzya sources, which ensures its free availability and extensibility. Texts in the treebank are first morphologically analyzed and disambiguated automatically; the remaining ambiguity is then resolved by hand and the dependency structure is annotated manually. In the article we present some issues in dependency syntax for Erzya and show how they are analyzed in the UD framework. Preliminary statistics are given for dependency parsing of Erzya, along with points of interest for future research.
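The sentence and token counts quoted in the abstract above are the kind of figures that can be recomputed directly from a treebank file. The following is a minimal sketch, assuming the treebank is stored in the CoNLL-U format commonly used for UD treebanks (the abstract itself does not name a file format). The function name and the toy sample are hypothetical, and the sample uses placeholder word forms rather than real Erzya data.

```python
def count_sentences_and_tokens(conllu_text):
    """Count sentences and syntactic tokens in CoNLL-U formatted text.

    Assumptions: sentences are separated by blank lines, comment lines
    start with '#', and the first tab-separated column is the token ID.
    Multiword-token ranges (e.g. '1-2') and empty nodes (e.g. '1.1')
    are not counted as tokens.
    """
    sentences = 0
    tokens = 0
    in_sentence = False
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if any.
            if in_sentence:
                sentences += 1
            in_sentence = False
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as '# sent_id = ...'
        in_sentence = True
        token_id = line.split("\t")[0]
        if "-" in token_id or "." in token_id:
            continue  # range line or empty node, not a syntactic token
        tokens += 1
    if in_sentence:
        sentences += 1  # file did not end with a blank line
    return sentences, tokens


def make_row(token_id, form):
    """Build a 10-column CoNLL-U row with unspecified annotation fields."""
    return "\t".join([str(token_id), form] + ["_"] * 8)


# Hypothetical toy sample with placeholder forms (not real treebank data).
SAMPLE = "\n".join([
    "# sent_id = toy-1",
    make_row(1, "word1"),
    make_row(2, "word2"),
    make_row(3, "."),
    "",
    "# sent_id = toy-2",
    make_row("1-2", "fused"),
    make_row(1, "part1"),
    make_row(2, "part2"),
    "",
])

if __name__ == "__main__":
    print(count_sentences_and_tokens(SAMPLE))
```

Whether multiword-token ranges count towards the total is a convention; the sketch counts syntactic words only, which may differ from how the figure of 6661 tokens was obtained.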
This paper studies the use of neural machine translation (NMT) as a normalization method for an early English letter corpus. The corpus has previously been normalized so that only the less frequent deviant forms remain unnormalized. This paper discusses several approaches to improving the normalization of these deviant forms. Adding features to the training data is found to be unhelpful, but using a lexicographical resource to filter the top candidates produced by the NMT model, together with lemmatization, improves results.
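The candidate-filtering idea in the abstract above can be illustrated with a short sketch. It assumes that the NMT model returns a ranked list of candidate normalizations, and that a candidate is accepted when its lemma is found in a lexicon. The function name, the toy lemmatizer, the toy lexicon and the fallback of keeping the original form are all hypothetical choices made for illustration; the abstract does not specify them.

```python
def pick_normalization(original, candidates, lexicon, lemmatize):
    """Return the first ranked candidate whose lemma is in the lexicon.

    original   -- the historical (deviant) spelling
    candidates -- candidate normalizations, best first (assumed NMT output)
    lexicon    -- set of known lemmas (stand-in for a lexicographical resource)
    lemmatize  -- function mapping a word form to its lemma

    Assumption: if no candidate passes the filter, the original form is
    kept unchanged. A real system might fall back differently.
    """
    for candidate in candidates:
        if lemmatize(candidate) in lexicon:
            return candidate
    return original


# Hypothetical toy resources, for illustration only.
TOY_LEMMAS = {"letters": "letter", "wrote": "write"}
TOY_LEXICON = {"letter", "write"}


def toy_lemmatize(word):
    """Look the word up in a tiny table; unknown words are their own lemma."""
    return TOY_LEMMAS.get(word, word)


if __name__ == "__main__":
    print(pick_normalization("lettres", ["lettres", "letters"],
                             TOY_LEXICON, toy_lemmatize))
```

Lemmatizing before the lookup lets inflected candidates match a lexicon that lists only base forms, which is one plausible reading of why the two steps are combined.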
We present an open-source Python library that automatically produces syntactically correct Finnish sentences when only lemmas and their relations are provided. The tool automatically resolves the morphosyntax of the sentence, such as agreement and government rules, and uses Omorfi to produce the correct morphological forms. In this paper, we discuss how the case government of verbs can be learned automatically from a corpus and incorporated into the natural language generation (NLG) tool. We also present how agreement rules are modelled in the system and discuss use cases of the tool, such as its initial use as part of a computational creativity system called Poem Machine (Runokone).
Open-source analyzer dictionary development is being implemented for Skolt Sami, Ingrian, Moksha-Mordvin and other languages in the Helsinki CSC infrastructure, home of the Finnish Kielipankki 'Language Bank' and Termipankki 'Term Bank'. The proximity of minority-language corpora in need of annotation and the multiple uses of controlled Wikimedia-type dictionaries make CSC an attractive site for synchronized transducer dictionary development. The open-source finite-state transducer (FST) development of Uralic and other minority languages at Giellatekno-Divvun in Tromsø demonstrates a vast potential for reuse of FSTs, only augmented by open-source work in OmorFi, Apertium and Universal Dependency.
Optical Character Recognition (OCR) can substantially improve the usability of digitized documents. Language modeling using word lists is known to improve OCR quality for English. For morphologically rich languages, however, even large word lists do not reach high coverage on unseen text. Morphological analyzers offer a more sophisticated approach, which is useful in many language processing applications. This paper investigates language modeling in the open-source OCR engine Tesseract using morphological analyzers. We present experiments on two Uralic languages, Finnish and Erzya. According to our experiments, word lists may still be superior to morphological analyzers in OCR, even for languages with rich morphology. Our error analysis indicates that morphological analyzers can cause a large number of real-word OCR errors.
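The coverage problem mentioned in the abstract above can be made concrete with a small calculation. This is a minimal sketch, assuming coverage is measured as the share of running tokens in unseen text that appear in the word list; the function name and the toy data are hypothetical and the abstract does not state how coverage was measured.

```python
def token_coverage(word_list, tokens):
    """Return the fraction of tokens that occur in the word list.

    Matching is case-insensitive. An empty token sequence gives 0.0.
    """
    if not tokens:
        return 0.0
    known = {word.lower() for word in word_list}
    hits = sum(1 for token in tokens if token.lower() in known)
    return hits / len(tokens)


# Toy illustration with inflected forms of Finnish 'talo' (house):
# the list knows two forms, the unseen text contains four.
WORD_LIST = ["talo", "talossa"]
UNSEEN = ["Talo", "talossa", "talostani", "taloissa"]

if __name__ == "__main__":
    print(token_coverage(WORD_LIST, UNSEEN))
```

Each additional inflected form has to be listed separately, which is why a fixed list tends to miss forms in a morphologically rich language.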