This paper describes the representation of Basque Multiword Lexical Units and the automatic processing of Multiword Expressions. After discussing which kinds of multiword expressions are processed at the current stage of the work, we present the representation schema of the corresponding lexical units in a general-purpose lexical database. Thanks to its expressive power, the schema can deal not only with fixed expressions but also with morphosyntactically flexible constructions. It also allows us to lemmatize a word combination as a unit and yet parse its components individually if necessary. Moreover, we describe HABIL, a tool for the automatic processing of these expressions, and we give some evaluation results. This work belongs to a general framework of tools for processing written Basque, which currently ranges from the tokenization and segmentation of single words up to the syntactic tagging of general texts.
In this article we present the results of combining stochastic and rule-based methods applied to the morphosyntactic disambiguation of Basque. The methods used for disambiguation are Constraint Grammar (CG) and the HMM-based tagger developed by the MULTEXT project. Because Basque is an agglutinative language, a morphological analyzer is needed to assign every word all of its possible readings. The CG rules are then applied to the full morphological information, and this process reduces the ambiguity of the texts. Finally, the MULTEXT tools are used to select a single tag from those that remain. When only the stochastic method is used, the error rate is about 14%, although the accuracy of the tagger can be improved by 2% by enriching the lexicon with unknown words. When both methods are combined, however, the error rate of the whole process is 3.5%. Considering that the training corpus is quite small, that the HMM model is first-order, and that the Constraint Grammar for Basque is still under development, we believe that the combined method gives good results and that it may be especially suitable for other agglutinative languages.
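The two-stage disambiguation described in this abstract can be illustrated with a minimal Python sketch: a rule-based step first discards readings that a constraint rejects, and a first-order HMM then picks one tag per word among the survivors. This is not the authors' system, the CG formalism, or the MULTEXT tagger. The function names, the placeholder words `w1` to `w3`, the tag set, the single rule, and all probabilities are assumptions invented for the example.

```python
import math

# Hypothetical toy data: words, tags, the rule, and every probability are
# invented for illustration and do not come from the paper.
WORDS = ["w1", "w2", "w3"]
READINGS = [{"DET"}, {"NOUN", "VERB"}, {"NOUN", "VERB"}]  # analyzer output
# One constraint: reject VERB right after an unambiguous DET.
RULES = [lambda prev_tags, tag: tag == "VERB" and prev_tags == {"DET"}]
TRANS = {("<s>", "DET"): 0.6, ("DET", "NOUN"): 0.9,
         ("NOUN", "VERB"): 0.7, ("NOUN", "NOUN"): 0.3}
EMIT = {("DET", "w1"): 1.0, ("NOUN", "w2"): 0.5, ("VERB", "w2"): 0.5,
        ("NOUN", "w3"): 0.4, ("VERB", "w3"): 0.6}


def apply_constraints(readings, rules):
    """Rule-based step: drop rejected readings, never emptying a token."""
    pruned = []
    for i, tags in enumerate(readings):
        prev_tags = pruned[i - 1] if i > 0 else {"<s>"}
        kept = {t for t in tags if not any(r(prev_tags, t) for r in rules)}
        pruned.append(kept or set(tags))  # fall back if every reading was rejected
    return pruned


def viterbi(words, readings, trans, emit, floor=1e-6):
    """Stochastic step: first-order HMM Viterbi over the surviving tags only."""
    # best[tag] = (log-probability, tag path) of the best path ending in tag
    best = {"<s>": (0.0, [])}
    for word, tags in zip(words, readings):
        new_best = {}
        for t in sorted(tags):
            e = math.log(emit.get((t, word), floor))
            new_best[t] = max(
                (score + math.log(trans.get((p, t), floor)) + e, path + [t])
                for p, (score, path) in best.items()
            )
        best = new_best
    return max(best.values())[1]


pruned = apply_constraints(READINGS, RULES)
print(viterbi(WORDS, pruned, TRANS, EMIT))  # ['DET', 'NOUN', 'VERB']
```

In this toy run the rule already resolves the second word, so the HMM only has to decide the third. That division of labour is the intuition behind the combination: the statistical model chooses among fewer candidates.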
The selection of appropriate Lexical Units (LUs) is an important issue in the development of Continuous Speech Recognition (CSR) systems. Most such systems have traditionally used words as the recognition unit. However, proposals for non-word units are beginning to appear. Basque is an agglutinative language with internal word structure, for which non-word, morpheme-like units could be an appropriate choice. In this work, a statistical analysis of the units obtained after morphological segmentation has been carried out. This analysis shows a potential increase in confusion rates in CSR systems, due to the growth of the set of short, acoustically similar morphemes. Several proposals for Lexical Units are therefore analysed to deal with the problem. Measures of Phonetic Perplexity and Speech Recognition rates have been computed using different sets of units and, based on these measures, a set of alternative non-word units has been selected.