Combining stochastic and rule-based methods for disambiguation in agglutinative languages

Ezeiza, Nerea; Alegria, Iñaki; Arriola, J. M.; Urizar, Ruben; Aduriz, I.

doi:10.3115/980451.980909

Cited by 18 publications

(15 citation statements)

References 0 publications

“…In the present study, we have chosen a different strategy (similar to the one described for other types of languages in (Tapanainen and Voutilainen, 1994), (Ezeiza et al, 1998) and (Hakkani-Tur et al, 2000)). At the same time, the rule-based component is known to perform well in eliminating the incorrect alternatives, rather than picking the correct one under all circumstances.…”

Section: System Combination (mentioning)

confidence: 99%

Serial combination of rules and statistics

Hajič, Krbec, Květoň et al. 2001

Proceedings of the 39th Annual Meeting on Association for Computational Linguistics - ACL '01

A hybrid system is described which combines the strength of manual rulewriting and statistical learning, obtaining results superior to both methods if applied separately. The combination of a rule-based system and a statistical one is not parallel but serial: the rule-based system performing partial disambiguation with recall close to 100% is applied first, and a trigram HMM tagger runs on its results. An experiment in Czech tagging has been performed with encouraging results.

“…With regard to the feature vectors, the computation of the POS information was performed using the Eustagger toolkit [34] and ixa-pipe-pos [35] for the Basque and Spanish languages respectively. In addition, the time-codes at word level were obtained through the audio forced-alignment algorithms presented in [36] for both languages.…”

Section: Basque Corpus Spanish Large Corpus (mentioning)

confidence: 99%

Improving the automatic segmentation of subtitles through conditional random field

Álvarez, Martínez-Hinarejos, Arzelus et al. 2017

Speech Communication

Balenciaga, M.; Del Pozo, A. (2017). Improving the automatic segmentation of subtitles through conditional random field. Speech Communication. 88:83-95.

Abstract: Automatic segmentation of subtitles is a novel research field which has not been studied extensively to date. However, quality automatic subtitling is a real need for broadcasters which seek for automatic solutions given the demanding European audiovisual legislation. In this article, a method based on Conditional Random Field is presented to deal with the automatic subtitling segmentation. This is a continuation of a previous work in the field, which proposed a method based on Support Vector Machine classifier to generate possible candidates for breaks. For this study, two corpora in Basque and Spanish were used for experiments, and the performance of the current method was tested and compared with the previous solution and two rule-based systems through several evaluation metrics. Finally, an experiment with human evaluators was carried out with the aim of measuring the productivity gain in post-editing automatic subtitles generated with the new method presented.

…only increment the percentage of subtitling in the TV and the Internet, but also request quality subtitles. As a result, the demand of automatic solutions for quality subtitling has grown fast in the audiovisual community.

Several parameters take part in the definition of what the quality of subtitles is [1]. Apart from features related to subtitle layout, duration and text editing, subtitling segmentation is one of the most relevant, as it was demonstrated in [2], a study whose aim was to verify whether a correct text chunking in subtitles had an impact on both comprehension and reading speed using human evaluators. Even though important differences were not found in terms of comprehension, they demonstrated that a correct segmentation by phrase or by sentence significantly reduced the time needed to read subtitles. Furthermore, the strong need for proper segmentation is supported by the psycholinguistic literature on reading [3], where the consensual view is that subtitle lines should end at natural linguistic breaks to improve readability and reduce cognitive effort produced by poorly segmented text lines [4].

In this article, a new method based on probabilistic Conditional Random Field is applied to the field of automatic subtitling segmentation for Basque and Spanish languages. This work is a continuation of the previous research presented in [5], in which Support Vector Machine and Logistic Regression classifiers were employed for the subtitling segmentation task in the Basque language. In the present study, the same Basque corpus was used in order to compare the performance using the new classification method. In addition, the work has been extended to the Spanish language. It allowed us to confirm that the new classification method employed was valid for different types of corpora and languages. Given that the results obtained in [5] by the Support V...

“…Another work describing the combined approach is [10]. Its authors introduce the results of combining the statistical and deterministic methods on the basis of the Basque language.…”

Section: Rule-based Approaches (mentioning)

confidence: 99%

A Combined Approach to Part-of-Speech Homonymy Resolution

Batura, Elena 2017

bncc.cs

The Russian language has an inflective structure and does not have a strict word order, which generates processing problems such as part-of-speech homonymy. The paper addresses this issue. The existing approaches to resolving the morphological homonymy problem can be divided into the following groups: rule-based approaches, statistical approaches, machine learning approaches, and combined methods. In the paper, we showed that each approach has its advantages and disadvantages; however, we can achieve a much higher precision of the algorithm by combining several approaches. The combined method based on neural networks gives better results than others (98% precision obtained). We used the following features: normalizing substitutions, grammatical and syntactic characteristics, vector representation of the word, and word forms. All the experiments were performed on the part of the National Corpus of the Russian Language with homonymy resolution. The analysis of the corpus revealed that the most frequent types of homonymy occurred between function words: a particle vs an interjection (14%), and a preposition vs an interjection (13.2%).
