2017
DOI: 10.5755/j01.itc.46.4.18066

Character-Based Machine Learning vs. Language Modeling for Diacritics Restoration

Abstract: In this research we compare two approaches (character-based machine learning and language modeling) and, based on their results, offer the best solution to the diacritization problem. The parameters of the tested approaches (i.e., a wide variety of feature types for the character-based method and the value of n for the n-gram language modeling method) were tuned to achieve the highest accuracy. Although the main focus is on the Lithuanian language, we posit that the obtained findings can also be applie…

Cited by 6 publications (7 citation statements)
References 13 publications
“…Linguistic studies have shown that the emergence of the current word is strongly dependent on many of the words before it [22]. Language models provide a way to calculate the probability of a string appearing.…”
Section: Semantic-based Disambiguation Model For Machine Translation
Citation type: mentioning
confidence: 99%
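
The statement above says a language model assigns a probability to a string. As a rough illustration only, here is a minimal bigram model with add-one smoothing. The toy corpus, the function names `train_bigram` and `log_prob`, and the choice of smoothing are assumptions made for this sketch; they are not taken from the cited or citing papers.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count contexts and bigrams over tokenized sentences (hypothetical helper)."""
    contexts, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        contexts.update(padded[:-1])                  # every token that has a successor
        bigrams.update(zip(padded[:-1], padded[1:]))  # adjacent token pairs
    # Vocabulary of tokens that can be predicted (includes the end marker).
    vocab = {w for s in sentences for w in s} | {"</s>"}
    return contexts, bigrams, vocab


def log_prob(tokens, contexts, bigrams, vocab):
    """Log-probability of a token sequence under the add-one-smoothed bigram model."""
    padded = ["<s>"] + tokens + ["</s>"]
    v = len(vocab)
    total = 0.0
    for prev, cur in zip(padded[:-1], padded[1:]):
        # Add-one smoothing keeps unseen bigrams from getting zero probability.
        total += math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + v))
    return total


# Toy corpus (assumed data, for illustration only).
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
contexts, bigrams, vocab = train_bigram(corpus)
seen = log_prob(["the", "cat", "sat"], contexts, bigrams, vocab)
scrambled = log_prob(["sat", "cat", "the"], contexts, bigrams, vocab)
```

In a diacritics-restoration setting, a scorer of this kind could rank candidate restorations of the same undiacritized text and keep the highest-scoring one; the value of n would be a tunable parameter.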
“…In [28], the character-level and word-level approaches are compared for the Lithuanian language. The authors used conditional random fields (CRF) as the sequence classifier by applying them to the character-level features.…”
Section: Character-level Approaches
Citation type: mentioning
confidence: 99%
“…In [28] character-level and word-level approaches are compared for the Lithuanian language. The authors used conditional random fields (CRF) as the sequence classifier by applying them to the character-level features.…”
Section: Character-level Approaches
Citation type: mentioning
confidence: 99%
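
The statements above mention a sequence classifier applied to character-level features. The sketch below shows one plausible way to build such features for each character position, as dictionaries that a sequence classifier such as a CRF could consume. The feature set, the window size, and the names `char_features` and `sequence_features` are assumptions for illustration; the feature types actually used in the paper are not listed here.

```python
def char_features(text, i, window=2):
    """Feature dict for the character at position i (hypothetical feature set)."""
    feats = {
        "bias": 1.0,
        "char": text[i].lower(),         # the undiacritized character itself
        "is_upper": text[i].isupper(),   # case, so it can be restored afterwards
    }
    # Neighbouring characters within the window, padded at the text boundaries.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        feats[f"char[{offset:+d}]"] = text[j].lower() if 0 <= j < len(text) else "<pad>"
    return feats


def sequence_features(text, window=2):
    """One feature dict per character; the label would be the diacritized character."""
    return [char_features(text, i, window) for i in range(len(text))]


example = sequence_features("abc", window=1)
```

Each position would then be paired with its correct diacritized character as the training label, and the classifier would predict that label for unseen text.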

Correcting diacritics and typos with a ByT5 transformer model

Stankevičius, Lukoševičius, Kapočiūtė-Dzikienė et al. 2022
Preprint, self-citation