computel 2019
DOI: 10.33011/computel.v1i.403

Seeing More Than Whitespace — Tokenisation and Disambiguation in a North Sámi Grammar Checker

Abstract: Communities of lesser resourced languages like North Sámi benefit from language tools such as spell checkers and grammar checkers to improve literacy. Accurate error feedback is dependent on well-tokenised input, but traditional tokenisation as shallow preprocessing is inadequate to solve the challenges of real-world language usage. We present an alternative where tokenisation remains ambiguous until we have linguistic context information available. This lets us accurately detect sentence boundaries, multiwor…

Citations: cited by 6 publications (7 citation statements)
References: 10 publications
“…In our tokenisation, sentence boundary detection is treated as a special case of ambiguous tokenisation, and solved in the same way, approaching near-perfect sentence boundary identification, cf. Wiechetek et al (2019b).…”
Section: Methods (mentioning)
confidence: 99%
“…There is also support for neural network based models of spellchecking (Kaalep et al, 2022), this is however in its current stage still not up to par with the traditional weighted finite-state models given the current error corpus sizes. Since 2019 the GiellaLT infrastructure supports building grammar checkers (Wiechetek et al, 2019a) and these are available for some of the Sámi languages already. Another high-level tool available within the GiellaLT infrastructure is machine translation.…”
Section: Methods (mentioning)
confidence: 99%
“…words that are written apart and should be a compound) in GramDivvun in two ways. Firstly, we compare last year's results in Wiechetek (2019a) with a newer version of GramDivvun, from now on referred to as the Nodalida-corpus. Last year's results are based on version r183544 (Wiechetek et al, 2019a).…”
Section: Pattern (mentioning)
confidence: 99%
“…Firstly, we compare last year's results in Wiechetek (2019a) with a newer version of GramDivvun, from now on referred to as the Nodalida-corpus. Last year's results are based on version r183544 (Wiechetek et al, 2019a). The new results are based on version r28510 of GramDivvun.…”
Section: Pattern (mentioning)
confidence: 99%
“…Rules have been used and are in a wide-spread use in the context of endangered Uralic languages. There is recent work on grammar checking for North Sámi (Wiechetek et al, 2019a) and spell checking for Skolt Sámi (Trosterud and Moshagen, 2021). Other rule-based approaches to grammar checking are extensively described in Wiechetek (2017).…”
Section: Introduction (mentioning)
confidence: 99%