2018
DOI: 10.1162/tacl_a_00033

Universal Word Segmentation: Implementation and Interpretation

Abstract: Word segmentation is a low-level NLP task that is non-trivial for a considerable number of languages. In this paper, we present a sequence tagging framework and apply it to word segmentation for a wide range of languages with different writing systems and typological characteristics. Additionally, we investigate the correlations between various typological factors and word segmentation accuracy. The experimental results indicate that segmentation accuracy is positively related to word boundary markers and nega…
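As a rough illustration of the sequence-tagging formulation described in the abstract (a sketch, not the authors' code; the helper names and the exact B/I/E/S tag scheme are assumptions), each character receives a boundary tag and the tag sequence deterministically yields a segmentation:

```python
# Minimal sketch: word segmentation framed as character-level sequence tagging.
# Tag scheme and helper names are illustrative, not taken from the paper's implementation.

def words_to_tags(words):
    """Convert a segmented sentence (list of words) into per-character
    B/I/E/S boundary tags: B=begin, I=inside, E=end, S=single-character word."""
    chars, tags = [], []
    for word in words:
        chars.extend(word)
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["I"] * (len(word) - 2) + ["E"])
    return chars, tags

def tags_to_words(chars, tags):
    """Recover a segmentation from predicted boundary tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate ill-formed tag sequences
        words.append(current)
    return words

if __name__ == "__main__":
    chars, tags = words_to_tags(["我", "喜欢", "自然语言处理"])
    print(list(zip(chars, tags)))
    print(tags_to_words(chars, tags))
```

A neural tagger then only has to predict one boundary label per character; the conversion back to words is deterministic.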

Cited by 38 publications (34 citation statements)
References 16 publications
“…We use the default parameter settings introduced by Shao et al (2018) and train a segmentation model for all treebanks with at least 50 sentences of training data. For treebanks with less or no training data (except Thai discussed below), we substitute a model for another treebank/language:…”
Section: Sentence and Word Segmentation (mentioning)
confidence: 99%
“…The Uppsala system focuses exclusively on LAS and MLAS, and consists of a three-step pipeline. The first step is a model for joint sentence and word segmentation which uses the BiRNN-CRF framework of Shao et al (2017, 2018) to predict sentence and word boundaries in the raw input and simultaneously marks multiword tokens that need non-segmental analysis. The second component is a part-of-speech (POS) tagger based on Bohnet et al (2018), which employs a sentence-based character model and also predicts morphological features.…”
Section: Introduction (mentioning)
confidence: 99%
“…During the review period of this paper, a paper by Shao et al (2018) appeared which nearly matches the performance of yap on Hebrew segmentation using an RNN approach. Achieving an F-score of 91.01 compared to yap's score of 91.05, but on a dataset with slightly different splits, this system gives a good baseline for a tuned RNN-based system.…”
(mentioning)
confidence: 77%
“…While the early works belonging to this category relied on "traditional" classification techniques, such as maximum entropy models [40] and Conditional Random Fields [41], in recent studies neural architectures are being actively explored [23,27,28,30,42]. In 2018, Shao et al [43] released a language-independent character sequence tagging model based on recurrent neural networks with Conditional Random Fields interface, designed for performing word segmentation in the Universal Dependencies framework. It obtained state-of-the-art accuracies on a wide range of languages.…”
Section: Related Work (mentioning)
confidence: 99%
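To make the character-level "recurrent neural network with Conditional Random Fields interface" cited above more concrete, here is a minimal PyTorch sketch of a bidirectional recurrent character tagger. The class name, dimensions, and the greedy argmax decoder are illustrative assumptions; the CRF output layer used in the published models (available, for example, via the pytorch-crf package) is deliberately left out to keep the example self-contained.

```python
import torch
import torch.nn as nn

class CharTagger(nn.Module):
    """Illustrative character-level BiLSTM tagger for boundary labels (B/I/E/S).
    The cited BiRNN-CRF models add a CRF output layer on top of the emission
    scores; here a per-character argmax stands in for it."""

    def __init__(self, vocab_size, num_tags=4, emb_dim=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.birnn = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, num_tags)

    def forward(self, char_ids):             # char_ids: (batch, seq_len)
        states, _ = self.birnn(self.embed(char_ids))
        return self.proj(states)              # emission scores: (batch, seq_len, num_tags)

if __name__ == "__main__":
    model = CharTagger(vocab_size=1000)
    scores = model(torch.randint(0, 1000, (2, 12)))  # two dummy character sequences
    predicted_tags = scores.argmax(dim=-1)           # greedy decode (CRF omitted)
    print(predicted_tags.shape)                      # torch.Size([2, 12])
```

In the full design, the CRF replaces the argmax so that tag transitions (e.g., that "I" cannot follow "S") are scored jointly over the whole sequence.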