2019
DOI: 10.22364/bjmc.2019.7.4.04

Subword Segmentation for Machine Translation Based on Grouping Words by Potential Roots

Abstract: This paper proposes a new subword segmentation method for machine translation. The algorithm, which we call GenSeg, is generic in the sense that it can be applied to any language, but is designed with an emphasis on inflectional splitting, i.e. it attempts to split words on boundaries corresponding to inflectional suffixes. The main principle of the method is grouping together words that share a common middle substring, and then separating the best such substring from the rest of the word. GenSeg is a cross-la…
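The grouping principle stated in the abstract can be illustrated with a toy sketch. This is not the GenSeg implementation: the abstract does not say how the "best" substring is chosen, so the scoring rule here (the substring shared by the most words, with ties going to the longer one) is an assumption, as are the function names, the `min_len` parameter, and the sample word list.

```python
from collections import defaultdict


def substrings(word, min_len=3):
    """Yield every substring of `word` that has at least `min_len` characters."""
    for start in range(len(word)):
        for end in range(start + min_len, len(word) + 1):
            yield word[start:end]


def group_by_substring(words, min_len=3):
    """Map each substring to the set of words that contain it."""
    groups = defaultdict(set)
    for word in words:
        for sub in substrings(word, min_len):
            groups[sub].add(word)
    return groups


def segment(word, groups, min_len=3):
    """Split `word` around its best shared substring (a candidate root).

    Assumed scoring, not taken from the paper: prefer the substring shared
    by the most words, then the longer one, then alphabetical order.
    """
    candidates = [
        sub for sub in set(substrings(word, min_len))
        if len(groups.get(sub, ())) > 1  # shared with at least one other word
    ]
    if not candidates:
        return [word]  # nothing shared, so leave the word unsplit
    best = max(candidates, key=lambda sub: (len(groups[sub]), len(sub), sub))
    start = word.find(best)
    parts = [word[:start], best, word[start + len(best):]]
    return [part for part in parts if part]


# Hypothetical sample vocabulary, chosen only for illustration.
words = ["walked", "walking", "walks", "jumped", "jumping"]
groups = group_by_substring(words)
segmented = {word: segment(word, groups) for word in words}
```

With this sample vocabulary the candidate roots come out as "walk" and "jump", and the remainder of each word is the inflectional suffix. On a real vocabulary such a naive frequency score would often favour short, very common substrings over true roots, which is why the choice of scoring is the part a real method has to get right.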

Citations: cited by 1 publication (1 citation statement)
References: 7 publications
“…For segmentation of Latvian text, we have applied a GenSeg tool, described in (Zuters and Strazds, 2019) to preprocess the dialog file, so that the input now consisted of the messages in an already segmented form, leaving the rest of the process exactly as before - so that the run of the model on segmented versus unsegmented data differed only in the input file. Having fixed the metaparameters at 64 hidden units and vocabulary size 100, we found that subword segmentation improved the resulting model accuracy by 1.25% (z-score = -4.02).…”
Section: Methods (mentioning)
confidence: 99%