Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023) 2023
DOI: 10.18653/v1/2023.vardial-1.2

Optimizing the Size of Subword Vocabularies in Dialect Classification

Vani Kanjirangat,
Tanja Samardžić,
Ljiljana Dolamic
et al.

Abstract: Pre-trained models usually come with a predefined tokenization and offer little flexibility as to which subword tokens can be used in downstream tasks. This problem especially concerns multilingual NLP and low-resource languages, which are typically processed using cross-lingual transfer. In this paper, we aim to find out whether the right granularity of tokenization is helpful for a text classification task, namely dialect classification. Aiming at generalizations beyond the studied cases, we look for the optimal granula…

References 34 publications