Proceedings of the 28th International Conference on Computational Linguistics 2020
DOI: 10.18653/v1/2020.coling-main.141
Unsupervised Deep Language and Dialect Identification for Short Texts

Abstract: Automatic Language Identification (LI) or Dialect Identification (DI) of short texts of closely related languages or dialects is one of the primary steps in many natural language processing pipelines. Language identification is considered a solved task in many cases; however, in the case of very closely related languages, or in an unsupervised scenario (where the languages are not known in advance), performance is still poor. In this paper, we propose the Unsupervised Deep Language and Dialect Identification …

Cited by 3 publications (5 citation statements)
References 24 publications
“…Character Encoding. The character-level CNN layer generates word representations over character n-grams (n ∈ {2, 3, 4, 5, 6}), which helps to capture the representation of words at different sub-word levels (Goswami et al., 2020). A word's different sub-word level representations are obtained using a 1-dimensional CNN (Zhang et al., 2015).…”
Section: Word Encoder
Citation type: mentioning (confidence: 99%)
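To make that encoder concrete, here is a minimal PyTorch sketch of a character n-gram word encoder of this kind: parallel 1-D convolutions with kernel sizes 2–6 act as character n-gram detectors, and max-pooling over positions yields a fixed-size word vector. All names and hyperparameters (char vocabulary size, char_dim, n_filters) are illustrative assumptions, not values from either cited paper.

```python
import torch
import torch.nn as nn

class CharCNNWordEncoder(nn.Module):
    """Sketch: word representation from character n-grams (n in {2..6})
    via parallel 1-D convolutions, then max-pooling over positions.
    Hyperparameters are illustrative, not taken from the papers."""

    def __init__(self, n_chars=128, char_dim=32, n_filters=50,
                 kernel_sizes=(2, 3, 4, 5, 6)):
        super().__init__()
        self.embed = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            nn.Conv1d(char_dim, n_filters, kernel_size=k) for k in kernel_sizes
        )

    def forward(self, char_ids):  # char_ids: (batch, max_word_len)
        x = self.embed(char_ids).transpose(1, 2)  # (batch, char_dim, len)
        # Each kernel size k acts as a character k-gram detector;
        # max-pooling keeps the strongest match per filter.
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return torch.cat(pooled, dim=1)  # (batch, n_filters * num_kernels)

# Example: a batch of 8 words, each padded to 12 character ids.
words = torch.randint(1, 128, (8, 12))
print(CharCNNWordEncoder()(words).shape)  # torch.Size([8, 250])
```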
“…where k is the number of classes. We train the unsupervised model with the maximum-likelihood clustering loss proposed by Goswami et al. (2020), which maximizes the probability distribution function for each class while minimizing the probability of all the data being assigned to a single class, using Equation 7.…”
Section: Weakly-supervised/unsupervised Cognate Detector
Citation type: mentioning (confidence: 99%)
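Since Equation 7 itself is not reproduced in the snippet, the following sketch only illustrates the general shape of such a maximum-likelihood clustering objective: a confidence term that rewards peaked per-sample class distributions, plus a balance term that penalizes collapsing the whole batch onto one class. The exact terms and weighting below are assumptions, not the formula from Goswami et al. (2020).

```python
import torch

def unsupervised_clustering_loss(logits, balance_weight=1.0):
    """Hypothetical sketch of a maximum-likelihood clustering loss.
    NOTE: Equation 7 of Goswami et al. (2020) is not reproduced in the
    citing snippet; this exact form is an assumption, not their formula."""
    probs = logits.softmax(dim=1)  # (batch, k) class probabilities
    # Confidence term: reward a peaked distribution for each sample.
    confidence = probs.max(dim=1).values.log().mean()
    # Balance term: entropy of the mean assignment; high entropy means
    # the batch is spread over classes instead of stuck on one.
    mean_probs = probs.mean(dim=0)
    balance = -(mean_probs * mean_probs.log()).sum()
    return -(confidence + balance_weight * balance)

# Example: 16 samples over k = 4 candidate classes.
logits = torch.randn(16, 4)
print(unsupervised_clustering_loss(logits))
```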
“…On the side of CNNs, we evaluate two concrete architectures: Kim_CNN (Kim et al., 2016) and Zhang_CNN (Zhang et al., 2015), which are known to perform well on text classification and are widely used. On the side of RNNs, we evaluate the architecture Lin_SA_BiLSTM (Lin et al., 2017), which has been shown to give good results on dialect classification (Goswami et al., 2020). We varied the tokenizers of these models across different granularity levels without changing the overall architecture.…”
Section: Models For Classification
Citation type: mentioning (confidence: 99%)
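For reference, a compact PyTorch sketch of a Lin et al. (2017)-style self-attentive BiLSTM classifier follows; it shows the BiLSTM-then-multi-hop-self-attention structure the statement refers to. Layer sizes and the class count are illustrative, not the settings used in the cited experiments.

```python
import torch
import torch.nn as nn

class SelfAttentiveBiLSTM(nn.Module):
    """Sketch in the spirit of Lin et al. (2017): a BiLSTM followed by
    multi-hop self-attention over the hidden states, then a classifier.
    Sizes are illustrative defaults, not the papers' settings."""

    def __init__(self, emb_dim=100, hidden=128, att_dim=64, hops=4, n_classes=5):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.w_s1 = nn.Linear(2 * hidden, att_dim, bias=False)
        self.w_s2 = nn.Linear(att_dim, hops, bias=False)
        self.classifier = nn.Linear(hops * 2 * hidden, n_classes)

    def forward(self, emb):                         # emb: (batch, seq, emb_dim)
        h, _ = self.lstm(emb)                       # (batch, seq, 2*hidden)
        a = self.w_s2(torch.tanh(self.w_s1(h)))     # (batch, seq, hops)
        a = a.transpose(1, 2).softmax(dim=2)        # attention over positions
        m = torch.bmm(a, h)                         # (batch, hops, 2*hidden)
        return self.classifier(m.flatten(1))        # (batch, n_classes)

# Example: 8 sentences of 20 pre-embedded tokens each.
emb = torch.randn(8, 20, 100)
print(SelfAttentiveBiLSTM()(emb).shape)  # torch.Size([8, 5])
```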
“…akshara n-grams) work better than features based on bytes. Goswami et al. (2020) experiment with supervised and unsupervised methods in dialect identification using, among others, a Dravidian dataset containing Tamil, Telugu, Malayalam, and Kannada.…”
Section: Language Identification Of South Dravidian Languages
Citation type: mentioning (confidence: 99%)
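A small self-contained Python example of the feature distinction being made: n-grams over Unicode code points (a rough approximation of akshara units; true akshara segmentation would also group dependent vowel signs and viramas with their base consonant) versus n-grams over raw UTF-8 bytes, which split Indic script units apart. The sample word is illustrative.

```python
def char_ngrams(text, n):
    """Unicode code-point n-grams; for Indic scripts these approximate
    akshara units far better than raw bytes do."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def byte_ngrams(text, n):
    """Byte n-grams: one Tamil character occupies three UTF-8 bytes,
    so byte n-grams cut through script units."""
    b = text.encode("utf-8")
    return [b[i:i + n] for i in range(len(b) - n + 1)]

word = "தமிழ்"  # "Tamil" written in Tamil script
print(char_ngrams(word, 2))  # bigrams over code points
print(len(word), "code points vs", len(word.encode("utf-8")), "bytes")
```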