Proceedings of the 11th Linguistic Annotation Workshop 2017
DOI: 10.18653/v1/w17-0811

Word Similarity Datasets for Indian Languages: Annotation and Baseline Systems

Abstract: With the advent of word representations, word similarity tasks are becoming increasingly popular as an evaluation metric for the quality of the representations. In this paper, we present manually annotated monolingual word similarity datasets of six Indian languages: Urdu, Telugu, Marathi, Punjabi, Tamil and Gujarati. These are the most spoken Indian languages worldwide after Hindi and Bengali. For the construction of these datasets, our approach relies on translation and re-annotation of word similarity d…

Citations: cited by 13 publications (11 citation statements)
References: 12 publications
“…Datasets are available for some tasks for a few languages. The following are some of the prominent publicly available datasets 2 : word similarity (Akhtar et al, 2017), word analogy (Grave et al, 2018), text classification, sentiment analysis (Akhtar et al, 2016;Mukku and Mamidi, 2017), paraphrase detection (Anand , QA (Clark et al, 2020;, discourse mode classification (Dhanwal et al, 2020), etc. We also create datasets for some tasks, most of which span all major Indian languages. We bundle together the existing datasets and our newly created datasets to create the IndicGLUE benchmark.…”
Section: Related Work (mentioning)
confidence: 99%
“…Given the morphological richness of Indian languages we train FastText word embeddings which are known to be more effective for such languages. To evaluate these embeddings we curate a benchmark comprising of word similarity and analogy tasks (Akhtar et al, 2017;Grave et al, 2018), text classification tasks, sentence classification tasks (Akhtar et al, 2016;Mukku and Mamidi, 2017), and bilingual lexicon induction tasks. On most tasks, the word embeddings trained on our IndicCorp outperform similar embeddings trained on existing corpora for Indian languages.…”
Section: Introduction (mentioning)
confidence: 99%
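
The statement above describes scoring embeddings on a word similarity benchmark. A minimal sketch of how such a score is commonly computed follows. It assumes Spearman rank correlation between the cosine similarity of each word pair and its human rating, which is the usual choice for this kind of dataset but is not confirmed by the quoted text. The function names, the `(word1, word2, score)` pair layout, and the policy of skipping out-of-vocabulary pairs are all illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def evaluate(embeddings, pairs):
    """Return (rho, skipped) for a dict of word vectors and scored pairs.

    Assumption: pairs with an out-of-vocabulary word are skipped and counted.
    """
    model_scores, human_scores, skipped = [], [], 0
    for w1, w2, score in pairs:
        if w1 not in embeddings or w2 not in embeddings:
            skipped += 1
            continue
        model_scores.append(cosine(embeddings[w1], embeddings[w2]))
        human_scores.append(score)
    return spearman(model_scores, human_scores), skipped
```

Reporting the number of skipped pairs alongside the correlation matters for morphologically rich languages, where out-of-vocabulary words are common and silently dropping them can inflate the score.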
“…For the translation from English into other languages, many researchers [2], [10], [11] use the strategy of involving two independent translators, and in case of disagreement on the translations, to have a third expert decide -we applied this approach, too, with translators being Thai academics who are fluent in English. As in Camacho-Collados et al [2], during translation, the annotators were presented the original similarity score of the word pair, in order to help selecting the correct translation for the intended meanings of the words.…”
Section: Dataset Translation (mentioning)
confidence: 99%
“…Sophisticated methods to tackle the problem of OOV words are beyond the scope of this work, but we implemented a baseline method to address the issue using the deepcut tokenizer. 11 As word segmentation is a crucial step in NLP-processing of Thai text, there exist a number of tools and research papers on the topic. In the past, different dictionary-based word segmentation approaches, such as longest-matching and maximal matching, have been employed.…”
Section: Evaluation Tool and Metrics (mentioning)
confidence: 99%
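
The statement above mentions longest-matching as one of the older dictionary-based segmentation approaches. A minimal sketch of greedy longest matching is shown below for illustration. It is not the deepcut tokenizer the cited authors used, and the function name, the toy Latin-script dictionary, and the single-character fallback for unknown text are assumptions made here.

```python
def longest_match_segment(text, dictionary, max_len=None):
    """Greedy left-to-right segmentation using the longest dictionary match.

    Assumption: when no dictionary entry starts at the current position,
    the single character there is emitted as its own token.
    """
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first and shrink until an entry matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        if match is None:
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens


toy_dictionary = {"the", "there", "rein", "in"}
print(longest_match_segment("therein", toy_dictionary))  # ['there', 'in']
```

Because the method commits to the longest match at each position, it never considers the alternative split "the" + "rein" in this toy case, which is the kind of ambiguity that motivated later approaches.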