Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2022
DOI: 10.18653/v1/2022.acl-long.99

Computational Historical Linguistics and Language Diversity in South Asia

Abstract: South Asia is home to a plethora of languages, many of which severely lack access to new language technologies. This linguistic diversity also results in a research environment conducive to the study of comparative, contact, and historical linguistics, fields which necessitate the gathering of extensive data from many languages. We claim that data scatteredness (rather than scarcity) is the primary obstacle in the development of South Asian language technology, and suggest that the study of language history is …

Cited by 4 publications (6 citation statements, all published in 2023); references 26 publications.
“…While the proportion of missing values was not manipulated in the Experiments, exploratory analysis found that larger non-missing proportions were more favorable: increasing the users assigned per utterance was beneficial for a fixed total user pool; but for a fixed number of users per utterance, increasing the user pool was detrimental. While having more data is statistically expedient, it also imposes a greater burden on the user (Shuster et al., 2022). In these Experiments, having 10% of the inter-rater matrix filled meant that each user labeled on average 20 utterances, which is 5x what the CV-based approach (Ju et al., 2022) would have required.…”
Section: Discussion
Confidence: 99%
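The arithmetic behind the 10% figure quoted above can be made concrete. Below is a minimal sketch, not code from the cited experiments, assuming the implied setup of roughly 200 utterances: with a fixed fill fraction of the users × utterances inter-rater matrix, the expected labels per user depend only on the number of utterances, while the expected raters per utterance depend only on the size of the user pool.

```python
# Illustrative sketch of the inter-rater matrix arithmetic quoted above; the
# concrete numbers (50 users, 200 utterances) are assumptions, not the paper's setup.

def annotation_load(num_users: int, num_utterances: int, fill_fraction: float):
    """Average labels per user and raters per utterance for a users x utterances
    inter-rater matrix of which `fill_fraction` is non-missing."""
    total_labels = fill_fraction * num_users * num_utterances
    labels_per_user = total_labels / num_users            # = fill_fraction * num_utterances
    raters_per_utterance = total_labels / num_utterances  # = fill_fraction * num_users
    return labels_per_user, raters_per_utterance

# A 10% fill over ~200 utterances gives ~20 labels per user (the figure quoted
# above); the user pool size only changes how many raters each utterance gets.
print(annotation_load(num_users=50, num_utterances=200, fill_fraction=0.10))
# -> (20.0, 5.0)
```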
“…For chatbots, it is not enough to learn to generate coherent utterances: many coherent utterances are undesirable, such as offensive remarks. Post-deployment, one way to have the chatbot continuously improve its judgment is to source training examples from feedback by live organic users (Shuster et al., 2022). Though more realistic than crowdworkers, organic users include trolls, whose erroneous feedback fosters bad behavior (Wolf et al., 2017).…”
Section: Introduction
Confidence: 99%
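As a rough illustration of the continual-learning setup described above (and not the cited papers' method), one simple defense against troll feedback is to aggregate each example's feedback weighted by a per-user trust score, so that erroneous labels from low-trust users contribute little to the next training round; the trust scores, thresholds, and data below are hypothetical.

```python
# Hypothetical sketch: trust-weighted aggregation of organic-user feedback before
# retraining a chatbot. Trust scores, thresholds, and data are illustrative only.
from collections import defaultdict

def select_training_examples(feedback, user_trust, min_trust_mass=0.5, threshold=0.5):
    """feedback: iterable of (utterance_id, user_id, label) with label in {+1, -1}.
    Returns utterance ids whose trust-weighted mean label clears the threshold."""
    score = defaultdict(float)
    mass = defaultdict(float)
    for utt_id, user_id, label in feedback:
        trust = user_trust.get(user_id, 0.1)  # unseen users start with low trust
        score[utt_id] += trust * label
        mass[utt_id] += trust
    return [u for u in score
            if mass[u] >= min_trust_mass and score[u] / mass[u] >= threshold]

feedback = [("u1", "alice", +1), ("u1", "troll", -1), ("u2", "troll", +1)]
user_trust = {"alice": 0.9, "troll": 0.05}
print(select_training_examples(feedback, user_trust))  # -> ['u1']
```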
“…Our work specifically focuses on improving TOD performance, and hence we caution users against any potentially unsafe/toxic/offensive responses generated by the models. Without safety guardrails such as those of Arora et al. (2022) and Lu et al. (2022), we do not advocate using any of our trained models in production settings.…”
Section: Ethical Considerations
Confidence: 99%
“…The reward function R would represent the "utility" of each action contributing towards the overall performance, such as task success in TOD. Typically, this is modeled by using R(s, a) = 0 for non-terminal states, and R(s_T, a) for terminal states, which can be computed by combining scores such as task success and BLEU (Ramamurthy et al., 2022; Arora et al., 2022). The transition function P : S × A → S would deterministically append the action a_t to the current state s_t so that s_{t+1} = (c_0, …”
Section: Introduction
Confidence: 99%
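A minimal sketch of the sparse-reward dialogue MDP described above, using Python for illustration: the transition deterministically appends the chosen action to the state, and the reward is zero at non-terminal states and combines task success with BLEU at the terminal state. The combination weights (alpha, beta) and the example rollout are assumptions, not the formulation of the cited papers.

```python
# Sketch of the sparse-reward TOD MDP described above; weights and the example
# rollout are illustrative assumptions.
from typing import Tuple

State = Tuple[str, ...]  # s_t: the dialogue context plus the actions generated so far

def transition(state: State, action: str) -> State:
    # P deterministically appends a_t to s_t, giving s_{t+1}.
    return state + (action,)

def reward(state: State, action: str, is_terminal: bool,
           task_success: float = 0.0, bleu: float = 0.0,
           alpha: float = 1.0, beta: float = 0.5) -> float:
    # R(s, a) = 0 for non-terminal states; at the terminal state, combine
    # task success and BLEU (alpha and beta are assumed weights).
    if not is_terminal:
        return 0.0
    return alpha * task_success + beta * bleu

# Example rollout: reward is only assigned once the dialogue ends.
s: State = ("user: book a table for two tonight",)
s = transition(s, "system: which restaurant would you like?")
s = transition(s, "system: booked at 7pm, reference ABC123")
print(reward(s, action="<end>", is_terminal=True, task_success=1.0, bleu=0.42))  # -> 1.21
```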
“…Chinese [7-9, 16, 42, 50] and Japanese [37]), but it consumes considerable computing resources for each language, which is not eco-friendly. In addition, it is difficult to collect massive dialogue sessions for some languages due to the data scarcity and/or scatteredness problem [5]. Another line of investigation has focused on adaptation from pre-trained English models [1, 2].…”
Confidence: 99%