Proceedings of the Sixth Workshop On 2019
DOI: 10.18653/v1/w19-1405

Modeling Global Syntactic Variation in

Abstract: This paper evaluates global-scale dialect identification for 14 national varieties of English as a means for studying syntactic variation. The paper makes three main contributions: (i) introducing data-driven language mapping as a method for selecting the inventory of national varieties to include in the task; (ii) producing a large and dynamic set of syntactic features using grammar induction rather than focusing on a few hand-selected features such as function words; and (iii) comparing models across both we…

Cited by 8 publications (4 citation statements)
References 37 publications

“…Most previous studies on identification of English varieties were corpus-based (Lui and Cook, 2013; Utomo and Sibaroni, 2019; Cook and Hirst, 2012; Dunn, 2019; Simaki et al., 2017; Rangel et al., 2017). The advantage of corpus-based classification is that as the model is trained on actual text collections, it could show the differences in the varieties as they appear "in the wild", and researchers do not need a profound knowledge of lexical differences between the varieties that linguists are aware of.…”
Section: Related Work (mentioning)
confidence: 99%
“…The advantage of corpus-based classification is that as the model is trained on actual text collections, it could show the differences in the varieties as they appear "in the wild", and researchers do not need a profound knowledge of lexical differences between the varieties that linguists are aware of. To obtain reference datasets that are large enough to be used for training the model, researchers most often used or constructed web corpora (Atwell et al., 2007; Lui and Cook, 2013), using the national top-level domains as indicators of the text origin (e.g., .uk for British English), journalistic corpora (Zampieri et al., 2014), national corpora (Lui et al., 2014; Utomo and Sibaroni, 2019), such as the British National Corpus (BNC) (Consortium et al., 2007), and/or social media corpora (Dunn, 2019; Simaki et al., 2017; Rangel et al., 2017), consisting of texts from Twitter and Facebook, where the variety is assigned to texts based on the metadata about the post or its author.…”
Section: Related Work (mentioning)
confidence: 99%
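
The top-level-domain heuristic mentioned in the statement above can be illustrated with a minimal Python sketch. It is not taken from any of the cited papers: the function name `variety_from_url` and the mapping `TLD_TO_VARIETY` are hypothetical, and only the `.uk` entry comes from the quoted text; the other entries are assumptions added for illustration.

```python
from urllib.parse import urlparse

# Hypothetical mapping from national top-level domain to English variety.
# Only ".uk" -> British English is named in the quoted statement; the
# remaining entries are illustrative assumptions.
TLD_TO_VARIETY = {
    "uk": "British English",
    "au": "Australian English",
    "nz": "New Zealand English",
    "ie": "Irish English",
    "za": "South African English",
}


def variety_from_url(url):
    """Return a variety label based on the URL's top-level domain, or None."""
    host = urlparse(url).hostname  # lowercased host, or None if absent
    if not host:
        return None
    # The top-level domain is the last dot-separated label of the host.
    tld = host.rstrip(".").rsplit(".", 1)[-1]
    return TLD_TO_VARIETY.get(tld)


if __name__ == "__main__":
    for page in ("https://www.bbc.co.uk/news", "https://example.com/post"):
        print(page, "->", variety_from_url(page))
```

Generic domains such as `.com` carry no national signal, so a labeler of this kind leaves those pages unlabeled rather than guessing.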
“…Machine learning feature selection techniques have been employed to discover dialect features from corpora. For example, Dunn (2018, 2019) induces a set of constructions (short sequences of words, parts-of-speech, or constituents) from a "neutral" corpus, and then identifies constructions with distinctive distributions over the geographical subcorpora of the International Corpus of English (ICE).…”
Section: Discovering and Detecting Dialect Features (mentioning)
confidence: 99%