2007
DOI: 10.1075/ijcl.12.4.02cro

Multi-dimensional register classification using bigrams

Abstract: A corpus linguistic analysis investigated register classification using frequency of bigrams in nine spoken and two written corpora. Four dimensions emerged from a factor analysis using bigram frequencies shared across corpora: (1) Scripted vs. Unscripted Discourse, (2) Deliberate vs. Unplanned Discourse, (3) Spatial vs. Non-Spatial Discourse, and (4) Directional vs. Non-Directional Discourse. These findings were replicated in a second analysis. Both analyses demonstrate the strength of bigrams for classifying…
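A minimal sketch of the general approach described in the abstract: count bigram frequencies per corpus, keep only bigrams shared across corpora, and run a factor analysis over the resulting frequency matrix. The toy corpora, the relative-frequency normalisation, the scikit-learn FactorAnalysis call, and the single-factor setting are all assumptions for illustration, not the procedure reported by Crossley and Louwerse (2007).

```python
# Illustrative sketch only: bigram frequencies per corpus plus a factor
# analysis over bigrams shared across corpora. The toy corpora, the
# normalisation, and the factor settings are assumptions, not the
# procedure reported by Crossley and Louwerse (2007).
from collections import Counter

import numpy as np
from sklearn.decomposition import FactorAnalysis

def bigram_counts(tokens):
    """Count adjacent word pairs in a token list."""
    return Counter(zip(tokens, tokens[1:]))

# Toy corpora standing in for the spoken and written corpora in the study.
corpora = {
    "spoken_a": "you know i mean it is kind of a plan you know".split(),
    "spoken_b": "i mean you know the plan is kind of new".split(),
    "written_a": "the plan is kind of new and you know it i mean".split(),
}

counts = {name: bigram_counts(toks) for name, toks in corpora.items()}

# Keep only the bigrams shared across all corpora, as the abstract describes.
shared = sorted(set.intersection(*(set(c) for c in counts.values())))

# Corpus-by-bigram matrix of relative frequencies.
X = np.array([
    [counts[name][bg] / sum(counts[name].values()) for bg in shared]
    for name in corpora
])

# Factor analysis over the bigram frequencies. The paper extracts four
# dimensions; one factor is used here only because the toy matrix is tiny.
fa = FactorAnalysis(n_components=1, random_state=0)
fa.fit(X)
print(shared)
print(fa.components_)  # loadings of each shared bigram on the factor
```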

Cited by 48 publications (14 citation statements) | References 14 publications

“…For example, there is a wealth of studies that show that n-grams can be a good diagnostic or a good discriminatory tool in many corpus-linguistic and computational-linguistic domains, for example: − lexical n-grams are used for multidimensional register classification (cf. Crossley & Louwerse 2007), the study of academic English (cf. Biber, Conrad & Cortes 2004 and Simpson-Vlach & Ellis forthcoming), the identification of junk/spam emails (Orasan & Krishnamurthy 2002), etc.…”
Section: N-grams In Today's Corpus Linguistics
confidence: 99%
“…It has been the focus of a range of corpus-based studies employing different terminologies (e.g., pattern, collocation, colligation, multi-word units, lexical bundles, n-gram, construction, among others), but all emphasise the inter-dependence of form and meaning (Biber, 2006; Biber et al., 1999; Hoey, 2005; Hunston and Francis, 2000; Hyland, 2008; and Goldberg, 2006). Crossley and Louwerse (2007) classify registers using the frequency of bigrams shared among nine spoken and two written corpora, the findings of which demonstrate that phrasal units and grammatical constructions can function as a powerful approach to MD analysis. Indeed, as Gries et al. (2011) observe, 'a pure n-gram-based approach can be used as an initial, computationally cheap, way of classifying corpus registers that produces useful results.…”
Section: Selection Of Linguistic Features For Factor Analysis
confidence: 99%
“…A total of 141 linguistic features are used in the study of world Englishes, language variation across different registers and world English varieties. Crossley and Louwerse (2007) introduce bigrams into the MD analytical framework, demonstrating its strength for classifying spoken and written registers. Since we did not know, before the model was established, which linguistic features would be sufficiently strong and significant, we followed Biber's (1995) suggestion that as many features as possible should be included, initially, and at the lowest possible level of groupings.…”
Section: An Introduction To MD Analysis
confidence: 99%
“…It allows for any form of coding, grounded or a priori, and provides for course corrections in midstream, as more interesting categories and insights appear. For this study, we used n-grams (sequences of up to n words) as search terms to classify, categorize and retrieve information and to increase the precision of classifying text (Bekkerman & Allan; Tan et al.; Crossley & Louwerse, 2007). To increase the precision, the analyst iteratively refines the selection of words to become as relevant as possible to the class of concepts being classified (Stryker et al.).…”
Section: Identifying The ISD Canons
confidence: 99%
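A minimal sketch of the n-gram search-term idea mentioned in the last excerpt: extract all n-grams up to a given length from a document and match them against analyst-curated term lists, which are then refined iteratively for precision. The category names and term lists below are invented placeholders, not those used in the cited studies.

```python
# Illustrative sketch: tag documents by matching them against analyst-curated
# n-gram search terms (up to trigrams here). The category names and term
# lists are invented placeholders, not those from the cited studies.
from collections import defaultdict

def ngrams(tokens, max_n=3):
    """All n-grams of length 1..max_n, as space-joined strings."""
    grams = set()
    for n in range(1, max_n + 1):
        grams.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams

# Search terms; in practice the analyst refines these iteratively so they
# become as relevant as possible to the category being classified.
search_terms = {
    "needs_analysis": {"learner needs", "needs assessment", "target audience"},
    "evaluation": {"formative evaluation", "summative evaluation", "test results"},
}

def classify(text):
    """Return the search terms from each category that occur in the text."""
    tokens = [t.strip(".,;:") for t in text.lower().split()]
    grams = ngrams(tokens)
    hits = defaultdict(list)
    for category, terms in search_terms.items():
        for term in terms:
            if term in grams:
                hits[category].append(term)
    return dict(hits)

print(classify("The team ran a formative evaluation after the needs assessment."))
# {'needs_analysis': ['needs assessment'], 'evaluation': ['formative evaluation']}
```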