Multi-dimensional register classification using bigrams

Crossley, Scott A.; Louwerse, Max M.

doi:10.1075/ijcl.12.4.02cro

Cited by 48 publications

(14 citation statements)

References 14 publications

“…For example, there is a wealth of studies that show that n-grams can be a good diagnostic or a good discriminatory tool in many corpus-linguistic and computational-linguistic domains, for example: − lexical n-grams are used for multidimensional register classification (cf. Crossley & Louwerse 2007), the study of academic English (cf. Biber, Conrad & Cortes 2004 and Simpson-Vlach & Ellis forthcoming), the identification of junk/spam emails (Orasan & Krishnamurthy 2002), etc.…”

Section: N-grams In Today's Corpus Linguistics (mentioning)

confidence: 99%

Lexical gravity across varieties of English

Gries

Mukherjee

2010

IJCL

136

In our earlier work on three Asian Englishes and British English, we showed how lexico-syntactic co-occurrence preferences for three argument structure constructions revealed differences between varieties that correlated well with Schneider's (2003, 2007) model of evolutionary stages. Here, we turn to lexical co-occurrence preferences and investigate if and to what degree n-grams distinguish between different modes and varieties in the same components of the International Corpus of English. Our approach to n-grams differs from previous work in that we neither use raw frequencies nor (problematic) MI-values but the newly proposed measure of lexical gravity (cf. Daudaravičius & Marcinkevičienė 2004), which takes type frequencies into consideration. We show how lexical gravity can be extended to handle n-grams with n ≥ 3 and apply this method to our n-gram data; in addition, we suggest a new concept for describing the tendency of a word to occur in significant n-grams: lexical stickiness.

“…It has been the focus of a range of corpus-based studies employing different terminologies, (e.g., pattern, collocation, colligation, multi-word units, lexical bundles, n-gram, construction, among others), but all emphasise the inter-dependence of form and meaning (Biber, 2006; Biber et al., 1999; Hoey, 2005; Hunston and Francis, 2000; Hyland, 2008; and Goldberg, 2006). Crossley and Louwerse (2007) classify registers using the frequency of bigrams shared among nine spoken and two written corpora, the findings of which demonstrate that the phrasal units and grammatical constructions can function as a powerful approach to MD analysis. Indeed, as Gries et al. (2011) observe, 'a pure n-gram-based approach can be used as an initial, computationally cheap, way of classifying corpus registers that produces useful results.…”

Section: Selection Of Linguistic Features For Factor Analysis (mentioning)

confidence: 99%

“…A total of 141 linguistic features are used in the study of world Englishes, language variation across different registers and world English varieties. Crossley and Louwerse (2007) introduce bigrams into the MD analytical framework, demonstrating its strength for classifying spoken and written registers. Since we did not know, before the model was established, which linguistic features would be sufficiently strong and significant, we followed Biber's (1995) suggestion that as many features as possible should be included, initially, and at the lowest possible level of groupings.…”

Section: An Introduction To MD Analysis (mentioning)

confidence: 99%

A multi-dimensional contrastive study of English abstracts by native and non-native writers

Cao

Xiao

2013

Corpora

This article takes the multi-dimensional (MD) analysis approach to explore the textual variations between native and non-native English abstracts on the basis of a balanced corpus containing English abstracts written by native English and native Chinese writers from twelve academic disciplines. A total of 47 out of 163 linguistic features are retained after factor analysis, which underlies a seven-dimension framework representing seven communicative functions. The results show that the two types of abstracts demonstrate significant differences in five out of the seven dimensions. To be more specific, native English writers display a more active involvement and commitment in presenting their ideas than Chinese writers. They also use intensifying devices more frequently. In contrast, Chinese writers show stronger preferences for conceptual elaboration, passives and abstract noun phrases no matter whether the two types of data are examined as a whole or whether variations across disciplines are taken into account. The results are discussed in relation to the possible reasons and suggestions for English abstract writing in China. Methodologically, this study innovatively expands on Biber's (1988) MD analytical framework by integrating colligation in addition to grammatical and semantic features.

“…It allows for any form of coding, grounded or a priori and provides for course corrections in midstream, as more interesting categories and insights appear. For this study, we used n-grams (sequence of up to n words) as search terms to classify, categorize and retrieve information and to increase the precision of classifying text (Bekkerman & Allan, ; Tan et al, ; Crossley & Louwerse, ). To increase the precision, the analyst iteratively refines the selection of words to become as relevant as possible to the class of concepts being classified (Stryker et al, ).…”

Section: Identifying the ISD Canons (mentioning)

confidence: 99%

Distilling a body of knowledge for information systems development

Hassan

Mathiassen

2017

Information Systems Journal

Abstract. As a contribution towards consolidating the information systems (IS) field, we offer a systematic method for distilling a canonical body of knowledge (BOK) for information systems development (ISD), an area that historically accounts for as much as half of all IS research. Based on an integrative synthesis of the literature, we present a map of the most significant ISD research, uncover gaps in its canons and suggest fruitful lines of inquiry for new research. Our review combines citation analysis, which identifies the field's evidence of cumulative tradition, with computer-aided textual analysis, a hermeneutically guided method that organizes the fragmented corpus of ISD literature into coherent knowledge areas. From a pool of over 6500 articles published in the IS Senior Scholars' Basket of Journals, we find 940 IS citation classics, and from that list, 466 ISD articles that offer canonical ISD knowledge distinctive to IS and complementary to other disciplines such as software engineering and project management. From this study, we offer two contributions: (1) a justification for an ISDBOK grounded in the theory of practice and professionalism, and (2) a canonical map of disciplinary ISD knowledge with areas that have demonstrated cumulative tradition and others that require the attention of IS scholars.
