This paper compares the use of recurrent word-combinations (n-grams) in texts produced by Norwegian learners of English and native speakers of English in two academic disciplines, namely linguistics and business. The study explores the extent to which the same n-grams are used by learners and native speakers in the two disciplines. Using an adapted version of Moon's (1998) functional framework, we map the functions of the n-grams, distinguishing between three major functions: ideational/informational, interpersonal and textual. The n-grams are extracted from the VESPA and BAWE corpora, representing learner and native language, respectively. The data reveal a complex picture. Informational n-grams are by far the most frequent type and they seem to be not only discipline-specific, but also topic-specific. There are more n-grams with an interpersonal function (evaluative and modalizing) in the linguistics discipline than in the business discipline. Frequencies of n-grams with a textual/organizational function are more similar across the material. However, there is relatively little overlap in the use of individual n-grams with interpersonal and textual functions across the L1 groups. There is a higher degree of similarity between learners and native speakers in the linguistics discipline than in the business discipline. On the other hand, there is some similarity across disciplines within L1 groups as regards interpersonal and textual n-grams.
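To make the notion of recurrent n-grams and their overlap across groups concrete, the following is a minimal sketch only. The abstract does not state the n-gram length, the frequency threshold, or the extraction tool used, so the values `n=3` and `min_freq=2`, the simple regex tokenizer, the function names, and the toy sentences are all assumptions made for illustration; they are not the study's actual settings or data.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return alphabetic word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def recurrent_ngrams(texts, n=3, min_freq=2):
    """Count word n-grams over a list of texts and keep those occurring at least min_freq times.

    n and min_freq are illustrative defaults, not values taken from the paper.
    """
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        # Slide a window of n tokens over each text; n-grams never cross text boundaries.
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {gram: freq for gram, freq in counts.items() if freq >= min_freq}


def shared_ngrams(group_a, group_b):
    """Return the n-grams that are recurrent in both groups."""
    return set(group_a) & set(group_b)


# Toy stand-ins for learner and native-speaker texts (hypothetical, not corpus data).
learner_texts = [
    "On the other hand, the results are clear.",
    "On the other hand, it is unclear.",
]
native_texts = [
    "On the other hand we see growth.",
    "It is, on the other hand, small.",
]

learner = recurrent_ngrams(learner_texts)
native = recurrent_ngrams(native_texts)
overlap = sorted(shared_ngrams(learner, native))
print(overlap)
```

In a setup like this, classifying each shared n-gram as informational, interpersonal or textual would remain a manual analytical step; the sketch covers only extraction and overlap.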
The information contained in a document is only partly represented by the wording of the text; in addition, features of formatting and layout can be combined to lend specific functionality to chunks of text (e.g., section headings, highlighting, enumeration through list formatting, etc.). Such functional features, although based on the ‘objective’ typographical surface of the document, are often inconsistently realised and encoded only implicitly, i.e., they depend on deciphering by a competent reader. They are characteristic of documents produced with standard text-processing tools. We discuss the representation of such information with reference to the British Academic Written English (BAWE) corpus of student writing, currently under construction at the universities of Warwick, Reading and Oxford Brookes. Assignments are usually submitted to the corpus as Microsoft Word documents and make heavy use of surface-based functional features. As the documents are to be transformed into XML-encoded corpus files, this information can only be preserved through explicit annotation, based on interpretation. We present a discussion of the choices made in the BAWE corpus and the practical requirements for a tagging interface.