This article has two limitations not mentioned in the title. First, it deals with modern English, thus ignoring a considerable body of computational research on texts of an earlier period. Second, it concentrates on research, a major purpose of which is to provide better descriptive information on the English language, giving less attention to computational linguistic research which uses English language data or input for other purposes.
The paper describes the development of software for automatic grammatical analysis of unrestricted, unedited English text at the Unit for Computer Research on the English Language (UCREL) at the University of Lancaster. The work is currently funded by IBM and carried out in collaboration with colleagues at IBM UK (Winchester) and IBM Yorktown Heights. The paper will focus on the lexicon component of the word tagging system, the UCREL grammar, the databanks of parsed sentences, and the tools that have been written to support development of these components. This work has applications to speech technology, spelling correction, and other areas of natural language processing. Currently, our goal is to provide a language model using transition statistics to disambiguate alternative parses for a speech recognition device.

Text Corpora

Historically, the use of text corpora to provide empirical data for testing grammatical theories has been regarded as important to varying degrees by philologists and linguists of differing persuasions. The use of corpus citations in grammars and dictionaries pre-dates electronic data processing (Brown, 1984: 34). While most of the generative grammarians of the 60s and 70s ignored corpus data, the increased power of the new technology nevertheless points the way to new applications of computerized text corpora in dictionary making, style checking and speech recognition. Computer corpora present the computational linguist with the diversity and complexity of real language, which is more challenging for testing language models than intuitively derived examples. Ultimately grammars must be judged by their ability to contend with the real facts of language and not just basic constructs extrapolated by grammarians.

Word Tagging

The system devised for automatic word tagging or part of speech selection for processing running English text, known as the Constituent-Likelihood Automatic Word-tagging System (CLAWS) (Garside et al., 1987), serves as the basis for the current work.
The word tagging system is an automated component of the probabilistic parsing system we are currently working on. In word tagging, each of the running words in the corpus text to be processed is associated with a pre-terminal symbol, denoting word class. In essence, the CLAWS suite can be conceptually divided into two phases: tag assignment and tag selection.
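The two phases can be illustrated with a minimal sketch. This is not the CLAWS implementation: the lexicon, the tag names, the transition probabilities, the fallback values and the function names (`assign_tags`, `select_tags`, `tag`) are all hypothetical and invented for illustration. The sketch assumes that tag selection picks the most probable tag sequence from tag-pair transition probabilities, here with a small Viterbi-style search.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: word -> candidate tags (not the CLAWS lexicon).
LEXICON: Dict[str, List[str]] = {
    "the": ["DET"],
    "can": ["MODAL", "NOUN", "VERB"],
    "rusts": ["VERB", "NOUN"],
}

# Hypothetical tag-pair transition probabilities, invented for illustration.
TRANSITIONS: Dict[Tuple[str, str], float] = {
    ("START", "DET"): 0.6,
    ("DET", "NOUN"): 0.7,
    ("DET", "VERB"): 0.05,
    ("DET", "MODAL"): 0.01,
    ("NOUN", "VERB"): 0.4,
    ("NOUN", "NOUN"): 0.2,
    ("MODAL", "VERB"): 0.6,
    ("VERB", "NOUN"): 0.2,
}
DEFAULT_PROB = 0.001             # assumed floor for unseen tag pairs
UNKNOWN_TAGS = ["NOUN", "VERB"]  # assumed fallback for words not in the lexicon


def assign_tags(words: List[str]) -> List[List[str]]:
    """Phase 1 (tag assignment): look up every candidate tag for each word."""
    return [LEXICON.get(w.lower(), UNKNOWN_TAGS) for w in words]


def select_tags(candidates: List[List[str]]) -> List[str]:
    """Phase 2 (tag selection): choose the most probable tag sequence."""
    # best maps a tag to (probability, path) for the best path ending in it.
    best: Dict[str, Tuple[float, List[str]]] = {"START": (1.0, [])}
    for tags in candidates:
        step: Dict[str, Tuple[float, List[str]]] = {}
        for tag_name in tags:
            prob, path = max(
                (
                    (p * TRANSITIONS.get((prev, tag_name), DEFAULT_PROB), pth)
                    for prev, (p, pth) in best.items()
                ),
                key=lambda item: item[0],
            )
            step[tag_name] = (prob, path + [tag_name])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]


def tag(words: List[str]) -> List[Tuple[str, str]]:
    """Run both phases and pair each word with its selected tag."""
    return list(zip(words, select_tags(assign_tags(words))))


if __name__ == "__main__":
    print(tag(["The", "can", "rusts"]))
```

With these made-up numbers, "can" is assigned three candidate tags in the first phase, and the second phase keeps the noun reading because the determiner-to-noun transition is by far the most likely one. Separating the phases this way keeps the lexicon lookup independent of the statistics used for disambiguation.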
Work at the Unit for Computer Research on the English Language at the University of Lancaster has been directed towards producing a grammatically annotated version of the Lancaster-Oslo/Bergen (LOB) Corpus of written British English texts as the preliminary stage in developing computer programs and data files for providing a grammatical analysis of unrestricted English text. From 1981 to 1983, a suite of PASCAL programs was devised to automatically produce a single level of grammatical description with one word tag representing the word class or part of speech of each word token in the corpus. Error analysis and subsequent modification to the system resulted in over 96 per cent of word tags being correctly assigned automatically. The remaining 3 to 4 per cent were corrected by human post-editors.