Why do some new words manage to enter the lexicon and stay there while others fall out of use and are no longer heard? Of interest to both lay people and linguists, this question has not been answered in an empirically convincing manner to date, mainly because systematic methods have not yet been found for spotting new words as soon as possible after their first occurrence and monitoring their early development and spread as exhaustively as possible. In this paper we present a new and improved tool which is designed to accomplish precisely these tasks when applied to material from the Internet. Following a brief review of existing tools for retrieving linguistic data from the Web (Section 2), we will introduce in some detail a tailor-made webcrawler, the so-called NeoCrawler, which identifies and retrieves neologisms from the Internet and stores the data necessary for systematically monitoring their early development with regard to form, meaning and spread (Section 3). Following this description, we will present a case study discussing the results of an analysis of the neologism detweet with regard to its diffusion, institutionalization, lexicalization and lexical network formation (Section 4). The study indicates that the NeoCrawler can indeed be applied fruitfully in the study of ongoing processes relating to how the meanings and forms of new words are negotiated in the speech community, how words spread in the early stages of their life cycles and how they begin to establish themselves in lexical and semantic networks.
6. RSS and Atom feeds are tools that enable users to update, publish and exchange web content easily. They contain basic information about the content, such as title, link, description and publication date, in XML format. GlossaNet 2 uses this link to access and download the page into the corpus.
7. To our knowledge, the LSE has not yet been realized.
Lexical innovation is omnipresent and constantly at work. Studies aiming to understand the process of lexical innovation and the subsequent diffusion of neologisms therefore benefit from systematic methods of neologism identification. In the past, retrieval procedures largely consisted of manual participant observation and close reading. Recently, attempts have been made to design automated identification procedures, assisted by state-of-the-art natural language processing techniques and tools. Beginning with a discussion of the most commonly used neologism detection methods and applications in linguistics, the present paper describes a semi-automatic approach to identifying new words on the web, the NeoCrawler’s Discoverer, which has been developed as part of a project on the incipient diffusion of lexical innovations. The Discoverer processes large batches of online English text daily and automatically identifies unknown grapheme sequences as potential neologism candidates by means of a dictionary matching procedure, in which the individual tokens are matched against a very large dictionary. These potential neologisms are subsequently presented to the user for manual evaluation of their neologism status. Finally, accepted candidates are added to the NeoCrawler’s database for continuous close monitoring of their development in the online speech community. We argue that dictionary matching offers an efficient method for semi-automatically extracting potential instances of lexical innovation with high precision and high recall when compared to previous approaches.
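The dictionary matching step described above can be illustrated with a minimal sketch. This is not the Discoverer's actual implementation: the tokenization pattern, the function name `find_candidates` and the tiny word set are assumptions made for illustration only, whereas the abstract refers to a very large dictionary and a subsequent manual evaluation step.

```python
import re

# Hypothetical tokenizer: lowercase letter sequences, optionally joined
# by hyphens or apostrophes. The real tool's tokenization is not specified.
TOKEN_RE = re.compile(r"[a-z]+(?:[-'][a-z]+)*")


def find_candidates(text, known_words):
    """Return tokens absent from known_words, mapped to their frequency."""
    counts = {}
    for token in TOKEN_RE.findall(text.lower()):
        if token not in known_words:
            counts[token] = counts.get(token, 0) + 1
    return counts


# Toy dictionary standing in for the very large reference dictionary.
known = {"i", "had", "to", "that", "post", "after", "the", "backlash"}
sample = "I had to detweet that post after the backlash. Detweet!"
candidates = find_candidates(sample, known)
```

In this sketch every unknown grapheme sequence becomes a candidate, so typos and proper nouns would surface alongside genuine neologisms; this is why the candidates still need the manual evaluation the abstract describes.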
This paper presents the NeoCrawler, a tailor-made webcrawler that identifies and retrieves neologisms from the Internet and systematically monitors the use of detected neologisms on the web by means of weekly searches. It enables researchers to use the web as a corpus in order to investigate the dynamics of lexical innovation systematically and on a large scale. The NeoCrawler represents an innovative web-mining tool which opens up new opportunities for linguists to tackle a number of unresolved and under-researched issues in the field of lexical innovation. This paper presents the design as well as the most important characteristics of two modules, the Discoverer and the Observer, with regard to the usage-based study of lexical innovation and diffusion.
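As a rough illustration of what weekly monitoring could yield, the following sketch groups dated occurrences of a neologism by ISO calendar week to form a simple time series. The function name `weekly_counts` and the sample dates are hypothetical; the abstract does not specify how the Observer stores or aggregates its search results.

```python
from collections import Counter
from datetime import date


def weekly_counts(hit_dates):
    """Count occurrences per (ISO year, ISO week), sorted chronologically."""
    counts = Counter()
    for d in hit_dates:
        iso = d.isocalendar()  # (ISO year, ISO week number, ISO weekday)
        counts[(iso[0], iso[1])] += 1
    return dict(sorted(counts.items()))


# Invented publication dates of pages containing the monitored word.
hits = [date(2011, 1, 3), date(2011, 1, 5), date(2011, 1, 12)]
series = weekly_counts(hits)
```

A series of this kind would show whether a word's use is rising, stable or fading from week to week, which is the type of question the weekly searches are meant to address.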