2014
DOI: 10.3366/cor.2014.0055

Spelling errors and keywords in born-digital data: a case study using the Teenage Health Freak Corpus

Abstract: The abundance of language data now available in digital form, and the rise of particular language varieties used for digital communication, mean that non-standard spelling and spelling errors are likely to become a more prominent issue for compilers of such corpora. This paper examines the effect of spelling variation on keywords in a born-digital corpus in order to explore the extent and impact of this variation for future corpus studies. The corpus used in this study consists of emails about health c…

Cited by 5 publications (3 citation statements)
References 14 publications
“…Many social media data have not-so-ideal text characteristics such as the presence of misspellings, acronyms, and colloquial terms, apart from the mixed usage of different languages. While top key terms may be identified reliably despite spelling errors (Smith et al, 2014), more work is needed to establish a similar accuracy for determining and clustering keywords for multilingual data where spelling variation is higher.…”
Section: Discussion (mentioning)
confidence: 99%
“…However, given the possibility of intentional misspelling, the effects of misspelling on judgments of cognitive ability, expertise, or trustworthiness may be complex, reader- and context-dependent. In particular, the quantity of misspelling on unmoderated health forums is so great, especially among adolescents, that it seriously interferes with research (Smith et al, 2014).…”
Section: Discussion (mentioning)
confidence: 99%
“…The software has been modified since its initial development (Rayson et al, 2005) to give the researcher more options to adapt it to their research and type of language. More recently, it has been used to normalise digital language, such as SMS (Tagg et al, 2012) and the Teenage Health Freak Corpus (Smith et al, 2014).…”
Section: Data Normalisation Process (mentioning)
confidence: 99%