2022
DOI: 10.48550/arxiv.2201.07281
Preprint

Annotating the Tweebank Corpus on Named Entity Recognition and Building NLP Models for Social Media Analysis

Abstract: Social media data such as Twitter messages ("tweets") pose a particular challenge to NLP systems because of their short, noisy, and colloquial nature. Tasks such as Named Entity Recognition (NER) and syntactic parsing require highly domain-matched training data for good performance. While there are some publicly available annotated datasets of tweets, they are all purpose-built for solving one task at a time. As yet there is no complete training corpus for both syntactic analysis (e.g., part of speech tagging,…

Cited by 4 publications (5 citation statements)
References 15 publications
“…For NER word clouds, Stanza's 4-class (person, organization, location, and miscellaneous) tweet NER model was used to extract named entities. 34,35 The extracted NERs were plotted per time period per drug to show the change in people's foci over time.…”
Section: Content Analysis | mentioning
confidence: 99%
“…Perhaps the best solutions currently available for this task are the ArkTweet tagger [72] and the finetuned BERTweet [113]. The accuracy of these models evaluated in Tweebank2 is 94.6% [114] and 95.3% [113] respectively.…”
Section: A2 Tweet Annotation | mentioning
confidence: 99%
“…Perhaps the best solutions currently available for this task are the ArkTweet tagger [72] and the finetuned BERTweet [113]. The accuracy of these models evaluated in Tweebank2 is 94.6% [114] and 95.3% [113] respectively. However, since ArkTweet introduces Twitter-specific tags (see table 2), the output is more informative from a human perspective.…”
Section: A2 Tweet Annotation | mentioning
confidence: 99%
“…Researchers have used various forms of text processing technique to automatically extract and analyse documents such as business documents [20], clinical notes [21], legal documents [2], [22] and so on. NLP techniques have been used to perform text information extraction [23], named entity recognition [24], language to SQL translator [25], [26], summarisation [27], classification and examination of other textual contents such as CVs [28], invoices [20] and social media texts [29]. These NLP techniques and others have been largely used around the text content of a document, and sometimes short-text-based documents.…”
Section: Gap | mentioning
confidence: 99%