2020
DOI: 10.48550/arxiv.2008.00364
Preprint
A Survey on Text Classification: From Shallow to Deep Learning

Abstract: Text classification is the most fundamental and essential task in natural language processing. The last decade has seen a surge of research in this area due to the unprecedented success of deep learning. Numerous methods, datasets, and evaluation metrics have been proposed in the literature, raising the need for a comprehensive and updated survey. This paper fills the gap by reviewing the state-of-the-art approaches from 1961 to 2020, focusing on models from shallow to deep learning. We create a taxonomy for t…

Cited by 45 publications (47 citation statements)
References 136 publications
“…We estimate costs for each task in Table 3 (Sun et al., 2020a) that leverages the gradients of the gold labels w.r.t. the embeddings of the input tokens to find the most informative tokens, which have the largest gradients among all positions within a sentence. Then we corrupt the selected tokens with one of the following typos: 1) Insertion; 2) Deletion; 3) Swap; 4) Mistype: mistyping a word through the keyboard, such as "oh" → "0h"; 5) Pronounce: wrongly typing due to the word's similar pronunciation, such as "egg" → "agg"; 6) Replace-W: replacing the word with a frequent human keyboard typo based on Wikipedia statistics (Sun, 2020).…”
Section: MEA Setup and Results (mentioning)
confidence: 99%
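The excerpt above lists concrete character-level typo classes applied to selected tokens. A minimal sketch of four of them is given below; all names are hypothetical, the gradient-based token selection is omitted, and the keyboard-neighbour map is a tiny illustrative subset rather than the statistics-driven table the cited work uses:

```python
import random

# Hypothetical keyboard-neighbour map for the "Mistype" class;
# a real implementation would cover the full keyboard layout.
KEYBOARD_NEIGHBOURS = {"o": "0", "l": "1", "e": "3", "a": "s"}

def corrupt(word: str, mode: str, rng: random.Random) -> str:
    """Apply one character-level typo of the given class to a word."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word))
    if mode == "insert":    # 1) Insertion: duplicate a character
        return word[:i] + word[i] + word[i:]
    if mode == "delete":    # 2) Deletion: drop a character
        return word[:i] + word[i + 1:]
    if mode == "swap":      # 3) Swap: transpose two adjacent characters
        j = min(i, len(word) - 2)
        return word[:j] + word[j + 1] + word[j] + word[j + 2:]
    if mode == "mistype":   # 4) Mistype: keyboard-adjacent substitution
        ch = KEYBOARD_NEIGHBOURS.get(word[i], word[i])
        return word[:i] + ch + word[i + 1:]
    return word

rng = random.Random(0)
for mode in ("insert", "delete", "swap", "mistype"):
    print(mode, corrupt("hello", mode, rng))
```

In practice such corruptions are applied only to the most informative tokens (largest input-gradient magnitude), so the perturbation stays small while still degrading the victim model.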
“…To evaluate the efficacy of the proposed attacks, we select four NLP datasets covering two main tasks, i) sentiment analysis and ii) topic classification (Li et al, 2020). We use TP-US from Trustpilot Sentiment dataset (Hovy et al, 2015) and YELP dataset (Zhang et al, 2015) for sentiment analysis.…”
Section: NLP Tasks and Datasets (mentioning)
confidence: 99%
“…These days, unstructured text is everywhere, from our conversations and comments on social media to emails, websites, etc. Processing such text with artificial intelligence (AI) and Natural Language Processing (NLP) techniques has reached a high level of maturity [1], including for an important text type: medical records. Automatically extracting useful information from medical texts and reports plays a pivotal role in supporting medical decision making [2,3,4].…”
Section: Introduction (mentioning)
confidence: 99%
“…We have used Hierarchical Attention Networks (HAN) from this family. Related works in the literature have considered different datasets and algorithms, making it difficult to form a holistic view [19]. In this work, we provide a comparative view of different families of algorithms on a range of datasets.…”
Section: Introduction (mentioning)
confidence: 99%