Proceedings of the 5th Workshop on Representation Learning for NLP 2020
DOI: 10.18653/v1/2020.repl4nlp-1.10

Exploring the Limits of Simple Learners in Knowledge Distillation for Document Classification with DocBERT

Abstract: Fine-tuned variants of BERT are able to achieve state-of-the-art accuracy on many natural language processing tasks, although at significant computational costs. In this paper, we verify BERT's effectiveness for document classification and investigate the extent to which BERT-level effectiveness can be obtained by different baselines, combined with knowledge distillation, a popular model compression method. The results show that BERT-level effectiveness can be achieved by a single-layer LSTM with at least 40× few…
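To make the distillation setup in the abstract concrete, the sketch below shows a generic soft-label distillation objective in PyTorch, where a small student (for example, a single-layer LSTM classifier) is trained against the logits of a fine-tuned BERT teacher. The temperature T, the weight alpha, and the KL-divergence form of the soft-target term are illustrative assumptions, not the exact loss used in the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Generic soft-label distillation objective (illustration only)."""
    # Soft targets: KL divergence between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the gold labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```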

Cited by 33 publications (34 citation statements)
References 12 publications
“…From a technical perspective, the current BERT pre-trained model has an input limit of 512 tokens. In order to process lengthy documents such as the “Warnings and Precautions” section containing hundreds to thousands of words, various solutions have been proposed, including i) text truncation and ii) text splitting combined with different pooling methods or Long Short-Term Memory networks ( Adhikari et al, 2019a , 2019b ; Sun et al, 2020 ). Such more complex model structures do not fit better the classification criteria for this study and complicate the model interpretation, as compared to a sentence classification-based model structure.…”
Section: Discussion (mentioning)
Confidence: 99%
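The splitting-plus-pooling strategy mentioned in this excerpt can be sketched as follows: the document is tokenized once, cut into windows that fit under BERT's 512-token limit, each window is encoded separately, and the per-window [CLS] vectors are mean-pooled into one document representation. The model name, window size, and the choice of mean pooling below are assumptions for illustration, not details taken from the cited works.

```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").eval()

def encode_long_document(text, window=510):
    # Tokenize without special tokens, then chunk so that [CLS] and [SEP]
    # still fit under BERT's 512-token input limit.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunks = [ids[i:i + window] for i in range(0, len(ids), window)]
    cls_vectors = []
    with torch.no_grad():
        for chunk in chunks:
            inputs = torch.tensor(
                [[tokenizer.cls_token_id] + chunk + [tokenizer.sep_token_id]]
            )
            out = model(input_ids=inputs)
            cls_vectors.append(out.last_hidden_state[:, 0])  # [CLS] vector
    # Mean-pool the per-chunk representations into one document vector.
    return torch.cat(cls_vectors, dim=0).mean(dim=0)
```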
“…Note that SnipBERT is very different from the standard BERT concatenation approach where individual sequences (snippets) are concatenated into a single document and fed into the model (Huang et al., 2019; Devlin et al., 2018; Beltagy et al., 2020; Adhikari et al). In contrast, SnipBERT processes each short snippet individually and aggregates them in an end-to-end manner.…”
Section: Methods (mentioning)
Confidence: 99%
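The contrast this excerpt draws, encoding each snippet independently with a shared encoder and aggregating the snippet vectors inside the model, can be illustrated with the hypothetical module below. The mean-pooling aggregation and the linear classifier head are assumptions made for the sketch, not details of the SnipBERT architecture.

```python
import torch
import torch.nn as nn

class SnippetClassifier(nn.Module):
    """Hypothetical snippet-level document classifier (illustration only)."""

    def __init__(self, encoder, hidden_size, num_labels):
        super().__init__()
        self.encoder = encoder                      # a shared BERT-style encoder
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, snippet_input_ids, snippet_attention_mask):
        # snippet_input_ids: (num_snippets, seq_len) for a single document.
        out = self.encoder(input_ids=snippet_input_ids,
                           attention_mask=snippet_attention_mask)
        cls = out.last_hidden_state[:, 0]           # one vector per snippet
        doc_vector = cls.mean(dim=0, keepdim=True)  # aggregate end-to-end
        return self.classifier(doc_vector)
```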
“…As our source of data, we choose the question-answer portions of U.S. congressional hearings (all in English) for several reasons: they contain political and societal controversy identifiable by crowdsourced workers, they have a strong signal of ambiguity as to the form and intent of the response, and the data is plentiful. 2 A dataset statement is in Appendix D.…”
Section: Dataset (mentioning)
Confidence: 99%