2021
DOI: 10.1007/978-3-030-88942-5_18

A Sentence-Level Hierarchical BERT Model for Document Classification with Limited Labelled Data

Abstract: Training deep learning models with limited labelled data is an attractive scenario for many NLP tasks, including document classification. While with the recent emergence of BERT, deep learning language models can achieve reasonably good performance in document classification with few labelled instances, there is a lack of evidence in the utility of applying BERT-like models on long document classification. This work introduces a long-text-specific model, the Hierarchical BERT Model (HBM), that learns sentence-…
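The abstract describes a model that builds document representations from sentence-level BERT encodings. Below is a minimal sketch of that general idea, assuming a frozen bert-base-uncased sentence encoder, first-token pooling, and a single transformer layer over the sentence vectors; the class name and all hyperparameters are illustrative assumptions, not the authors' HBM implementation.

```python
# Minimal sketch of a sentence-level hierarchical classifier (illustrative only,
# not the authors' HBM): each sentence is encoded with BERT, and a small
# transformer layer attends over the sequence of sentence vectors.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class SentenceHierarchicalClassifier(nn.Module):
    def __init__(self, encoder_name="bert-base-uncased", n_classes=2):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(encoder_name)
        self.sentence_encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.sentence_encoder.config.hidden_size
        # One transformer layer over sentence vectors plays the "document" role.
        self.doc_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8,
                                                    batch_first=True)
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, sentences):
        # Encode each sentence independently and keep its first-token ([CLS]) vector.
        batch = self.tokenizer(sentences, padding=True, truncation=True,
                               max_length=128, return_tensors="pt")
        with torch.no_grad():  # the sentence encoder is frozen in this sketch
            cls_vectors = self.sentence_encoder(**batch).last_hidden_state[:, 0, :]
        # Treat the document as a sequence of sentence vectors, then pool and classify.
        doc = self.doc_layer(cls_vectors.unsqueeze(0))  # (1, n_sentences, hidden)
        return self.classifier(doc.mean(dim=1))

model = SentenceHierarchicalClassifier()
logits = model(["The first sentence of the document.",
                "A second sentence carrying more evidence."])
```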

Cited by 14 publications (10 citation statements)
References 17 publications
“…Our approach is to leverage domain knowledge in conjunction with a state-of-the-art H-BERT architecture [24]. We use a Python implementation of the ADF constructed specifically for Article 6 of the ECHR [18] to provide intermediate classifications of base-level factors, and train an independent H-BERT model for each base-level factor.…”
Section: Hybrid ADF/H-BERT Methods (mentioning)
confidence: 99%
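The quoted hybrid setup trains an independent model for each base-level factor and combines the factor-level predictions through the ADF layer. A hedged sketch of that structure follows, with a TF-IDF plus logistic-regression pipeline standing in for the per-factor H-BERT models and a toy rule standing in for the Article 6 ADF; the factor names, the adf_resolve rule, and the data layout are all hypothetical.

```python
# Illustrative sketch of the hybrid structure described above: one independent
# classifier per base-level factor, with an ADF-style rule layer combining the
# factor predictions. TF-IDF + logistic regression stands in for the per-factor
# H-BERT models; factor names and the resolution rule are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

BASE_LEVEL_FACTORS = ["independent_tribunal", "public_hearing", "reasonable_time"]

def train_factor_models(texts, factor_labels):
    """Train one independent classifier per base-level factor."""
    models = {}
    for factor in BASE_LEVEL_FACTORS:
        clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        clf.fit(texts, [labels[factor] for labels in factor_labels])
        models[factor] = clf
    return models

def adf_resolve(factor_preds):
    """Toy stand-in for the ADF layer: find a violation only if every base-level
    factor is predicted as present (the real ADF encodes richer legal rules)."""
    return int(all(factor_preds.values()))

def predict_outcome(models, case_text):
    # Intermediate, factor-level classifications feed the ADF layer.
    factor_preds = {f: m.predict([case_text])[0] for f, m in models.items()}
    return adf_resolve(factor_preds)
```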
“…The scarcity of the data is in stark contrast to the relatively vast data sets that are usually employed for NLP tasks. We focus on two classification approaches, a state-of-the-art hierarchical BERT approach developed specifically for small data sets which we refer to as H-BERT [24], and our hybrid system which uses the aforementioned H-BERT architecture in conjunction with the ADF layer as outlined in section 4. Both approaches use the same fact-level pre-trained RoBERTa model encodings using 512 tokens, and both use 256 tokens for document BERT model encoding.…”
Section: Data Set and Implementation Details (mentioning)
confidence: 99%
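A small sketch of the two encoding budgets mentioned above (512 tokens for fact-level RoBERTa encodings, 256 tokens for document-level BERT encodings); the checkpoint names and first-token pooling below are assumptions, not the cited configuration.

```python
# Sketch of the two encoding budgets quoted above: fact-level RoBERTa encodings
# truncated at 512 tokens and document-level BERT encodings truncated at 256
# tokens. Checkpoint names and first-token pooling are assumptions.
import torch
from transformers import AutoTokenizer, AutoModel

def encode(texts, model_name, max_length):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    encoder = AutoModel.from_pretrained(model_name)
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch)
    return out.last_hidden_state[:, 0, :]  # first-token representation per text

facts = ["The applicant was denied a public hearing.",
         "The proceedings lasted eleven years."]
fact_vectors = encode(facts, "roberta-base", max_length=512)            # fact level
doc_vector = encode([" ".join(facts)], "bert-base-uncased", max_length=256)  # document level
print(fact_vectors.shape, doc_vector.shape)
```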
“…The reason for using JSD divergence instead of KL divergence is that there may be significant differences between the current policy and the past policy, making the calculation of KL divergence difficult or even impossible. JSD divergence effectively alleviates this problem [27]. If all the oversamples in the sampling batch match the distribution under the current policy, then ρ = 0; when the oversamples match the distribution under the current policy to some extent, then ρ ∈ (0, ∞).…”
Section: Off-policy Correction Algorithm (mentioning)
confidence: 99%
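A quick numeric illustration of the quoted argument, using made-up policy distributions: KL divergence between two nearly non-overlapping distributions becomes very large (and is undefined when the past policy assigns exact zeros), whereas the Jensen-Shannon divergence stays bounded and drops to zero when the sampled batch matches the current policy.

```python
# Numeric illustration of the quoted argument: KL divergence between nearly
# non-overlapping policy distributions is huge (and undefined with exact zeros),
# whereas the Jensen-Shannon divergence is bounded by ln 2 and reaches 0 when
# the distributions coincide. The example distributions are made up.
import numpy as np
from scipy.stats import entropy                    # entropy(p, q) computes KL(p || q)
from scipy.spatial.distance import jensenshannon   # returns the JS *distance*

current_policy = np.array([0.98, 0.01, 0.01])
past_policy    = np.array([0.01, 0.01, 0.98])

kl = entropy(current_policy, past_policy)               # large when supports diverge
jsd = jensenshannon(current_policy, past_policy) ** 2   # squared distance -> divergence
print(kl, jsd)   # KL is roughly 4.4; JSD stays below ln 2 (about 0.693)

# Identical distributions: the divergence is 0, consistent with the rho = 0 case
# described in the quoted text.
print(jensenshannon(current_policy, current_policy) ** 2)
```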
“…For instance, Kulesza et al. (2010) suggest that when a model is trained with a small subset of labelled data, it is prone to exploiting spurious patterns, leading to poor generalisability that is evident in the performance decay on out-of-distribution (OOD) datasets. In spite of these issues, training deep neural networks using few labelled examples is a compelling scenario, since unlabelled data may be abundant but labelled data is expensive to obtain in real-world applications (Lu and MacNamee, 2020; Lu et al., 2021).…”
Section: Introduction (mentioning)
confidence: 99%