2023
DOI: 10.1007/s44163-023-00050-y

Automated occupation coding with hierarchical features: a data-centric approach to classification with pre-trained language models

Abstract: Occupation coding is the classification of information on occupation that is collected in the context of demographic variables. Occupation coding is an important, but a tedious task for researchers in social science and official statistics that calls for automation. Due to the complexity of the task, currently, researchers carry out hand-coding or computer-assisted coding. However, we argue that, with the rise of transformer-based language models, hand-coding can be displaced by models, such as BERT or GPT3. H…

Cited by 2 publications (2 citation statements)
References: 23 publications
“…They highlight the importance of context-based data preparation and demonstrate that improving data quality has a significant impact on prediction accuracy. Safikhani et al [111] argue that transformer-based language models can enhance automated occupation coding. By fine-tuning BERT and GPT3 on pre-labeled data and incorporating job titles and task descriptions, they achieve a 15.72 percentage point performance increase compared to existing methods.…”
Section: Related Work (mentioning)
confidence: 99%
“…Recently, several research articles have proposed DCAI methods in different domains. These include data-centric defense (DCD) for mitigating model inversion attacks [10], crop disease identification in agriculture [11], unsupervised anomaly detection in industrial production [12], transformer-based language models (bidirectional encoder representations from transformers (BERT) and GPT-3) for automated occupation coding [13], enhancing deep neural network (DNN) model robustness [14], a comprehensive community library for biomedical natural language processing (NLP) data sets [15], automatic surgical phase estimation [16], rapid nanoparticle energy prediction [17], identifying incongruous regions in data [18], a multidomain benchmark for…”
Section: Literature Review (mentioning)
confidence: 99%