Developing a Clinical Language Model for Swedish: Continued Pretraining of Generic BERT with In-Domain Data

Lamproudis, Anastasios; Henriksson, Aron; Dalianis, Hercules

doi:10.26615/978-954-452-072-4_090

Cited by 9 publications

(3 citation statements)

References 9 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…4, two different PLMs are used. One-SweDeClin-BERT-that has been trained using pseudonymized pre-training data [22], and another model-SweClin-BERT-that was trained on the unaltered version of the same dataset [51]. Both models were initialized using weights from the Swedish general-domain KB-BERT model [52] and were adapted to the clinical domain by pre-training for three epochs over the Health Bank corpus.…”

Section: Clinical Bert Modelsmentioning

confidence: 99%

“…4 This study uses two different clinical BERT models created in earlier studies. SweClin-BERT is trained on a sensitive version of the Health Bank corpus [51], whereas SweDeClin-BERT is trained on a version that has been automatically pseudonymized [22]. Both models are initialized with the weights of KB-BERT [52] which is comparable to the 3.3 billion words used to train KB-BERT [3].…”

Section: Clinical Bert Modelsmentioning

confidence: 99%

See 1 more Smart Citation

End-to-end pseudonymization of fine-tuned clinical BERT models

Vakili,

Henriksson,

Dalianis

2024

BMC Med Inform Decis Mak

Self Cite

View full text Add to dashboard Cite

Many state-of-the-art results in natural language processing (NLP) rely on large pre-trained language models (PLMs). These models consist of large amounts of parameters that are tuned using vast amounts of training data. These factors cause the models to memorize parts of their training data, making them vulnerable to various privacy attacks. This is cause for concern, especially when these models are applied in the clinical domain, where data are very sensitive. Training data pseudonymization is a privacy-preserving technique that aims to mitigate these problems. This technique automatically identifies and replaces sensitive entities with realistic but non-sensitive surrogates. Pseudonymization has yielded promising results in previous studies. However, no previous study has applied pseudonymization to both the pre-training data of PLMs and the fine-tuning data used to solve clinical NLP tasks. This study evaluates the effects on the predictive performance of end-to-end pseudonymization of Swedish clinical BERT models fine-tuned for five clinical NLP tasks. A large number of statistical tests are performed, revealing minimal harm to performance when using pseudonymized fine-tuning data. The results also find no deterioration from end-to-end pseudonymization of pre-training and fine-tuning data. These results demonstrate that pseudonymizing training data to reduce privacy risks can be done without harming data utility for training PLMs.

show abstract

Section: Clinical Bert Modelsmentioning

confidence: 99%

Section: Clinical Bert Modelsmentioning

confidence: 99%

End-to-end pseudonymization of fine-tuned clinical BERT models

Vakili,

Henriksson,

Dalianis

2024

BMC Med Inform Decis Mak

Self Cite

View full text Add to dashboard Cite

show abstract

“…Since the data is from a Swedish hospital, we use a clinical BERT model for Swedish. Clinical KB-BERT [10] is a domain-adapted version of a generic language model for Swedish that has been further pre-trained on 17.8 GB of Swedish clinical text. Representations from Clinical KB-BERT are then concatenated with the structured EHR data in a fullyconnected neural network.…”

Section: B Multimodal Mortality Prediction Modelmentioning

confidence: 99%

Leveraging Clinical BERT in Multimodal Mortality Prediction Models for COVID-19

Pawar

Henriksson

Hedberg

et al. 2022

2022 IEEE 35th International Symposium on Computer-Based Medical Systems (CBMS)

Self Cite

View full text Add to dashboard Cite

Clinical prediction models are often based solely on the use of structured data in electronic health records, e.g. vital parameters and laboratory results, effectively ignoring potentially valuable information recorded in other modalities, such as freetext clinical notes. Here, we report on the development of a multimodal model that combines structured and unstructured data. In particular, we study how best to make use of a clinical language model in a multimodal setup for predicting 30-day all-cause mortality upon hospital admission in patients with COVID-19. We evaluate three strategies for incorporating a domain-specific clinical BERT model in multimodal prediction systems: (i) without fine-tuning, (ii) with unimodal fine-tuning, and (iii) with multimodal fine-tuning. The best-performing model leverages multimodal fine-tuning, in which the clinical BERT model is updated based also on the structured data. This multimodal mortality prediction model is shown to outperform unimodal models that are based on using either only structured data or only unstructured data. The experimental results indicate that clinical prediction models can be improved by including data in other modalities and that multimodal fine-tuning of a clinical language model is an effective strategy for incorporating information from clinical notes in multimodal prediction systems.

show abstract

On the Impact of the Vocabulary for Domain-Adaptive Pretraining of Clinical Language Models

Lamproudis

Henriksson

2023

Communications in Computer and Information Science

View full text Add to dashboard Cite

Developing a Clinical Language Model for Swedish: Continued Pretraining of Generic BERT with In-Domain Data

Cited by 9 publications

References 9 publications

End-to-end pseudonymization of fine-tuned clinical BERT models

End-to-end pseudonymization of fine-tuned clinical BERT models

Leveraging Clinical BERT in Multimodal Mortality Prediction Models for COVID-19

On the Impact of the Vocabulary for Domain-Adaptive Pretraining of Clinical Language Models

Contact Info

Product

Resources

About