Proceedings of the 21st Workshop on Biomedical Language Processing 2022
DOI: 10.18653/v1/2022.bionlp-1.19

Pretrained Biomedical Language Models for Clinical NLP in Spanish

Abstract: This work presents the first large-scale biomedical Spanish language models trained from scratch, using large biomedical corpora consisting of a total of 1.1B tokens and an EHR corpus of 95M tokens. We compared them against general-domain and other domain-specific models for Spanish on three clinical NER tasks. As main results, our models are superior across the NER tasks, rendering them more convenient for clinical NLP applications. Furthermore, our findings indicate that when enough data is available, pre-tra…

Cited by 17 publications (18 citation statements)
References 12 publications

Citation statements, ordered by relevance:
“…Nowadays, there is a strong development of contextualized word embeddings that assign dynamic representations to words based on their contexts, achieving state-of-the-art performance in multiple tasks. For the clinical domain in Spanish, relevant works include (Akhtyamova et al., 2020; Carrino et al., 2022; Rojas et al., 2022). These contextualized word embeddings are challenging to compute and deploy in production environments due to their demanding infrastructure needs…”
Section: Discussion
confidence: 99%
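
To make the deployment point concrete, the following is a minimal sketch of how such contextualized embeddings are typically computed with a biomedical Spanish RoBERTa model through the Hugging Face transformers library. The checkpoint id and the example sentence are assumptions for illustration, not taken from the statement above.

# Minimal sketch: contextualized embeddings from a biomedical Spanish
# RoBERTa model. The checkpoint id below is an assumed example; substitute
# the actual released model if it differs.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "PlanTL-GOB-ES/roberta-base-biomedical-clinical-es"  # assumed id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

sentence = "El paciente presenta fiebre y cefalea persistente."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Unlike static word vectors, each token's representation here depends on
# the whole sentence context, which is what makes these models accurate
# but demanding to serve in production.
embeddings = outputs.last_hidden_state  # shape: (1, seq_len, hidden_size)
print(embeddings.shape)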
“…Both of them are accessible through Hugging Face [30]. They are RoBERTa-based [31] models, trained on Spanish biomedical texts and fine-tuned on data from the MEDDOPROF corpus [32]. MEDDOPROF is a public corpus consisting of 1,844 Spanish clinical case reports with annotations for occupations (i.e., occupations that provide a person with an income or livelihood), working status, and activities (i.e., non-remunerated professions), as well as annotations indicating to whom the occupation belongs, namely the patient, a family member, a health professional, or others.…”
Section: Methods
confidence: 99%
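
As a hedged illustration of how such a fine-tuned checkpoint would be applied, the sketch below runs the standard Hugging Face token-classification pipeline. The model id is a hypothetical placeholder (the statement does not name the exact MEDDOPROF fine-tuned checkpoint), and the label names printed depend on that checkpoint's configuration.

# Sketch: occupation NER with a MEDDOPROF fine-tuned model. The model id
# is a hypothetical placeholder; substitute the real checkpoint from
# Hugging Face.
from transformers import pipeline

MODEL_ID = "PlanTL-GOB-ES/roberta-base-biomedical-es-meddoprof"  # placeholder

ner = pipeline(
    "token-classification",
    model=MODEL_ID,
    aggregation_strategy="simple",  # merge sub-word pieces into entity spans
)

text = "La paciente trabaja como enfermera en un hospital público."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 3))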
“…Similarly, training models from scratch on mixed-domain data is disputed by the biomedical NLP community, especially in a low-resource setting [44]. Overall, in-domain pretraining is arguably the best option, given the availability of training data and computational resources [45][46][47].…”
Section: Pretrained Language Models for Loe
confidence: 99%
“…Named Entity Recognition (NER) is a foundational task for efficient information extraction. Unsurprisingly, it is known as “the most studied task in the biomedical and clinical NLP literature” [48]. Medical named entities have a more complicated structure than entities in other domains.…”
Section: Named Entity Recognition, Normalization and Linking for Loe
confidence: 99%
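
To illustrate the structural point, medical entities frequently span several tokens, which the common BIO labeling scheme makes explicit. The example below is an illustrative sketch; the entity text, label set, and helper function are assumptions for exposition, not drawn from the cited works.

# Illustrative sketch: a multi-token medical entity under BIO labeling,
# and a small helper that recovers entity spans from the labels.
tokens = ["insuficiencia", "renal", "crónica", "descompensada"]
labels = ["B-DISEASE", "I-DISEASE", "I-DISEASE", "I-DISEASE"]

def bio_to_spans(tokens, labels):
    """Collect (entity_type, entity_text) spans from BIO labels."""
    spans, current = [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if current:
                spans.append(current)
            current = (lab[2:], [tok])  # start a new entity
        elif lab.startswith("I-") and current and lab[2:] == current[0]:
            current[1].append(tok)  # continue the current entity
        else:
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(etype, " ".join(words)) for etype, words in spans]

# [('DISEASE', 'insuficiencia renal crónica descompensada')]
print(bio_to_spans(tokens, labels))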