This paper introduces the Swedish BERT ("KB-BERT") developed by the KBLab for data-driven research at the National Library of Sweden (KB). Building on recent efforts to create transformer-based BERT models for languages other than English, we explain how we used KB's collections to create and train a new language-specific BERT model for Swedish. We also present the results of our model in comparison with existing models (chiefly that produced by the Swedish Public Employment Service, Arbetsförmedlingen, and Google's multilingual M-BERT), where we demonstrate that KB-BERT outperforms these in a range of NLP tasks from named entity recognition (NER) to part-of-speech tagging (POS). Our discussion highlights the difficulties that continue to exist given the lack of training data and testbeds for smaller languages like Swedish. We release our model for further exploration and research here: https://github.com/Kungbib/swedish-bert-models.
Background: The WHO definition of trachomatous trichiasis (TT) is “at least one eyelash touching the globe, or evidence of recent epilation of in-turned eyelashes”, reflecting the fact that epilation is used as a self-management tool for TT. In Fiji’s Western Division, a high TT prevalence (8.7% in those aged ≥15 years) was reported in a 2012 survey, yet a 2013 survey found no TT and Fijian ophthalmologists rarely see TT cases. Local anecdote suggests that eyelash epilation is a common behaviour, even in the absence of trichiasis. Epilators may have been identified as TT cases in previous surveys. Methods: We used a preliminary focus group to design an interview questionnaire, and subsequently conducted a population-based prevalence survey to estimate the prevalence of epilation in the absence of trichiasis, and factors associated with this behaviour, in the Western Division of Fiji. Results: We sampled 695 individuals aged ≥15 years from a total of 457 households in 23 villages. 125 participants (18%) reported epilating their eyelashes at least once within the past year. Photographs were obtained of the eyes of 121/125 (97%) individuals who epilated, and subsequent analysis by an experienced trachoma grader found no cases of trachomatous conjunctival scarring or trichiasis. The age- and sex-adjusted prevalence of epilation in those aged ≥15 years was 8.6% (95% CI 5.7–11.3%). iTaukei ethnicity, female gender, and a higher frequency of drinking kava root were independently associated with epilation. Conclusion: Epilation occurs in this population in the absence of trichiasis, with sufficient frequency to have markedly inflated previous estimates of local TT prevalence. Individuals with epilated eyelashes should be confirmed as having epilated in-turned eyelashes in an eye with scarring of the conjunctiva before being counted as cases of TT.
This article provides an account of the making of KBLab, the data lab at the National Library of Sweden (KB). The first part of the article offers an evaluative discussion of the work involved in establishing KBLab as both a physical and a digital site for researchers to use KB’s digital collections at previously unimaginable scales. Beyond explaining how the lab aligns with KB’s broader mission as a national library, we also elaborate upon the design of the technical setup and the processes of research coordination that the operation of a library lab presumes. The second part discusses how KBLab has deployed the library’s collections as data to produce high-quality Swedish AI models, which constitute a significant new form of digital research infrastructure. We situate this development work in the context of uneven AI coverage for smaller languages, and consider how the lab’s models have contributed to the making of important AI infrastructure for the Swedish language. The conclusion raises the possibilities and challenges involved in continuing the type of library-based AI development we have initiated at KBLab.
How can novel AI techniques be made and put to use in the library? Combining methods from data and library science, this article focuses on Natural Language Processing technologies, especially in national libraries. It explains how the National Library of Sweden's collections enabled the development of a new BERT language model for Swedish. It also outlines specific use cases for the model in the context of academic libraries, detailing strategies for how such a model could make digital collections available for new forms of research, from automated classification to enhanced searchability and improved OCR cohesion. Highlighting the potential for cross-fertilizing AI with libraries, the conclusion suggests that while AI may transform the workings of the library, libraries can also play a key role in the future development of AI.