Introduction: challenges and prospects of born-digital and digitized archives in the digital humanities

Jaillant, Lise; Aske, Katie; Goudarouli, Eirini; Kitcher, Natasha

doi:10.1007/s10502-022-09396-1

Cited by 5 publications

(1 citation statement)

References 1 publication


“…In addition, there is often a backlog of unindexed collections, meaning that staff cannot find suitable sources relating to queries, which negatively impacts the quality of research that relies on archival material. Artificial intelligence has emerged over recent years as a possible solution to this dual problem of more documents being published and having fewer staff to index them (Jaillant 2022).…”

Section: Problem Statement (mentioning)

confidence: 99%

Machine learning for document classification in an archive of the National Afrikaans Literary Museum and Research Centre

Brokensha, Kotzé, Senekal

2023

J. South Afr. Soc. Archiv.


https://dx.doi.org/10.4314/jsasa.v56i1.10 ISSN: 1012-2796 ©SASA 2023

Most archives were established before the digital age, when much smaller volumes of hard-copy material were archived. In the information age, archives struggle to accommodate the large volumes of material produced. In addition, many archives, including those in South Africa, have had to contend with budget cuts that reduced the number of staff available. If digital material is not archived now, gaps may appear in the historical record in the future. Moreover, as the digital humanities gain wider acceptance, large corpora of digital material are needed, which archives could provide. This study aimed to investigate whether document classification using machine learning classifiers is feasible in a South African archive context, with a focus on the National Afrikaans Literary Museum and Research Centre (NALN). The researchers created and trained a document classification model and tested its accuracy against human classifiers. The study followed a basic linguistic approach to prepare specific text documents for text classification in terms of Galloway and Roux’s (2019) six categories, namely articles, media reports, books, interviews, reviews, and dissertations and theses. Two annotators classified the documents, after which the annotated corpus was used as training data for machine learning models. Following Rolan et al. (2018), Suominen (2019), and Connelly et al. (2020), Python libraries were used for document classification. The researchers show that machine learning classifiers can accurately categorise documents into different types. If implemented, this means that archives can improve their collection efforts without spending more on salaries. One way of coping with the information explosion is to develop metadata generation tools based on machine learning and artificial intelligence. If metadata could be generated automatically, it would reduce the pressure on archival personnel by providing a way to handle larger volumes.
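The workflow summarised in the abstract (human-annotated documents used as training data, then model predictions compared with human labels) can be illustrated with a small sketch. The abstract does not name the Python libraries or the classifiers the authors used, so everything here is an assumption made for illustration: the multinomial Naive Bayes model, the `tokenize`, `NaiveBayesClassifier`, and `accuracy` names, and the four invented training sentences. Only two of the six categories are shown, and the toy accuracy says nothing about the study's actual results.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing.

    An illustrative stand-in, not the model used in the study.
    """

    def fit(self, texts, labels):
        self.doc_counts = Counter(labels)        # documents per category
        self.word_counts = defaultdict(Counter)  # token counts per category
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        self.total_docs = len(labels)
        return self

    def predict(self, text):
        # Tokens never seen in training carry no information here, so skip them.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / self.total_docs)  # log prior
            for t in tokens:
                score += math.log((counts[t] + 1) / denom)  # smoothed log likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(predicted, human):
    """Share of documents where the model agrees with the human label."""
    return sum(p == h for p, h in zip(predicted, human)) / len(human)


# Hypothetical annotated corpus: invented sentences, two of the six categories.
train_texts = [
    "The interviewer asked the poet about her childhood and she answered",
    "Question and answer session with the novelist",
    "This review rates the new novel highly and recommends it",
    "The reviewer found the poetry collection disappointing",
]
train_labels = ["interview", "interview", "review", "review"]

clf = NaiveBayesClassifier().fit(train_texts, train_labels)

# Held-out documents with the labels a human classifier would assign.
test_texts = [
    "An interviewer asked the author a question",
    "The reviewer recommends this collection",
]
human_labels = ["interview", "review"]
predicted = [clf.predict(t) for t in test_texts]
score = accuracy(predicted, human_labels)
```

A real archive would need far more annotated documents per category and a held-out test set large enough to make the accuracy figure meaningful.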

