2022
DOI: 10.23962/ajic.i30.13906
A word embedding trained on South African news data

Abstract: This article presents results from a study that developed and tested a word embedding trained on a dataset of South African news articles. A word embedding is an algorithm-generated word representation that can be used to analyse the corpus of words on which the embedding is trained. The embedding on which this article is based was generated using the Word2Vec algorithm, trained on a dataset of 1.3 million African news articles published between January 2018 and March 2021, containing a vocabulary of …
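To make the training procedure concrete, here is a minimal sketch of how such an embedding could be produced with the gensim implementation of Word2Vec. The corpus file, tokenisation, query word, and hyperparameters are illustrative assumptions, not the study's actual pipeline.

```python
# Minimal sketch (assumed setup, not the authors' exact pipeline):
# train a Word2Vec embedding on a corpus of news articles with gensim.
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# Hypothetical input: one news article per line in a plain-text file.
with open("news_articles.txt", encoding="utf-8") as f:
    sentences = [simple_preprocess(line) for line in f]

model = Word2Vec(
    sentences,
    vector_size=300,  # dimensionality of the embedding space (assumed)
    window=5,         # context window of the prediction task (assumed)
    min_count=5,      # drop rare words from the vocabulary (assumed)
    workers=4,
    epochs=5,
)

# Query the trained embedding, e.g. nearest neighbours of a word
# (assumes "economy" occurs often enough to be in the vocabulary).
print(model.wv.most_similar("economy", topn=10))
```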

Cited by 2 publications (2 citation statements). References: 32 publications.
“…The effects of these sources of unreliability can be tested using different bootstrapping methods. One such test involves subsampling a proportion of the data (typically 90%) and training embedding models on each subsampled dataset (Antoniak & Mimno, 2018; Mafunda et al., forthcoming). With large data corpora, the space and processing demands of the method limit the number of bootstrapped embeddings that can be trained.…”
Section: Reliability and Validity. Citation type: mentioning (confidence: 99%).
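The subsampling bootstrap described in this statement can be sketched as follows: train one embedding per 90% subsample of the corpus and compare the nearest neighbours of a query word across replicates. The corpus file, query word, and replicate count below are assumptions for illustration, not the cited papers' code.

```python
# Illustrative sketch of the subsampling bootstrap (assumed details):
# train embeddings on repeated 90% subsamples and check how stable a
# query word's nearest neighbours are across replicates.
import random
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

with open("news_articles.txt", encoding="utf-8") as f:  # hypothetical corpus
    docs = [simple_preprocess(line) for line in f]

neighbour_sets = []
for seed in range(10):  # number of bootstrap replicates (assumed)
    random.seed(seed)
    subsample = random.sample(docs, int(0.9 * len(docs)))  # 90% subsample
    model = Word2Vec(subsample, vector_size=100, window=5,
                     min_count=5, epochs=5)
    neighbours = {w for w, _ in model.wv.most_similar("economy", topn=10)}
    neighbour_sets.append(neighbours)

# Overlap across replicates indicates how stable the embedding is.
common = set.intersection(*neighbour_sets)
print(f"Neighbours shared by all replicates: {common}")
```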
“…This subsampling bootstrapping method produces more robust bias estimates and allows researchers to judge how features of the embedding depend on the particulars of the data. In addition, design decisions (e.g., are words lemmatized or joined to n-grams in data pre-processing, how large is the window of the prediction task, how long do we train for, and how large is the space of the embedding) can be varied to identify economical and reliable methods (Mafunda et al., forthcoming). Generally, bootstrapping has shown bias estimates to be remarkably stable and precise (cf.…”
Section: Reliability and Validity. Citation type: mentioning (confidence: 99%).
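A hedged sketch of varying the design decisions mentioned above (window size of the prediction task, training duration, and size of the embedding space): train one model per configuration and compare them. The grid values and corpus file are assumptions, not settings from the cited work.

```python
# Sketch (assumed setup): vary embedding design decisions over a small
# grid and train one Word2Vec model per configuration for comparison.
from itertools import product
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

with open("news_articles.txt", encoding="utf-8") as f:  # hypothetical corpus
    docs = [simple_preprocess(line) for line in f]

grid = {
    "window": [2, 5, 10],       # size of the prediction-task window
    "epochs": [5, 15],          # how long we train for
    "vector_size": [100, 300],  # size of the embedding space
}

for window, epochs, vector_size in product(*grid.values()):
    model = Word2Vec(docs, window=window, epochs=epochs,
                     vector_size=vector_size, min_count=5)
    print(f"window={window} epochs={epochs} dim={vector_size} "
          f"vocab={len(model.wv)}")
```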