2019
DOI: 10.48550/arxiv.1912.04778
Preprint

GeBioToolkit: Automatic Extraction of Gender-Balanced Multilingual Corpus of Wikipedia Biographies

Marta R. Costa-jussà,
Pau Li Lin,
Cristina España-Bonet

Abstract: We introduce GeBioToolkit, a tool for extracting multilingual parallel corpora at sentence level, with document and gender information from Wikipedia biographies. Despite the gender inequalities present in Wikipedia, the toolkit has been designed to extract corpus balanced in gender. While our toolkit is customizable to any number of languages (and different domains), in this work we present a corpus of 2,000 sentences in English, Spanish and Catalan, which has been post-edited by native speakers to become a h…

Citations: cited by 2 publications (2 citation statements)
References: 16 publications
“…More specifically, a set of given sentences may not have any logical relationship, but a similarity-based language model may be biased towards linking a subset of the sentences, reflecting the coherence bias of the pretraining corpora (May et al, 2019;Kiritchenko and Mohammad, 2018;Nadeem et al, 2020). Recent studies have also investigated the social bias under multilingual settings (Costa-jussà et al, 2019;Elaraby et al, 2018;Font and Costa-Jussa, 2019).…”
Section: Related Work (mentioning)
confidence: 99%
“…There is a link in the English Wikipedia article for "Natural language processing" to the equivalent article titled (mçAljè AllGAt AlTbyçyè, "Natural language processing") in the Arabic Wikipedia edition. This allows us to align the Wikipedia articles at the page (i.e., document) level [61]-[63]. Wikipedia can be generally described as a mixture of noisy parallel and comparable corpora [64].…”
Section: A Wikipedia As a Comparable Corpus (mentioning)
confidence: 99%