GeBioToolkit: Automatic Extraction of Gender-Balanced Multilingual Corpus of Wikipedia Biographies

Costa-jussà, Marta R.; Lin, Pau Li; España-Bonet, Cristina

doi:10.48550/arxiv.1912.04778

Cited by 2 publications

(2 citation statements)

References 16 publications

“…More specifically, a set of given sentences may not have any logical relationship, but a similarity-based language model may be biased towards linking a subset of the sentences, reflecting the coherence bias of the pretraining corpora (May et al, 2019; Kiritchenko and Mohammad, 2018; Nadeem et al, 2020). Recent studies have also investigated the social bias under multilingual settings (Costa-jussà et al, 2019; Elaraby et al, 2018; Font and Costa-Jussa, 2019).…”

Section: Related Work (mentioning)

confidence: 99%

Logic Against Bias: Textual Entailment Mitigates Stereotypical Sentence Reasoning

Luo, Glass

2023

Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Due to their similarity-based learning objectives, pretrained sentence encoders often internalize stereotypical assumptions that reflect the social biases that exist within their training corpora. In this paper, we describe several kinds of stereotypes concerning different communities that are present in popular sentence representation models, including pretrained next sentence prediction and contrastive sentence representation models. We compare such models to textual entailment models that learn language logic for a variety of downstream language understanding tasks. By comparing strong pretrained models based on text similarity with textual entailment learning, we conclude that the explicit logic learning with textual entailment can significantly reduce bias and improve the recognition of social communities, without an explicit de-biasing process. The code, model, and data associated with this work are publicly available at https://github.com/luohongyin/ESP.git.

“…There is a link in the English Wikipedia article for "Natural language processing" to the equivalent article titled (mçAljè AllGAt AlTbyçyè, "Natural language processing") in the Arabic Wikipedia edition. This allows us to align the Wikipedia articles at the page (i.e., document) level [61]-[63]. Wikipedia can be generally described as a mixture of noisy parallel and comparable corpora [64].…”

Section: A Wikipedia As a Comparable Corpus (mentioning)

confidence: 99%

A Simple Yet Robust Algorithm for Automatic Extraction of Parallel Sentences: A Case Study on Arabic-English Wikipedia Articles

Althobaiti

2022

IEEE Access

Parallel corpora are vital components in several applications of Natural Language Processing (NLP), particularly in machine translation. In this paper, we present a novel method for automatically creating parallel sentences from comparable corpora. The method requires a bilingual dictionary as well as an adequate word-vectorisation method. We use Arabic and English Wikipedia as a comparable corpus to apply our proposed method and construct a parallel corpus between Arabic and English. The created Arabic-English corpus consists of 105,010 parallel sentences with a total number of 4.6M words. During our study, we compared two methods of word vectorisation, word embedding and term frequency-inverse document frequency, in terms of their usefulness in computing similarities between well-formed and syntactically ill-formed sentences. We also quantitatively and qualitatively examined the parallel corpus produced by our proposed method and compared it with other available Arabic-English parallel corpora counterparts: GlobalVoices, TED, and Wiki-OPUS. We explored the main advantages and shortcomings of these corpora when used for NLP applications, such as word semantic similarity identification and Neural Machine Translation (NMT). The word semantic similarity models trained on our parallel corpus outperformed models trained on other corpora in the task of English non-similar word identification. Our parallel corpus also proved competitive when building Arabic-English NMT systems, yielding results comparable to those of the automatically created Wiki-OPUS corpus and of the manually created TED corpus, while achieving results superior to the smaller GlobalVoices corpus.

Index Terms: Automatic creation of parallel corpus, automatic sentence alignment, deep learning, neural machine translation, transformer model, word embedding.
