Proceedings of the 30th ACM International Conference on Information & Knowledge Management 2021
DOI: 10.1145/3459637.3482012

Pirá: A Bilingual Portuguese-English Dataset for Question-Answering about the Ocean

Abstract: Current research in natural language processing is highly dependent on carefully produced corpora. Most existing resources focus on English; some resources focus on languages such as Chinese and French; few resources deal with more than one language. This paper presents the Pirá dataset, a large set of questions and answers about the ocean and the Brazilian coast both in Portuguese and English. Pirá is, to the best of our knowledge, the first QA dataset with supporting texts in Portuguese, and, perhaps more im…

Cited by 13 publications (6 citation statements)
References 32 publications
“…Typical translation issues may also have contributed to the relatively low level of human baselines (as explained in the original paper of Pirá 1.0 [1]). In particular, names and nouns received different translations throughout the dataset, a behavior that directly affects quality metrics.…”
Section: Human Baselines; citation type: mentioning
confidence: 99%
“…We first summarize, in Section 2.1, the features of the original Pirá dataset, referred to here as the Pirá 1.0 dataset; that dataset appeared in Ref. [1]. We then describe, in Section 2.2, the new version of the dataset, Pirá 2.0.…”
Section: The Dataset; citation type: mentioning
confidence: 99%
“…To demonstrate the rationale and soundness of our approach, we applied it to abstracts of scientific papers from the Pirá dataset [Paschoal et al 2021]. The Pirá dataset is specifically designed to support the development of question answering models, but it comprises a well-structured corpus of abstracts of scientific papers, which represent a complex domain where extraction of concise knowledge is challenging.…”
Section: Introduction; citation type: mentioning
confidence: 99%
“…It is also worth noting that our studies on Brazilian Portuguese NLP also resulted in relevant cooperation in related fields, such as in question answering (CAÇÃO et al, 2021; PASCHOAL et al, 2021) and zero-shot text classification (ALCOFORADO et al, 2022).…”
Section: Accomplishments; citation type: mentioning
confidence: 85%