Pirá: A Bilingual Portuguese-English Dataset for Question-Answering about the Ocean

Paschoal, André F. A.; Pirozelli, Paulo; Freire, Valdinei; Delgado, Karina Valdivia; Peres, Sarajane Marques; José, Marcos M.; Nakasato, Flávio; Oliveira, André Seidel; Brandão, Anarosa Alves Franco; Costa, Anna Helena Reali; Cozman, Fábio Gagliardi

doi:10.1145/3459637.3482012

Cited by 13 publications

(6 citation statements)

References 32 publications

“…Typical translation issues may also have contributed to the relatively low level of human baselines (as explained in the original paper of Pirá 1.0 [1]). In particular, names and nouns received different translations throughout the dataset, a behavior that directly affects quality metrics.…”

Section: Human Baselines (mentioning)

confidence: 99%

“…We first summarize, in Section 2.1, the features of the original Pirá dataset, referred to here as the Pirá 1.0 dataset; that dataset appeared in Ref. [1]. We then describe, in Section 2.2, the new version of the dataset, Pirá 2.0.…”

Section: The Dataset (mentioning)

confidence: 99%

“…In this paper we deal with Pirá [1], a recently created dataset about the ocean, Brazilian coast, and climate change. In short, Pirá is a reading comprehension dataset, containing texts, questions and answers in two languages (Portuguese and English), manual paraphrases, and human evaluations.…”

Section: Introduction (mentioning)

confidence: 99%

Benchmarks for Pirá 2.0, a Reading Comprehension Dataset on the ocean, the Brazilian coast, and climate change

Pirozelli, Igor, et al. 2022

Preprint

Self Cite

Pirá is a recently developed reading comprehension dataset focused on the ocean, the Brazilian coast, and climate change. No detailed set of baselines has been built with this dataset yet, something that certainly hinders its use by researchers. In this paper, we define five benchmarks over the Pirá dataset, covering machine reading comprehension, information retrieval, open question answering, answer triggering, and multiple choice question answering. As part of this effort, we have produced a curated version of the original dataset, where we fixed a number of grammar issues, repetitions and other shortcomings. Furthermore, the dataset, now called Pirá 2.0, has been extended in several new directions, so as to face the aforementioned benchmark tasks: translation of supporting texts into Portuguese, classification labels on answerability, multiple choice candidates, and automatic paraphrases of questions and answers. The results described in this paper provide a reference point for researchers working with Pirá 2.0. Our results show that Pirá 2.0 is indeed a very challenging dataset, particularly useful for testing the ability of current machine learning models in acquiring expert scientific knowledge.

“…To demonstrate the rationale and soundness of our approach, we applied it to abstracts of scientific papers from the Pirá dataset [Paschoal et al 2021]. The Pirá dataset is specifically designed to support the development of question answering models, but it comprises a well-structured corpus of abstracts of scientific papers, which represent a complex domain where extraction of concise knowledge is challenging.…”

Section: Introduction (mentioning)

confidence: 99%

A strategy for interpreting and visualizing the results of matrix-trifactorization-based coclustering algorithms

Castro, Peres, Freitas Junior, et al. 2023

Anais Do XX Encontro Nacional De Inteligência Artificial E Computacional (ENIAC 2023)

Information yielded by unsupervised learning is often hard to interpret due to the lack of defined labels. To overcome this, we propose and illustrate a strategy for interpreting and visualizing the results of coclustering algorithms based on trifactorization. Our method consists of three steps: (1) vector space visualization; (2) cluster characterization by top documents/words; and (3) cocluster characterization by comparing top words between different clusters. The latter allows exploring the resulting clusters in a way which considers the relationship between attribute cluster and data cluster for every data cluster, instead of just the data cluster with the highest association with this attribute cluster. We illustrate the use of our method for the Non-negative Block Value Decomposition on a dataset of scientific abstracts.

“…It is also worth noting that our studies on Brazilian Portuguese NLP also resulted in relevant cooperation in related fields, such as in question answering (CAÇÃO et al, 2021; PASCHOAL et al, 2021) and zero-shot text classification (ALCOFORADO et al, 2022).…”

Section: Accomplishments (mentioning)

confidence: 85%

Sumarizando múltiplos websites para a geração do Wikipédia PT-BR automaticamente.

Oliveira

Wikipedia is an essential free source of intelligible knowledge. Despite that, the Brazilian Portuguese portal still lacks descriptions for many subjects. To expand the Brazilian Wikipedia, we present PLSum, Portuguese Long Summarizer, a framework for generating wiki-like abstractive summaries from multiple descriptive websites. The framework has an extractive stage followed by an abstractive one. In the extractive stage, parts from documents are extracted on the topic of interest. Then in the abstractive step, fine-tuning is performed, seeking to rewrite the excerpts in a cohesive, correct, and meaningful summary. In particular, we fine-tune and compare two recent variations of the Transformer neural network for the abstractive stage, PTT5 and Longformer. In the extractive stage, we propose a new method based on clustering dense semantic representations to select the most relevant sentences. To fine-tune and evaluate the model, we created a dataset with thousands of examples, linking reference websites to Wikipedia. Our final results show that it is possible to generate meaningful abstractive summaries from Brazilian Portuguese web content. PLSum successfully applies style transfer, which is not possible with fully extractive techniques that are predominant in Brazilian literature. Finally, we also concluded that the use of dense semantic representations for the extractive stage enabled the selection of diverse sentences, making a non-repetitive extractive summary.
