First of all, I thank the best advisor a student could hope for, Professor Sandra. If I ever come to advise someone, I want to try to be at least half of what you were to me.
To my unofficial co-advisor, the incredible linguist and researcher Magali Duran: I cannot even begin to list how much I learned from you.
To the researcher Vanessa Magalhães and to Embrapa, for their support, motivation, and guidance through every phase of this work.
To the other co-authors of the articles in this thesis. If I had any merit in the final result, it was certainly that of serving as a reason to bring so many good people together around the goal of helping me. In particular, Professors Carol, Érica, Elis, Katerina, and Teresa; Professors Gustavo, Denis, and Renê; and my colleagues João, Nathan, and Edresson.
To my wife, dancer and teacher, who has been supporting me on this journey for 16 years.
To my late father, who silently taught me that being honest is more important than knowing how to read and write. To my mother, who awakened in me a love of reading through her example and always encouraged me to keep studying.
To my brother and guide, who brought me into the field of computing and is now trying to take me into coffee growing in the mountains of Minas Gerais.
To Professor Thiago Pardo, for teaching NLP with contagious enthusiasm.
To Professor Thiago and to Professors Graça Nunes, Carol Scarton, Maria José Finatto, and Lilian Hubner, for their valuable advice on the master's and doctoral qualification committees.
To Professors Lilian Hubner and Renata Vieira and to Professor Marcelo Finger, for their careful review and evaluation of the thesis on the final committee.
Sentence complexity assessment is a relatively new task in Natural Language Processing. One of its aims is to highlight which sentences in a text are more complex, in order to support the simplification of content for a target audience (e.g., children, cognitively impaired users, non-native speakers, and low-literacy readers). The task is evaluated on datasets of aligned sentence pairs, each containing a complex and a simple version of the same sentence. For Brazilian Portuguese, the task was addressed by Leal et al. (2018), who built the first dataset for evaluating it in this language and reached 87.8% accuracy with linguistic features. The present work advances these results using models inspired by Gonzalez-Garduño and Søgaard (2018), which hold the state of the art for English, with multi-task learning and eye-tracking measures. First-Pass Duration, Total Regression Duration, and Total Fixation Duration were used at two stages: first to select a subset of linguistic features, and then as auxiliary tasks in the multi-task and sequential learning models. The best model proposed here sets a new state of the art for Portuguese with 97.5% accuracy¹, an increase of almost 10 percentage points over the best previous result. We also propose improvements to the public dataset, based on an analysis of the errors of our best model.
¹ Accuracy in our task measures how often the model agrees with the true label when assessing whether a given sentence is simple or complex, in a 10-fold cross-validation test.
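The evaluation protocol in the footnote (accuracy on a simple-versus-complex decision, averaged over a 10-fold cross-validation) can be sketched as follows. This is a minimal illustration only: the function names, the toy word-count feature, and the threshold classifier are assumptions made for the example and are not the models or features used in the work described above.

```python
from typing import Callable, List, Sequence, Tuple


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label sequences must be non-empty and equally long")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def k_fold_splits(n: int, k: int = 10) -> List[Tuple[List[int], List[int]]]:
    """Split indices 0..n-1 into k interleaved folds; return (train, test) pairs."""
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")
    folds = [list(range(i, n, k)) for i in range(k)]
    splits = []
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train, folds[i]))
    return splits


def cross_validated_accuracy(
    features: Sequence[float],
    labels: Sequence[int],
    fit: Callable[[List[float], List[int]], Callable[[float], int]],
    k: int = 10,
) -> float:
    """Mean test accuracy over k folds; `fit` returns a prediction function."""
    scores = []
    for train, test in k_fold_splits(len(labels), k):
        predict = fit([features[i] for i in train], [labels[i] for i in train])
        preds = [predict(features[i]) for i in test]
        scores.append(accuracy([labels[i] for i in test], preds))
    return sum(scores) / len(scores)


def fit_length_threshold(xs: List[float], ys: List[int]) -> Callable[[float], int]:
    """Hypothetical toy classifier: label 1 (complex) if the word count
    exceeds the midpoint between the two class means seen in training."""
    simple = [x for x, y in zip(xs, ys) if y == 0]
    complex_ = [x for x, y in zip(xs, ys) if y == 1]
    threshold = (sum(simple) / len(simple) + sum(complex_) / len(complex_)) / 2
    return lambda x: int(x > threshold)


# Toy data (invented): word counts of 10 simple and 10 complex sentences.
word_counts = list(range(5, 15)) + list(range(20, 30))
is_complex = [0] * 10 + [1] * 10
mean_acc = cross_validated_accuracy(word_counts, is_complex, fit_length_threshold)
```

Each sentence is used for testing exactly once, so the reported figure is the mean of ten held-out accuracies rather than a score on data the model has already seen.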
This article presents RastrOS, a new corpus of eye movement data collected from university students during silent reading of paragraphs of text in Brazilian Portuguese (BP). The article demonstrates the potential of the corpus for natural language processing (NLP) by using it to evaluate the sentence complexity prediction task in BP, and it describes the NLP resources and methods developed to create the corpus. Specifically, we present: (i) the method used to select the corpus paragraphs from large corpora, using linguistic metrics and clustering algorithms; (ii) the platform for collecting the Cloze test, which is also responsible for creating the project datasets; and (iii) the hybrid semantic similarity method, based on word embedding models and contextualised word representations, used to generate semantic predictability norms. RastrOS can be downloaded from the Open Science Framework (OSF) repository, together with the computational infrastructure mentioned above. Datasets with the predictability norms of 393 participants and the eye-tracking data of 37 participants are available in the OSF repository for this work (https://osf.io/9jxg3/).
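To give a rough idea of what a similarity-based predictability norm might look like, the sketch below scores a target word by the mean cosine similarity between its vector and the vectors of the words that participants supplied in the Cloze test. This is an assumption made for illustration: the abstract does not state the formula of the hybrid method, and the function names and two-dimensional toy vectors here are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def semantic_predictability(
    target_vec: Sequence[float], response_vecs: List[Sequence[float]]
) -> float:
    """Assumed score: mean similarity between the target word's vector
    and the vector of each Cloze response given for that position."""
    if not response_vecs:
        raise ValueError("at least one Cloze response is required")
    sims = [cosine_similarity(target_vec, r) for r in response_vecs]
    return sum(sims) / len(sims)


# Toy vectors (invented): one response matches the target, one is unrelated.
target = [1.0, 0.0]
responses = [[1.0, 0.0], [0.0, 1.0]]
score = semantic_predictability(target, responses)
```

In a real setting the vectors would come from a word embedding model or from contextualised representations, and a hybrid method would combine the two sources in some way that this sketch does not attempt to reproduce.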