Mariona Coll Ardanuy scite author profile

Mariona Coll Ardanuy

5 Publications

138 Citation Statements Received

70 Citation Statements Given

Affiliations

The Alan Turing Institute, Queen Mary University of London, University of Trier

Publications

Assessing the Impact of OCR Quality on Downstream NLP Tasks

Strien, Beelen, Ardanuy, et al. 2020

A growing volume of heritage data is being digitized and made available as text via optical character recognition (OCR). Scholars and libraries are increasingly using OCR-generated text for retrieval and analysis. However, the process of creating text through OCR introduces varying degrees of error to the text. The impact of these errors on natural language processing (NLP) tasks has only been partially studied. We perform a series of extrinsic assessment tasks (sentence segmentation, named entity recognition, dependency parsing, information retrieval, topic modelling and neural language model fine-tuning) using popular, out-of-the-box tools in order to quantify the impact of OCR quality on these tasks. We find a consistent impact resulting from OCR errors on our downstream tasks with some tasks more irredeemably harmed by OCR errors. Based on these results, we offer some preliminary guidelines for working with text produced through OCR.
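The idea of an extrinsic assessment can be illustrated with a toy example. The sketch below is not the paper's method or tooling: it assumes a hypothetical noise model (`add_ocr_noise`, random letter substitution) and a hypothetical proxy for retrieval quality (`token_recall`, the share of distinct clean tokens that survive in the noisy text). It only shows the general shape of measuring a downstream effect at different error rates.

```python
import random
import string


def add_ocr_noise(text: str, error_rate: float, seed: int = 0) -> str:
    """Hypothetical noise model: each letter is swapped for a random
    lowercase letter with probability error_rate (seeded for repeatability)."""
    rng = random.Random(seed)
    noisy = []
    for ch in text:
        if ch.isalpha() and rng.random() < error_rate:
            noisy.append(rng.choice(string.ascii_lowercase))
        else:
            noisy.append(ch)
    return "".join(noisy)


def token_recall(clean: str, noisy: str) -> float:
    """Fraction of distinct clean tokens that still appear in the noisy text.
    A crude stand-in for keyword retrieval, not a real retrieval metric."""
    clean_tokens = set(clean.lower().split())
    noisy_tokens = set(noisy.lower().split())
    if not clean_tokens:
        return 1.0
    return len(clean_tokens & noisy_tokens) / len(clean_tokens)


if __name__ == "__main__":
    sample = "the old newspaper reported a fire near the harbour last night"
    for rate in (0.0, 0.05, 0.2):
        noisy = add_ocr_noise(sample, rate)
        print(rate, round(token_recall(sample, noisy), 2))
```

Real OCR errors are not uniform random substitutions, so a study of this kind would compare human-corrected text against actual OCR output rather than synthetic noise.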

Structure-based Clustering of Novels

Ardanuy, Sporleder 2014

To date, document clustering by genres or authors has been performed mostly by means of stylometric and content features. With the premise that novels are societies in miniature, we build social networks from novels as a strategy to quantify their plot and structure. From each social network, we extract a vector of features which characterizes the novel. We perform clustering over the vectors obtained, and the resulting groups are contrasted in terms of author and genre.
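The pipeline described in this abstract (network, feature vector, clustering) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the scene-level co-occurrence rule, the three features in `feature_vector`, and the single-linkage `cluster` function are all hypothetical choices made here for brevity.

```python
from itertools import combinations
from math import dist


def build_network(scenes):
    """Assumed linking rule: two characters are connected whenever
    they appear in the same scene."""
    nodes, edges = set(), set()
    for scene in scenes:
        nodes.update(scene)
        for a, b in combinations(sorted(set(scene)), 2):
            edges.add((a, b))
    return nodes, edges


def feature_vector(nodes, edges):
    """Illustrative features: node count, edge density, and the share of
    edges that touch the best-connected character."""
    n = len(nodes)
    possible = n * (n - 1) / 2
    density = len(edges) / possible if possible else 0.0
    degree = {node: 0 for node in nodes}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    max_share = max(degree.values()) / len(edges) if edges else 0.0
    return (float(n), density, max_share)


def cluster(vectors, k):
    """Single-linkage agglomerative clustering: repeatedly merge the two
    closest clusters until k remain. Returns lists of vector indices."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > max(k, 1):
        best = None
        for (i, ci), (j, cj) in combinations(enumerate(clusters), 2):
            d = min(dist(vectors[a], vectors[b]) for a in ci for b in cj)
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return clusters


if __name__ == "__main__":
    # Made-up toy "novels": two ensemble casts and one protagonist-centred cast.
    novels = [
        [["A", "B", "C"], ["B", "C", "D"], ["A", "D"]],
        [["P", "Q", "R"], ["Q", "R", "S"], ["P", "S"]],
        [["H", "X"], ["H", "Y"], ["H", "Z"]],
    ]
    vectors = [feature_vector(*build_network(scenes)) for scenes in novels]
    print(cluster(vectors, 2))
```

In practice the features would need scaling before distances are computed, since an unnormalised node count can dominate the other dimensions.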

Toponym disambiguation in historical documents using semantic and geographic features

Ardanuy, Sporleder 2017

Clustering of Novels Represented as Social Networks

Ardanuy, Sporleder 2015, LiLT

Within the field of literary analysis, there are few branches as confusing as that of genre theory. Literary criticism has failed so far to reach a consensus on what makes a genre a genre. In this paper, we examine the degree to which the character structure of a novel is indicative of the genre it belongs to. With the premise that novels are societies in miniature, we build static and dynamic social networks of characters as a strategy to represent the narrative structure of novels in a quantifiable manner. For each of the novels, we compute a vector of literary-motivated features extracted from their network representation. We perform clustering on the vectors and analyze the resulting clusters in terms of genre and authorship.

defoe: A Spark-Based Toolbox for Analysing Digital Historical Textual Data

Filgueira, Ardanuy, Colavizza, et al. 2019

This work presents defoe, a new scalable and portable digital eScience toolbox that enables historical research. It allows for running text mining queries across large datasets, such as historical newspapers and books, in parallel via Apache Spark. It handles queries against collections that comprise several XML schemas and physical representations. The proposed tool has been successfully evaluated using five different large-scale historical text datasets and two HPC environments, as well as on desktops. Results show that defoe allows researchers to query multiple datasets in parallel from a single command-line interface and in a consistent way, without any HPC environment-specific requirements.
