This paper presents the second release of arrau, a multigenre corpus of anaphoric information created over 10 years to provide data for the next generation of coreference/anaphora resolution systems combining different types of linguistic and world knowledge with advanced discourse modeling supporting rich linguistic annotations. The distinguishing features of arrau include the following: treating all NPs as markables, including non-referring NPs, and annotating their (non-) referentiality status; distinguishing between several categories of non-referentiality and annotating non-anaphoric mentions; thorough annotation of markable boundaries (minimal/maximal spans, discontinuous markables); annotating a variety of mention attributes, ranging from morphosyntactic parameters to semantic category; annotating the genericity status of mentions; annotating a wide range of anaphoric relations, including bridging relations and discourse deixis; and, finally, annotating anaphoric ambiguity. The current version of the dataset contains 350K tokens and is publicly available from LDC. In this paper, we discuss in detail all the distinguishing features of the corpus, so far only partially presented in a number of conference and workshop papers, and we also discuss the development between the first release of arrau in 2008 and this second one.
No abstract
Our goal is to improve the contextual appropriateness of spoken output in a dialogue system. We explore the use of the information state to determine the information structure of system utterances. We concentrate on the realization of information structure by intonation. We present the results of evaluating the contextual appropriateness of varied system output produced with a text-to-speech synthesis system that supports intonation annotation.
The LUNA corpus is a multi-lingual, multidomain spoken dialogue corpus currently under development that will be used to develop a robust natural spoken language understanding toolkit for multilingual dialogue services. The LUNA corpus will be annotated at multiple levels to include annotations of syntactic, semantic, and discourse information; specialized annotation tools will be used for the annotation at each of these levels. In order to synchronize these multiple layers of annotation, the PAULA standoff exchange format will be used. In this paper, we present the corpus and its PAULA-based architecture. 1
Purpose This paper aims to describe the European Holocaust Research Infrastructure (EHRI) project's ongoing efforts to virtually integrate trans-national archival sources via the reconstruction of collection provenance as it relates to copy collections (material copied from one archive to another) and the co-referencing of subject and authority terms across material held by distinct institutions. Design/methodology/approach This paper is a case study of approximately 6,000 words length. The authors describe the scope of the problem of archival fragmentation from both cultural and technical perspectives, with particular focus on Holocaust-related material, and describe, with graph-based visualisations, two ways in which EHRI seeks to better integrate information about fragmented material. Findings As a case study, the principal contributions of this paper include reports on our experience with extracting provenance-based connections between archival descriptions from encoded finding aids and the challenges of co-referencing access points in the absence of domain-specific controlled vocabularies. Originality/value Record linking in general is an important technique in computational approaches to humanities research and one that has rightly received significant attention from scholars. In the context of historical archives, however, the material itself is in most cases not digitised, meaning that computational attempts at linking must rely on finding aids which constitute much fewer rich data sources. The EHRI project’s work in this area is therefore quite pioneering and has implications for archival integration on a larger scale, where the disruptive potential of Linked Open Data is most obvious.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2025 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.