Annotating large text corpora is a time-consuming effort. Although single-user annotation tools are available, web-based annotation applications allow for distributed annotation and file access from different locations. In this paper we present the web-based annotation application Serengeti for annotating anaphoric relations, which will be extended for the annotation of lexical chains.
Seamless integration of various linguistic resources, which are often heterogeneous in terms of their output formats, and merging of the respective annotation layers are crucial tasks for linguistic research. After a decade of concentration on the development of formats for structuring single annotations for specific linguistic issues, a variety of specifications for storing multiple annotations over the same primary data has been developed in recent years. Among these approaches, three main architectures can be identified: Prolog-based architectures, XML-related approaches, and graph-based models that follow the XML syntax. However, these architectures are not free of disadvantages when used in real-world applications. In the Sekimo project, the XML-based Sekimo Generic Format (SGF) was developed for the purpose of storing multiple annotations on the same primary data and examining relationships between elements of different annotation layers without prior conversion. SGF is based on the design principles of graph-based approaches but makes use of the XML-inherent tree structures whenever possible to reduce processing costs. Data stored in SGF can be analysed via standard XML-related specifications such as XPath, XSLT or XQuery; in our project, this is done in the linguistic application domain of anaphora resolution.
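The idea of relating elements from different annotation layers over shared primary data can be illustrated with a minimal sketch. It is not the actual SGF schema: the element and attribute names (`corpus`, `primaryData`, `layer`, `seg`, `start`, `end`, `ref`) and the helper functions are assumptions chosen for illustration, and Python's `xml.etree.ElementTree` supports only a limited XPath subset rather than the full XPath, XSLT or XQuery tooling mentioned above. Each layer points into the primary text by character offsets, so relationships across layers are found by comparing spans.

```python
import xml.etree.ElementTree as ET

# Hypothetical standoff document. Element and attribute names are
# illustrative assumptions, not the actual SGF schema.
DOC = """
<corpus>
  <primaryData>Peter bought a car. He likes it.</primaryData>
  <layer name="syntax">
    <seg id="s1" start="0" end="19" type="sentence"/>
    <seg id="s2" start="20" end="32" type="sentence"/>
  </layer>
  <layer name="anaphora">
    <seg id="a1" start="0" end="5" type="antecedent"/>
    <seg id="a2" start="20" end="22" type="anaphor" ref="a1"/>
  </layer>
</corpus>
"""


def span(seg):
    """Return the (start, end) character offsets of a segment (end exclusive)."""
    return int(seg.get("start")), int(seg.get("end"))


def text_of(root, seg):
    """Look up the stretch of primary data that a segment refers to."""
    start, end = span(seg)
    return root.findtext("primaryData")[start:end]


def containing(root, layer_name, seg):
    """Find the first segment of another layer whose span encloses `seg`."""
    start, end = span(seg)
    for cand in root.findall(f"./layer[@name='{layer_name}']/seg"):
        c_start, c_end = span(cand)
        if c_start <= start and end <= c_end:
            return cand
    return None


def anaphor_pairs(root):
    """Yield (anaphor text, its sentence id, antecedent text, its sentence id)."""
    segs = {s.get("id"): s for s in root.findall("./layer[@name='anaphora']/seg")}
    for seg in segs.values():
        ref = seg.get("ref")
        if ref is None:
            continue  # not an anaphor, nothing to resolve
        antecedent = segs[ref]
        yield (
            text_of(root, seg),
            containing(root, "syntax", seg).get("id"),
            text_of(root, antecedent),
            containing(root, "syntax", antecedent).get("id"),
        )


root = ET.fromstring(DOC)
pairs = list(anaphor_pairs(root))
print(pairs)
```

In this toy document the query relates the anaphora layer to the syntax layer without converting either one: the anaphor and its antecedent are each mapped to the sentence whose span encloses them.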
This paper presents a refined taxonomy of XML schema languages based on the work by Murata et al. (2005). It can be seen as a first building block for a more elaborate formal analysis of XML and its accompanying specifications, in this case XML schema languages such as DTD, XSD and RELAX NG.
The paper discusses two topics. First, an approach to using multiple layers of annotation is sketched out; in terms of its XML representation, this approach is similar to standoff annotation. Second, the paper examines the use of heterogeneous linguistic resources (e.g., XML-annotated documents, taggers, lexical nets) as a source for semiautomatic multi-dimensional markup to resolve typical linguistic issues, with anaphora resolution as a case study.