The annotation of linguistic resources has long-standing traditions (see Cole et al., 2010). The other chapters of this book make clear that the production of annotated resources is a laborious, time-consuming, and expensive task. In theory, we want to make these resources available in such a way that they can be re-used by as many scholars as possible (see Ide & Romary, 2002). However, a large variety of annotation formats has been developed in the previous decades, each one created for a specific research task. Consequently, the resulting resources are frequently only usable by members of the individual research projects.

The goal of the present chapter is to explore the possibility of providing the research and industrial communities that commonly use spoken corpora with a set of well-documented standardised formats that allow a high re-use rate of annotated spoken resources and, as a consequence, better interoperability across the tools used to produce or exploit such resources. We hope to identify standards that cover all possible aspects of the management workflow of spoken data, from the actual representation of raw recordings and transcriptions to high-level content-related information at a semantic or pragmatic level. Most of the challenges here are similar to those for textual resources, except for, on the one hand, the grounding relation that spoken data has to illocutionary circumstances (time, place, speakers and addressees) and, on the other hand, the specific annotation layers that correspond to speech-related information (e.g. prosody), including multimodal aspects such as gestures.

We should also not forget, as is well illustrated in this book, the importance of legacy practices in the spoken-corpora community, most of them resulting from the existence of specific tools at various representation layers, ranging from basic transcription tools (Transcriber, PRAAT) to generic score-based annotation environments (TASX, ELAN, CLAN/CHAT (CHILDES), EMU). These tools inevitably differ in their maintenance rate and capacity, and it is therefore essential to think of standardised formats as offering the possibility of being embedded within existing practices. This implies that we have two basic scenarios in mind: (1) we want to be able to project existing data into a range of standardised representations that bear as little specificity to the original format as possible but as much faithfulness as necessary; and (2) we want standardised formats to be usable for the development of new technical platforms, thus allowing the integration of new requirements and new features. Both requirements imply standards that can incorporate features and data we have not yet envisioned. To do this, the standards should provide specification or customisation mechanisms that do not hinder their ability to improve interoperability.

That said, it is clear that such a thorough set of standards cannot be fully described in a single book chapter. Moreover, we acknowledge that there is still some work to be...
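To make the target representation concrete, the sketch below shows how a short, time-aligned transcription fragment might look in a TEI-style encoding. The element and attribute names follow the spirit of the TEI guidelines for transcriptions of speech (timeline, when, u), but the fragment is a simplified illustration, not a validated document:

    <timeline unit="s" origin="#T0">
      <when xml:id="T0" absolute="00:00:00"/>
      <when xml:id="T1" interval="1.25" since="#T0"/>
      <when xml:id="T2" interval="0.90" since="#T1"/>
    </timeline>
    <!-- each utterance is anchored to a speaker and to points on the timeline -->
    <u who="#spk1" start="#T0" end="#T1">so you want to start now</u>
    <u who="#spk2" start="#T1" end="#T2">yes let's go</u>

Because further layers (prosody, gesture) can point to the same timeline anchors without touching the base transcription, a pivot format of this kind supports exactly the projection scenario described above.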
This paper introduces a data-sharing strategy based on a conversion service rather than on a shared application, scheme, or ontology, the approaches that dominate proposals for language documentation. Although these three methods have been the basic tactics for sharing corpora, they have a conceptual flaw from the standpoint of descriptive linguistics. We report the results of a previous project, the LingDy project, and propose a basic concept for a corpus-sharing strategy that supports personal diachronic data sharing. This paper is a revised version of a handout presented at JADH2012; readers should note that its content reflects results as of 2012.
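As a rough illustration of the conversion-service idea, sharing amounts to maintaining small per-format transforms rather than forcing every corpus into one shared schema. The following minimal XSLT stub is hypothetical (the source element names are invented, and the paper does not prescribe XSLT); it maps a tool-specific turn element onto a TEI-like utterance while copying everything else through unchanged:

    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <!-- hypothetical source markup: <turn spk="A" t0="0.0" t1="1.3">...</turn> -->
      <xsl:template match="turn">
        <u who="{@spk}" start="{@t0}" end="{@t1}">
          <xsl:apply-templates/>
        </u>
      </xsl:template>
      <!-- identity template: pass all other nodes through unchanged -->
      <xsl:template match="@*|node()">
        <xsl:copy>
          <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
      </xsl:template>
    </xsl:stylesheet>

A service built from such transforms lets researchers keep working in their own tools' formats while still contributing to, and drawing from, a shared pool of corpora.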
The present paper deals with the visualization of concurrent markup. An initial discussion of the underlying model of XML instances demonstrates that valid XML exceeds the expressive power of trees. While some challenging features of concurrent markup, like overlaps, can be captured by minimally extended trees, there are other phenomena that can only be adequately expressed in XML using constructs that instantiate more general graph structures (e.g. discontinuous elements or repetitive structures). On the basis of two representation formats for concurrent markup, XStandoff and xLMNL, two distinct approaches towards its visualization are presented. The first method has been implemented in XSLT as an SVG-based 2D visualization strategy. Although this first approach provides an adequate (though not optimal) solution for overlapping structures, it is not capable of illustrating enhanced graph-based phenomena like the ones mentioned above. Therefore, some remarks about possible 3D visualizations show how adding a third dimension could contribute to an appropriately expressive visualization of concurrent markup. In addition, a prototypical implementation based on XSLT and X3D is discussed as a first step towards a three-dimensional illustration.
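The core problem can be shown in a few lines. Two annotation layers that cross each other cannot be serialised as properly nested inline XML, whereas a standoff encoding, in the spirit of XStandoff or xLMNL but with illustrative element names rather than either format's actual vocabulary, separates the base text from the annotation layers and lets segments overlap freely:

    <!-- inline attempt: ill-formed, because the two elements overlap -->
    <!-- <phrase>sell the <tone>shares</phrase> now</tone> -->

    <!-- standoff sketch: segments address character spans of the base text -->
    <primaryData xml:id="txt">sell the shares now</primaryData>
    <segment xml:id="s1" start="0" end="15"/>  <!-- "sell the shares" -->
    <segment xml:id="s2" start="9" end="19"/>  <!-- "shares now" -->
    <layer name="syntax"><phrase target="#s1"/></layer>
    <layer name="prosody"><tone target="#s2"/></layer>

Because the crossing relation lives in the pointers rather than in document order, the resulting structure is a graph rather than a tree, which is precisely what the 2D and 3D visualization strategies discussed here have to render.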