The annotation of linguistic resources has long-standing traditions (see Cole et al., 2010). The other chapters of this book make clear that the production of annotated resources is a laborious, time-consuming, and expensive task. In theory, we want to make these resources available in such a way that they can be re-used by as many scholars as possible (see Ide & Romary, 2002). However, a large variety of annotation formats has been developed in the previous decades, each one created for a specific research task. Consequently, the resulting resources are frequently only usable by members of the individual research projects.

The goal of the present chapter is to explore the possibility of providing the research and industrial communities that commonly use spoken corpora with a set of well-documented standardised formats that allow a high re-use rate of annotated spoken resources and, as a consequence, better interoperability across the tools used to produce or exploit such resources. We hope to identify standards that cover all possible aspects of the management workflow of spoken data, from the actual representation of raw recordings and transcriptions to high-level content-related information at a semantic or pragmatic level. Most of the challenges here are similar to those for textual resources, except for, on the one hand, the grounding relation that spoken data has to illocutionary circumstances (time, place, speakers and addressees) and, on the other hand, the specific annotation layers that correspond to speech-related information (e.g. prosody), including multimodal aspects such as gestures.

We should also not forget, as is well illustrated in this book, the importance of legacy practices in the spoken-corpora community, most of them resulting from the existence of specific tools at various representation layers, ranging from basic transcription tools (Transcriber, PRAAT) to generic score-based annotation environments (TASX, ELAN, CLAN/CHAT (CHILDES), EMU). These tools inevitably differ in their maintenance rate and capacity, and it is therefore essential to think of standardised formats as offering the possibility of being embedded within existing practices. This implies that we have two basic scenarios in mind: (1) we want to be able to project existing data into a range of standardised representations that bear as little specificity to the original format as possible but as much faithfulness as necessary; and (2) we want standardised formats to be usable for the development of new technical platforms, thus allowing the integration of new requirements and new features. Both requirements imply standards that can incorporate features and data we have not yet envisioned. To do this, the standards should provide specification or customisation mechanisms that do not hinder their ability to improve interoperability.

That said, it is clear that such a thorough set of standards cannot be fully described in a single book chapter. Moreover, we acknowledge that there is still some work to be...
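To make the target representation concrete, the sketch below shows how a short, time-aligned transcription fragment might look in a TEI-style encoding. The element and attribute names follow the spirit of the TEI guidelines for transcriptions of speech (timeline, when, u), but the fragment is a simplified illustration, not a validated document:

    <timeline unit="s" origin="#T0">
      <when xml:id="T0" absolute="00:00:00"/>
      <when xml:id="T1" interval="1.25" since="#T0"/>
      <when xml:id="T2" interval="0.90" since="#T1"/>
    </timeline>
    <!-- each utterance is anchored to a speaker and to points on the timeline -->
    <u who="#spk1" start="#T0" end="#T1">so you want to start now</u>
    <u who="#spk2" start="#T1" end="#T2">yes let's go</u>

Because further layers (prosody, gesture) can point to the same timeline anchors without touching the base transcription, a pivot format of this kind supports exactly the projection scenario described above.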
This paper introduces a data-sharing strategy based on a conversion service rather than on a shared application, scheme, or ontology, the approaches that dominate proposals for language documentation. Although these three methods have been the basic tactics for sharing corpora, they have a conceptual flaw from the standpoint of descriptive linguistics. We report the results of a previous project, the LingDy project, and propose a basic concept for a corpus-sharing strategy that supports personal diachronic data sharing. This paper is a revised version of a handout presented at JADH2012; readers should note that its content reflects results as of 2012.
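As a rough illustration of the conversion-service idea, sharing amounts to maintaining small per-format transforms rather than forcing every corpus into one shared schema. The following minimal XSLT stub is hypothetical (the source element names are invented, and the paper does not prescribe XSLT); it maps a tool-specific turn element onto a TEI-like utterance while copying everything else through unchanged:

    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <!-- hypothetical source markup: <turn spk="A" t0="0.0" t1="1.3">...</turn> -->
      <xsl:template match="turn">
        <u who="{@spk}" start="{@t0}" end="{@t1}">
          <xsl:apply-templates/>
        </u>
      </xsl:template>
      <!-- identity template: pass all other nodes through unchanged -->
      <xsl:template match="@*|node()">
        <xsl:copy>
          <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
      </xsl:template>
    </xsl:stylesheet>

A service built from such transforms lets researchers keep working in their own tools' formats while still contributing to, and drawing from, a shared pool of corpora.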
The present paper deals with the visualization of concurrent markup. An initial discussion of the underlying model of XML instances demonstrates that valid XML exceeds the expressive power of trees. While some challenging features of concurrent markup, like overlaps, can be captured by minimally extended trees, there are other phenomena that can only be adequately expressed in XML using constructs that instantiate more general graph structures (e.g. discontinuous elements or repetitive structures). On the basis of two representation formats for concurrent markup, XStandoff and xLMNL, two distinct approaches towards its visualization are presented. The first method has been implemented in XSLT as an SVG-based 2D visualization strategy. Although this first approach provides an adequate (though not optimal) solution for overlapping structures, it is not capable of illustrating enhanced graph-based phenomena like the ones mentioned above. Therefore, some remarks about possible 3D visualizations show how adding a third dimension could contribute to an appropriately expressive visualization of concurrent markup. In addition, a prototypical implementation based on XSLT and X3D is discussed as a first step towards a three-dimensional illustration.
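The core problem can be shown in a few lines. Two annotation layers that cross each other cannot be serialised as properly nested inline XML, whereas a standoff encoding, in the spirit of XStandoff or xLMNL but with illustrative element names rather than either format's actual vocabulary, separates the base text from the annotation layers and lets segments overlap freely:

    <!-- inline attempt: ill-formed, because the two elements overlap -->
    <!-- <phrase>sell the <tone>shares</phrase> now</tone> -->

    <!-- standoff sketch: segments address character spans of the base text -->
    <primaryData xml:id="txt">sell the shares now</primaryData>
    <segment xml:id="s1" start="0" end="15"/>  <!-- "sell the shares" -->
    <segment xml:id="s2" start="9" end="19"/>  <!-- "shares now" -->
    <layer name="syntax"><phrase target="#s1"/></layer>
    <layer name="prosody"><tone target="#s2"/></layer>

Because the crossing relation lives in the pointers rather than in document order, the resulting structure is a graph rather than a tree, which is precisely what the 2D and 3D visualization strategies discussed here have to render.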