Compilation, transcription and usage of a reference speech corpus: the case of the Slovene corpus GOS

Verdonik, Darinka; Kosem, Iztok; Vitez, Ana Zwitter; Krek, Simon; Stabej, Marko

doi:10.1007/s10579-013-9216-5

Cited by 26 publications

(14 citation statements)

References 4 publications

“…Besides Schmidt, Hedeland and Jettka (2017) and the ISO specification itself (ISO 2016), the role of TEI as a suitable basis of a standard for spoken language transcription has been discussed, among others, by Schmidt (2011) and Liégeois et al (2017). The TEI guidelines' chapter 8 on "Transcriptions of Speech" (TEI Consortium 2019) has also been used in CLARIN resources such as the GOS Corpus of Spoken Slovene (see Verdonik et al 2013) and as the basis for a CLARIN-wide format for parliamentary data.…”

Section: Related Work (mentioning)

confidence: 99%

CLARIN Web Services for TEI-annotated Transcripts of Spoken Language

Fisseni, Schmidt

2020

Linköping Electronic Conference Proceedings

“…The SST treebank currently amounts to 29,488 tokens (3,188 utterances), which include both lexical tokens (words) and tokens signalling other types of verbal phenomena, such as filled pauses (fillers) and unfinished words, as well as some basic markers of prosody and extralinguistic speech events. The original segmentation, tokenization and spelling principles described by Verdonik et al (2013) have also been inherited by SST. Among the two types of Gos transcriptions (pronunciation-based and normalized spelling, both in lowercase only), subsequent manual annotations in SST have been performed on top of normalized transcriptions.…”

Section: Spoken Slovenian Treebank (mentioning)

confidence: 99%

“…Segmentation: Inheriting the manual segmentation of the reference Gos corpus, sentences (utterances) in SST correspond to "semantically, syntactically and acoustically delimited units" (Verdonik et al, 2013). As such, the utterance segmentation heavily depends on subjective interpretations of what is the basic functional unit in speech, in line with the multitude of existing segmentation approaches, based on syntax, semantics, prosody, or their various combinations (Degand and Simon, 2009).…”

Section: Modifications Of Speech Transcription (mentioning)

confidence: 99%

Er ... well, it matters, right? On the role of data representations in spoken language dependency parsing

Dobrovoljc, Martinc

2018

Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)

Despite the significant improvement of data-driven dependency parsing systems in recent years, they still achieve a considerably lower performance in parsing spoken language data in comparison to written data. On the example of Spoken Slovenian Treebank, the first spoken data treebank using the UD annotation scheme, we investigate which speech-specific phenomena undermine parsing performance, through a series of training data and treebank modification experiments using two distinct state-of-the-art parsing systems. Our results show that utterance segmentation is the most prominent cause of low parsing performance, both in parsing raw and pre-segmented transcriptions. In addition to shorter utterances, both parsers perform better on normalized transcriptions including basic markers of prosody and excluding disfluencies, discourse markers and fillers. On the other hand, the effects of written training data addition and speech-specific dependency representations largely depend on the parsing system selected.

“…Typically, spoken language annotation denotes annotation of its representation in the form of written transcription. In the Spoken Slovenian Treebank, the spelling, tokenization and segmentation principles follow the transcription guidelines of the reference Gos corpus (Verdonik et al, 2013). The syntactic trees in the treebank span over individual utterances, manually delimited in the process of reference corpus transcription.…”

Section: Segmentation, Tokenization and Spelling (mentioning)

confidence: 99%

The Universal Dependencies Treebank for Slovenian

Dobrovoljc, Erjavec, Krek

2017

Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing

Self Cite

This paper introduces the Universal Dependencies Treebank for Slovenian. We overview the existing dependency treebanks for Slovenian and then detail the conversion of the ssj200k treebank to the framework of Universal Dependencies version 2. We explain the mapping of part-of-speech categories, morphosyntactic features, and the dependency relations, focusing on the more problematic language-specific issues. We conclude with a quantitative overview of the treebank and directions for further work.
