2021
DOI: 10.4312/slo2.0.2021.1.123-144

Converting raw transcripts into an annotated and turn-aligned TEI-XML corpus: the example of the Corpus of Serbian Forms of Address

Abstract: This paper describes the procedure of building a TEI-XML corpus of spoken Serbian starting from raw transcripts. The corpus consists of semi-structured interviews, which were gathered with the aim of investigating forms of address in Serbian. The interviews were thoroughly transcribed according to GAT transcribing conventions. However, the transcription was carried out without tools that would control the validity of the GAT syntax, or align the transcript with the audio records. In order to offer this resourc…
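
As a rough illustration of the kind of conversion the abstract describes, the following minimal Python sketch wraps speaker turns from a raw transcript in TEI `<u>` (utterance) elements. It is not the authors' pipeline. The input format (`SPEAKER: text` on each line), the function name `transcript_to_tei_body`, and the speaker labels are assumptions made for this example; real GAT transcripts carry far richer markup, and a valid TEI document would also need a `teiHeader`, which is omitted here.

```python
import re
import xml.etree.ElementTree as ET

# Assumed (hypothetical) raw format: "SPEAKER: utterance text" per line.
# GAT-specific notation (pauses, overlaps, prosody) is not handled here.
TURN_RE = re.compile(r"^(?P<who>[A-Za-z0-9_]+):\s*(?P<text>.+)$")


def transcript_to_tei_body(lines):
    """Build a TEI-style <body> holding one <u> element per speaker turn."""
    body = ET.Element("body")
    for line in lines:
        match = TURN_RE.match(line.strip())
        if match is None:
            # Skip blank lines and anything that is not a speaker turn.
            continue
        attrs = {
            "who": "#" + match.group("who"),  # pointer to a speaker id
            "n": str(len(body) + 1),          # running turn number
        }
        u = ET.SubElement(body, "u", attrs)
        u.text = match.group("text")
    return body


if __name__ == "__main__":
    raw = ["INT: kako ste?", "", "P01: dobro, hvala"]
    print(ET.tostring(transcript_to_tei_body(raw), encoding="unicode"))
```

Building the tree with `xml.etree.ElementTree` rather than string concatenation means special characters in the transcript are escaped on serialization, so the output stays well-formed XML.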

Cited by 4 publications (3 citation statements)
References 10 publications

“…We converted XML transcripts produced by OrthoNormal into TEI-XML encoded transcripts, following the TEI guidelines for transcriptions of speech implemented in the Corpus of Serbian Forms of Address (Lemmenmeier-Batinić, 2021, pp. 131–132).…”
Section: Corpus Compilation (mentioning)
confidence: 99%
“…The COPA-SR dataset (Choice of Plausible Alternatives in Serbian) is a translation of the English COPA dataset. CorFoA is a corpus of Serbian forms of the address containing transcripts of biographical interviews with 19 participants (Lemmenmeier-Batinić et al, 2021). MLNews is a comprehensive corpus of news articles that are Serbian language-related.…”
Section: Monolingual Corpora (mentioning)
confidence: 99%
“…Speech recognition technology has become one of the technical means of language communication in today's society. How to better apply this technology to assist people's communication is the focus of research [2]. Therefore, combining the principle of deep autoencoder and deep learning algorithm, an English vocabulary and English speech recognition model based on deep learning algorithm is proposed, which focuses on the influence of speech recognition framework on speech corpus.…”
Section: Introduction (mentioning)
confidence: 99%