Converting raw transcripts into an annotated and turn-aligned TEI-XML corpus: the example of the Corpus of Serbian Forms of Address

Lemmenmeier-Batinić, Dolores

doi:10.4312/slo2.0.2021.1.123-144

Cited by 4 publications

(3 citation statements)

References 10 publications


“…We converted XML transcripts produced by OrthoNormal into TEI-XML encoded transcripts, following the TEI guidelines for transcriptions of speech implemented in the Corpus of Serbian Forms of Address (Lemmenmeier-Batinić, 2021, pp. 131–132).…”

Section: Corpus Compilation (mentioning)

confidence: 99%

Map Task Corpus of Heritage BCMS spoken by second-generation speakers in Switzerland

Lemmenmeier-Batinić

Batinić

Escher

2023

Lang Resources & Evaluation

Self Cite


In this paper, we present a corpus for heritage Bosnian/Croatian/Montenegrin/Serbian (BCMS) spoken in German-speaking Switzerland. The corpus consists of elicited conversations between 29 second-generation speakers originating from different regions of former Yugoslavia. In total, the corpus contains 30 turn-aligned transcripts with an average length of 6 min. It is enriched with extensive speakers’ metadata, annotations, and pre-calculated corpus counts. The corpus can be accessed through an interactive corpus platform that allows for browsing, querying, and filtering, but also for creating and sharing custom annotations. Principal user groups we address with this corpus are researchers of heritage BCMS, as well as students and teachers of BCMS living in diaspora. In addition to introducing the corpus platform and the workflows we adopted to create it, we also present a case study on BCMS spoken by a pair of siblings who participated in the map task, and discuss advantages and challenges of using this corpus platform for linguistic research.


“…The COPA-SR dataset (Choice of Plausible Alternatives in Serbian) is a translation of the English COPA dataset. CorFoA is a corpus of Serbian forms of the address containing transcripts of biographical interviews with 19 participants (Lemmenmeier-Batinić et al, 2021). MLNews is a comprehensive corpus of news articles that are Serbian language-related.…”

Section: Monolingual Corpora (mentioning)

confidence: 99%

Creating a stop word dictionary in Serbian

Marovac

Avdić

Ljajić

2021

Sci Pub Univ Novi Pazar Ser A


Natural language processing techniques make it possible to obtain a great deal of information, for example by extracting document topics through mapping of document keywords or by content-based classification of documents. An important step in obtaining this information is to separate the words that carry informative value in a sentence from those that do not affect its meaning. Dictionaries of stop words specific to each natural language are used to mark the words that do not carry meaning in the sentence. This paper presents the creation of a stop word dictionary for Serbian. The influence of stop words on text processing is shown on three different data sets. Using the proposed dictionary of Serbian stop words reduces the data set dimension by 15% to 39%, while the quality of the obtained n-gram language models improves.

“…Speech recognition technology has become one of the technical means of language communication in today's society. How to better apply this technology to assist people's communication is the focus of research [2]. Therefore, combining the principle of deep autoencoder and deep learning algorithm, an English vocabulary and English speech recognition model based on deep learning algorithm is proposed, which focuses on the influence of speech recognition framework on speech corpus.…”

Section: Introduction (mentioning)

confidence: 99%

Research on English Vocabulary and Speech Corpus Recognition Based on Deep Learning

Zhen

2022

Wireless Communications and Mobile Computing


To investigate how to recognize English words and speech corpora, an English vocabulary and speech recognition model based on a deep learning algorithm was proposed. The study examined key technical problems and their solutions based on the deep learning algorithm in order to realize the recognition of English vocabulary and speech corpora. In the study, the accuracy of the deep-learning-based method for English vocabulary and speech corpus recognition increased by 79% over previous methods. Combining the principle of the deep autoencoder with the deep learning algorithm, the research focused on the effects of the speech recognition framework on the speech corpus. Speech recognition research based on deep learning theory has not only theoretical significance but also practical value.
