Large-scale processing, indexing and search system for Czech audio-visual cultural heritage archives

Nouza, Jan; Blavka, Karel; Žďánský, Jindřich; Červa, Petr; Silovský, Jan; Bohac, Marek; Chaloupka, Josef; Kucharova, Michaela; Seps, Ladislav

doi:10.1109/mmsp.2012.6343465

Cited by 5 publications

(2 citation statements)

References 11 publications


“…For the transcription task, we have been adapting and enhancing a large-vocabulary continuous speech recognition (LVCSR) system developed previously in our lab. During the first two years of the 4-year project, we have implemented most of the required functionalities and utilized the system to process, transcribe and index more than 75.000 documents broadcast since 1993 to present [2]. That period did not pose a particular challenge for our research as we could employ the existing system trained for contemporary Czech.…”

Section: Introduction (mentioning)

confidence: 99%

Dealing with Bilingualism in Automatic Transcription of Historical Archive of Czech Radio

Nouza, Červa, Silovský (2013). New Trends in Image Analysis and Processing – ICIAP 2013

Self Cite


One of the biggest challenges in the automatic transcription of the historical audio archive of Czech and Czechoslovak radio is bilingualism. Two closely related languages, Czech and Slovak, are mixed in many archive documents. Both were the official languages in former Czechoslovakia (1918-1992) and both were used in media. The two languages are considered similar, although they differ in more than 75 % of their lexical inventories, which complicates automatic speech-to-text conversion. In this paper, we present and objectively measure the difference between the two languages. After that we propose a method suitable for automatic identification of two acoustically and lexically similar languages. It is based on employing 2 size-optimized parallel lexicons and language models. On large test data, we show that the 2 languages can be distinguished with almost 99 % accuracy. Moreover, the language identification module can be easily incorporated into a 2-pass decoding scheme with almost negligible additional computation costs. The proposed method has been employed in the project aimed at the disclosure of Czech and Czechoslovak oral cultural heritage.


“…For this purpose, we have adapted our previously developed large-vocabulary continuous speech recognition (LVCSR) system to deal with broadcast recordings in Czech and Slovak and designed modules for speech indexation and search. During the first 18 months of the project, we have processed about 75,000 audio files (with total duration of 30,000 hours) and created a demo version of the web service that allows for smart search in the transcribed data [7].…”

Section: Introduction (mentioning)

confidence: 99%

Using Various Types of Multimedia Resources to Train System for Automatic Transcription of Czech Historical Oral Archives

Chaloupka, Nouza, Kucharova (2013). New Trends in Image Analysis and Processing – ICIAP 2013

Self Cite


Historical spoken documents represent a unique segment of national cultural heritage. In order to disclose the large Czech Radio audio archive to the research community and to the public, we have been developing a system whose aim is to automatically transcribe the archive files, index them and make them searchable. The transcription of contemporary (1 or 2 decades old) documents is based on the lexicon and statistical language model (LM) built from a large amount of recent texts available in electronic form. From the older periods (before 1990), however, digital texts do not exist. Therefore, we needed a) to find resources that represent language of those times, b) to convert them from their original form to text, c) to utilize this text for creating epoch-specific lexicons and LMs, and eventually, d) to apply them in the developed speech recognition system. In our case, the main resources included: scanned historical newspapers, shorthand notes from the national parliament and subtitles from retro TV programs. When converted into text, they allowed us to build a more appropriate lexicon and to produce a preliminary version of the transcriptions. These were reused for unsupervised retraining of the final LM. In this way, we significantly improved the accuracy of the automatically transcribed radio news broadcast in 1969-1989 era, from initial 83 % to 88 %.


Audio-Visual TV Broadcast Signal Segmentation

Chaloupka (2019). Advances in Intelligent Systems and Computing
