2018
DOI: 10.1007/s10579-017-9410-y

The Spoken Wikipedia Corpus collection: Harvesting, alignment and an application to hyperlistening

Cited by 30 publications
(16 citation statements)
References 10 publications
“…The data has been prepared as a corpus [12]. Importantly, the annotation includes a linguistic sentence segmentation and tokenization and the relation of original and normalized text has been preserved, allowing the timings of the aligned normalized text to be mapped to each original text token, thus bridging the gap between speech and language processing.…”
Section: Data and Setup (mentioning)
confidence: 99%
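
The mapping described in this statement, from timings on normalized tokens back to original text tokens, can be illustrated with a minimal sketch. The class names, fields, and sample timings below are hypothetical and invented for illustration; they are not the corpus's actual annotation format. The sketch assumes each original token keeps a list of the normalized tokens it expands to, and that an original token's time span runs from the earliest start to the latest end among its aligned normalized tokens.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class NormToken:
    """One token of the normalized text (hypothetical structure)."""
    text: str
    start: Optional[float] = None  # seconds; None if no alignment is available
    end: Optional[float] = None


@dataclass
class OrigToken:
    """One token of the original text, linked to its normalized tokens."""
    text: str
    norm: List[NormToken]


def token_span(tok: OrigToken) -> Optional[Tuple[float, float]]:
    """Return (start, end) covering all aligned normalized tokens, or None."""
    timed = [n for n in tok.norm if n.start is not None and n.end is not None]
    if not timed:
        return None  # unaligned token or token with no spoken counterpart
    return min(n.start for n in timed), max(n.end for n in timed)


# Illustrative data only: "1998" expands to three normalized tokens.
sentence = [
    OrigToken("In", [NormToken("in", 0.00, 0.12)]),
    OrigToken("1998", [NormToken("nineteen", 0.12, 0.55),
                       NormToken("ninety", 0.55, 0.90),
                       NormToken("eight", 0.90, 1.15)]),
    OrigToken(",", []),  # punctuation has no normalized tokens
]
spans = [token_span(t) for t in sentence]
```

Because the original tokens are never modified, other annotations on the original text can be combined with the recovered timings; tokens without alignment information simply yield `None`.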
“…We limit our analysis to the German sub-corpus of the Spoken Wikipedia which contains some 1000 articles totaling 386 h of audio (360 h after VAD) and 3 M word tokens read by 350 different speakers [12]. The alignment favors quality over coverage and hence only about 70 % of the word tokens have alignment information available.…”
Section: Data and Setup (mentioning)
confidence: 99%
“…• Text-Speech Aligner: we use the text-speech aligner published by [18] which uses a variation of the SailAlign algorithm [19] implemented using Sphinx-4 [20]. The alignments are stored in a format that guarantees the original text to remain unchanged (which is important to be able to combine them with syntactic and other annotations).…”
Section: Processing Tools (mentioning)
confidence: 99%
“…Various strategies have been proposed to collect speech and text resources for technology development, for example harvesting existing data like broadcast news and online publications, crowd-sourcing, web crawling, dedicated data collection campaigns, etcetera [7][8][9][10][11][12][13]. Both data types are required for language and speech technology development, and constructing comprehensive text corpora is just as important as creating speech resources.…”
Section: Introduction (mentioning)
confidence: 99%