2015
DOI: 10.1162/coli_a_00225

Large Linguistic Corpus Reduction with SCP Algorithms

Abstract: Linguistic corpus design is a critical concern for building rich annotated corpora useful in different application domains. For example, speech technologies such as ASR (Automatic Speech Recognition) or TTS (Text-to-Speech) need a huge amount of speech data to train data-driven models or to produce synthetic speech. Collecting data always incurs costs (recording speech, verifying annotations, etc.), and as a rule of thumb, the more data you gather, the more costly your application will be. Within th…

Cited by 5 publications (7 citation statements)
References 18 publications
“…• SC: this system is based on a set covering problem which is solved by a greedy strategy [5]. The best utterances are selected to cover η times each linguistic feature.…”
Section: Methods
confidence: 99%
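The greedy multi-cover selection described in the statement above can be sketched as follows. This is a minimal illustration under stated assumptions, not the cited system's implementation: `eta` plays the role of η, and abstract feature labels stand in for linguistic units such as diphones.

```python
# Minimal sketch of greedy set-cover corpus reduction (illustrative,
# not the implementation from the cited paper): pick utterances until
# every linguistic feature is covered at least eta times.

def greedy_multi_cover(utterances, eta):
    """utterances: dict mapping utterance id -> iterable of features.
    Returns the list of selected utterance ids, in selection order."""
    all_feats = {f for feats in utterances.values() for f in feats}
    # remaining[f] = how many more times feature f must still be covered
    remaining = {f: eta for f in all_feats}
    selected = []
    pool = dict(utterances)
    while any(v > 0 for v in remaining.values()) and pool:
        # score = number of distinct features whose residual demand
        # this utterance would reduce (each utterance contributes at
        # most one occurrence per feature here)
        def gain(feats):
            return sum(min(remaining[f], 1) for f in set(feats))
        best = max(pool, key=lambda u: gain(pool[u]))
        if gain(pool[best]) == 0:
            break  # no remaining utterance reduces residual demand
        for f in set(pool[best]):
            if remaining[f] > 0:
                remaining[f] -= 1
        selected.append(best)
        del pool[best]
    return selected
```

With three toy utterances over features `a`, `b`, `c`, an η of 2 forces all three utterances into the selection, while η of 1 is satisfied by two of them.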
“…The most commonly used algorithmic strategy is the greedy one. In [5], the combination of agglomerative and splitting greedy phases has been assessed and compared to a Lagrangian relaxation based algorithm to derive full multi-represented coverings of diphonemes and triphonemes. The Lagrangian relaxation approach provides a lower bound showing that greedy strategies build solutions close to optimal ones.…”
Section: Previous Work
confidence: 99%
“…The main proposed approaches consist of extracting, from a large textual corpus (for instance the target book to be vocalized), a minimal subset of sentences that maximizes an optimisation criterion. This criterion is often related to the maximization of the linguistic coverage [4,5,6] (formalized as a set covering problem) or the closeness to a target linguistic distribution [7,8]. Different algorithms have been compared, and the most widely used approach is the greedy one, providing a good trade-off between computational time and closeness to the optimal solution [6].…”
Section: Introduction
confidence: 99%
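The second criterion mentioned above, closeness to a target linguistic distribution, can be illustrated with a hypothetical greedy sketch: repeatedly add the utterance that minimizes the KL divergence between the pooled feature counts of the selection and a target distribution. The function names and the add-one smoothing are illustrative assumptions, not details from [7,8].

```python
import math
from collections import Counter

def kl_to_target(counts, target):
    """KL(target || empirical) with add-one smoothing of the counts."""
    total = sum(counts.values()) + len(target)
    return sum(p * math.log(p / ((counts.get(f, 0) + 1) / total))
               for f, p in target.items())

def greedy_distribution_match(utterances, target, k):
    """Greedily pick k utterances whose pooled feature distribution is
    closest (in KL divergence) to the target distribution."""
    selected, counts = [], Counter()
    pool = dict(utterances)
    for _ in range(min(k, len(pool))):
        best = min(pool,
                   key=lambda u: kl_to_target(counts + Counter(pool[u]),
                                              target))
        counts += Counter(pool[best])
        selected.append(best)
        del pool[best]
    return selected
```

For a uniform target over two features, the sketch prefers an utterance that contributes both features equally over one that repeats a single feature.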
“…Some studies, as in [6], point out that the designed voices tend to be composed of utterances shorter than those of the initial pool. In [4], the set covering problem is addressed with a greedy strategy which selects a sub-corpus with an average length of 20 phonemes per sentence out of an initial corpus with an average length of 74 (this approach will be named set covering in the following).…”
Section: Introduction
confidence: 99%