Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications 2014
DOI: 10.3115/v1/w14-1809
Automatic evaluation of spoken summaries: the case of language assessment

Abstract: This paper investigates whether ROUGE, a popular metric for the evaluation of automated written summaries, can be applied to the assessment of spoken summaries produced by non-native speakers of English. We demonstrate that ROUGE, with its emphasis on the recall of information, is particularly suited to the assessment of the summarization quality of non-native speakers' responses. A standard baseline implementation of ROUGE-1 computed over the output of the automated speech recognizer has a Spearman correlatio…
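The recall-oriented ROUGE-1 metric described in the abstract reduces to unigram recall against a reference summary. The sketch below is a minimal illustration, not the official ROUGE toolkit (which additionally supports stemming, stopword removal, and multiple references); the function name and example texts are hypothetical:

```python
from collections import Counter

def rouge1_recall(candidate: str, reference: str) -> float:
    """ROUGE-1 recall: the fraction of reference unigrams that also
    appear in the candidate (clipped by per-word counts)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / sum(ref.values())

# 5 of the 9 reference tokens are recovered by the candidate.
score = rouge1_recall(
    "the speaker summarized the main argument",
    "the speaker restated the main argument of the lecture",
)
```

Because the denominator is the reference length, a verbose response is not penalized for extra words, which is why the paper argues recall-oriented ROUGE suits the assessment of how much source content a speaker reproduces.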

Cited by 9 publications (8 citation statements). References 17 publications.
“…We use significance test to prove that similarity metric is reliable even though the numerical difference of similarity scores in experiment is little. Because the similarity scores of generated summaries do not follow normal distribution, we take Kruskal-Wallis test (Loukina et al., 2014; Albert, 2017) as our significance test to measure that the difference of similarity results of three methods is significant or not. As shown in Table 9, all p-values are less than 0.05.…”
Section: Significance Test on Similarity Results
confidence: 99%
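The Kruskal-Wallis test cited in this statement compares mean ranks across groups without assuming normality. A pure-Python sketch of the H statistic follows, under two stated simplifications: it omits the tie-correction factor and the chi-squared p-value lookup that a library routine such as scipy.stats.kruskal provides:

```python
def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic (simplified: no tie-correction factor).
    Pools all observations, ranks them (average ranks for ties), and
    measures how far each group's rank sum departs from the pooled mean."""
    pooled = sorted((x, gi) for gi, g in enumerate(groups) for x in g)
    values = [v for v, _ in pooled]
    ranks = {}
    i = 0
    while i < len(values):
        # Find the run of tied values and assign them their average rank.
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of 1-indexed ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        i = j
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    for idx, (_, gi) in enumerate(pooled):
        rank_sums[gi] += ranks[idx]
    return 12.0 / (n * (n + 1)) * sum(
        rs * rs / len(g) for rs, g in zip(rank_sums, groups)
    ) - 3 * (n + 1)

# Three well-separated groups yield a large H.
h = kruskal_wallis_h([1, 2, 3], [4, 5, 6], [7, 8, 9])
```

The p-value in the quoted passage would then come from comparing H against a chi-squared distribution with (number of groups − 1) degrees of freedom.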
“…In addition to the relatively straightforward method of using CVA models and cosine similarity calculations to produce the content features, additional approaches have been investigated for scoring spontaneous speech. Some of these include using latent semantic analysis (LSA; Metallinou & Cheng, ), pointwise mutual information (Xie, Evanini, & Zechner, ), and the ROUGE summarization evaluation metric (Lin & Rey, ; Loukina, Zechner, & Chen, ).…”
Section: Discussion
confidence: 99%
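The content features this statement refers to are typically cosine similarities between a vector representation of the spoken response and reference material. The sketch below uses raw bag-of-words counts as a minimal stand-in; operational CVA or LSA systems use trained, weighted vector spaces, so treat this as an illustration of the similarity calculation only:

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity over bag-of-words term-frequency vectors:
    dot product of the count vectors divided by the product of norms."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)  # Counter returns 0 for missing keys
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

identical = cosine_similarity("a b c", "a b c")   # maximal similarity
disjoint = cosine_similarity("apples", "oranges") # no shared terms
```

A response is then scored by its similarity to exemplar responses at each score level, with the unseen response assigned toward the level it most resembles.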
“…Since the early 2000s, several groups have built systems for scoring less constrained and more unpredictable speaking items, which incorporated additional sources of information for scoring, for example, diversity of vocabulary or grammatical complexity (Bernstein et al, ; Chen & Zechner, ; Strik, Van De Loo, Van Doremalen, & Cucchiarini, ; Yoon, Bhat, & Zechner, ; Zechner, Higgins, Xi, & Williamson, ). Recent work has also looked at evaluating the content relevance of spoken responses (Loukina, Zechner, & Chen, ; Somasundaran, Lee, Chodorow, & Wang, ; Xie, Evanini, & Zechner, ).…”
Section: Overview of Item Types Used
confidence: 99%