2013
DOI: 10.1017/s1351324912000381

Coping with highly imbalanced datasets: A case study with definition extraction in a multilingual setting

Abstract: This paper addresses the task of automatic extraction of definitions by thoroughly exploring an approach that solely relies on machine learning techniques, and by focusing on the issue of the imbalance of relevant datasets. We obtained a breakthrough in terms of the automatic extraction of definitions, by extensively and systematically experimenting with different sampling techniques and their combination, as well as a range of different types of classifiers. Performance consistently scored in the range of 0.9…

citations
Cited by 15 publications
(7 citation statements)
references
References 43 publications
“…In the early days of DE, rule-based approaches leveraged linguistic cues observed in definitional data (Rebeyrolle and Tanguy, 2000; Klavans and Muresan, 2001; Malaisé et al, 2004; Saggion and Gaizauskas, 2004; Storrer and Wellinghoff, 2006). However, in order to deal with problems like language dependence and domain specificity, machine learning was incorporated in more recent contributions (Del Gaudio et al, 2013), which focused on encoding informative lexico-syntactic patterns in feature vectors (Cui et al, 2005; Fahmi and Bouma, 2006; Westerhout and Monachesi, 2007; Borg et al, 2009), both in supervised and semi-supervised settings (Reiplinger et al, 2012;.…”
Section: Introduction (mentioning)
confidence: 99%
“…This method has been used successfully in various medical, computer, and natural sciences studies [49,50]. The primary purpose of this approach is to deal with very imbalanced datasets [51].…”
Section: Methods Used (mentioning)
confidence: 99%
“…Therefore, accuracy is not considered as a good measure if data is highly unsymmetrical or imbalanced. As our original dataset is imbalanced for gender class, so considering only the accuracy is not a good measure to evaluate the performance of author profiling techniques (Del Gaudio, Batista and Branco 2014). Therefore, F1 measure is also considered to evaluate the performance of author profiling methods because it provides the unbiased reflections for skewed data or an uneven class distribution.…”
Section: Methods (mentioning)
confidence: 99%
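
The point made in the statement above, that accuracy can mislead on imbalanced data while F1 does not, can be illustrated with a small sketch. The data here is a hypothetical toy set (95 negatives, 5 positives) and the two "classifiers" are made-up prediction lists; none of it comes from the cited papers.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: 95 negative examples followed by 5 positive ones.
y_true = [0] * 95 + [1] * 5

# Classifier A (assumed): always predicts the majority class.
majority_pred = [0] * 100

# Classifier B (assumed): 6 false alarms, finds 4 of the 5 positives.
detector_pred = [1] * 6 + [0] * 89 + [1] * 4 + [0] * 1

acc_majority = accuracy(y_true, majority_pred)
f1_majority = f1_score(y_true, majority_pred)
acc_detector = accuracy(y_true, detector_pred)
f1_detector = f1_score(y_true, detector_pred)

print(f"majority: accuracy={acc_majority:.2f}, F1={f1_majority:.2f}")
print(f"detector: accuracy={acc_detector:.2f}, F1={f1_detector:.2f}")
```

In this toy setup the always-negative classifier scores higher on accuracy (0.95 versus 0.93) yet has an F1 of 0 because it never finds a positive example, while the second classifier reaches an F1 of about 0.53.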