Imbalanced data classification using MapReduce and relief

Jędrzejowicz, Joanna; Kostrzewski, Robert; Neumann, Jakub; Zakrzewska, Magdalena

doi:10.1080/24751839.2018.1440454

Cited by 5 publications

(8 citation statements)

References 15 publications


“…bags of not clearly labeled instances [120, 121], dealing with non-monotonic relationships [9, 31], dealing with survival data (i.e. data exploring the duration of time until one or more events happen) [6], dealing with imbalanced data [88, 49], clustering [24], and feature extraction [105].…”

Section: A Review Of Relief-based Algorithms (mentioning)

confidence: 99%

Relief-based feature selection: Introduction and review

Urbanowicz, Meeker, Cava, et al. 2018

Journal of Biomedical Informatics


Feature selection plays a critical role in biomedical data mining, driven by increasing feature dimensionality in target problems and growing interest in advanced but computationally expensive methodologies able to model complex associations. Specifically, there is a need for feature selection methods that are computationally efficient, yet sensitive to complex patterns of association, e.g. interactions, so that informative features are not mistakenly eliminated prior to downstream modeling. This paper focuses on Relief-based algorithms (RBAs), a unique family of filter-style feature selection algorithms that have gained appeal by striking an effective balance between these objectives while flexibly adapting to various data characteristics, e.g. classification vs. regression. First, this work broadly examines types of feature selection and defines RBAs within that context. Next, we introduce the original Relief algorithm and associated concepts, emphasizing the intuition behind how it works, how feature weights generated by the algorithm can be interpreted, and why it is sensitive to feature interactions without evaluating combinations of features. Lastly, we include an expansive review of RBA methodological research beyond Relief and its popular descendant, ReliefF. In particular, we characterize branches of RBA research, and provide comparative summaries of RBA algorithms including contributions, strategies, functionality, time complexity, adaptation to key data characteristics, and software availability.


“…However, the protein was not quantified for this cohort. The imbalanced distribution of individuals without kidney dysfunction in this group of SCD patients likely affects the performance of the different regression models (Jedrzejowicz et al, 2018), tending to be biased toward the normal ranges (KrishnaVeni and Sobha, 2011) and potentially failing to identify possible signals.…”

Section: Limitations (mentioning)

confidence: 99%

Investigations of Kidney Dysfunction-Related Gene Variants in Sickle Cell Disease Patients in Cameroon (Sub-Saharan Africa)

et al. 2021


Background: Renal dysfunctions are associated with increased morbidity and mortality in sickle cell disease (SCD). Early detection and subsequent management of SCD patients at risk for renal failure and dysfunctions are essential, however, predictors that can identify patients at risk of developing renal dysfunction are not fully understood.

Methods: In this study, we have investigated the association of 31 known kidney dysfunctions-related variants detected in African Americans from multi-ethnic genome wide studies (GWAS) meta-analysis, to kidney-dysfunctions in a group of 413 Cameroonian patients with SCD. Systems level bioinformatics analyses were performed, employing protein-protein interaction networks to further interrogate the putative associations.

Results: Up to 61% of these patients had micro-albuminuria, 2.4% proteinuria, 71% glomerular hyperfiltration, and 5.9% had renal failure. Six variants are significantly associated with the two quantifiable phenotypes of kidney dysfunction (eGFR and crude-albuminuria): A1CF-rs10994860 (P = 0.02020), SYPL2-rs12136063 (P = 0.04208), and APOL1 (G1)-rs73885319 (P = 0.04610) are associated with eGFR; and WNT7A-rs6795744 (P = 0.03730), TMEM60-rs6465825 (P = 0.02340), and APOL1 (G2)-rs71785313 (P = 0.03803) observed to be protective against micro-albuminuria. We identified a protein-protein interaction sub-network containing three of these gene variants: APOL1, SYPL2, and WNT7A, connected to the Nuclear factor NF-kappa-B p105 subunit (NFKB1), revealed to be essential and might indirectly influence extreme phenotypes. Interestingly, clinical variables, including body mass index (BMI), systolic blood pressure, vaso-occlusive crisis (VOC), and haemoglobin (Hb), explain better the kidney phenotypic variations in this SCD population.

Conclusion: This study highlights a strong contribution of haematological indices (Hb level), anthropometric variables (BMI, blood pressure), and clinical events (i.e., vaso-occlusive crisis) to kidney dysfunctions in SCD, rather than known genetic factors. Only 6/31 characterised gene-variants are associated with kidney dysfunction phenotypes in SCD samples from Cameroon. The data reveal and emphasise the urgent need to extend GWAS studies in populations of African ancestries living in Africa, and particularly for kidney dysfunctions in SCD.


“…Yeast dataset (Hu et al, 2015) have 8 real attributes with 1,484 instances. Various kinds of datasets from the Keel dataset repository (Verbiest et al, 2012; Ahmed et al, 2019; Gong and Kim, 2017; Jedrzejowicz et al, 2018; Fernández et al, 2017; Wang, 2019) are mostly used in handling imbalanced datasets. Liver-Disorders-Bupa (Ebenuwa et al, 2019) contains 345 instances with 7 attributes where attribute types are Categorical, integer and real.…”

Section: Used Dataset In Researches (mentioning)

confidence: 99%