Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2011
DOI: 10.1145/2020408.2020496

Leakage in data mining

Cited by 97 publications (41 citation statements)
References 10 publications
“…For example, in a top medical data mining contest, the accession numbers revealed the disease status of the patient, which was the aim of the contest [57]. In addition, pattern analysis of a large amount of public data revealed temporal and spatial commonalities in the assignment system that allowed predictions of US social security numbers (SSNs) from quasi-identifiers [58].…”
Section: Identity Tracing Attacks (mentioning)
confidence: 99%
“…In some contexts, it may not be possible to correct for all of these differences to the degree that a deep network is unable to use them. Moreover, the complexity of deep networks makes it difficult to determine when their predictions are likely to be based on such nominally irrelevant features of the data (called ‘leakage’ in other fields [167]). When we are not careful with our data and models, we may inadvertently say more about the way the data were collected (which may involve a history of unequal access and discrimination) than about anything of scientific or predictive value.…”
Section: Deep Learning and Patient Categorization (mentioning)
confidence: 99%
“…Raw microarray and RNA-seq data were log2 transformed where necessary after which the data was (inner) merged and randomly divided into a training (2/3) and test set (1/3). As the test set should remain hidden from the training set, raw microarray and RNA-seq data was used instead of normalized data to prevent data leakage (56). Elastic net regression was subsequently performed using the R glmnet (v2.0) (57) package where we set the alpha to 0.8.…”
Section: Classification Analysis (mentioning)
confidence: 99%
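
The statement above describes keeping the test set hidden from any step that learns from the data. A minimal Python sketch of that idea follows. It is not the citing authors' pipeline (they worked in R with glmnet), and the function name `split_then_standardize`, the per-feature standardization step, and the toy data are assumptions made only for illustration. The point it shows is the ordering: split first, then estimate any scaling statistics from the training rows alone.

```python
import random
import statistics


def split_then_standardize(rows, test_fraction=1 / 3, seed=0):
    """Hypothetical helper: split rows into train/test, then scale both
    using statistics estimated from the training rows only."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)

    # Split BEFORE computing any statistics, so test rows stay hidden.
    n_test = round(len(rows) * test_fraction)
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]

    n_features = len(rows[0])
    # Per-feature mean and standard deviation from the training set alone.
    means = [statistics.fmean([r[j] for r in train]) for j in range(n_features)]
    # Fall back to 1.0 for constant features to avoid division by zero.
    stds = [statistics.pstdev([r[j] for r in train]) or 1.0
            for j in range(n_features)]

    def scale(block):
        return [[(r[j] - means[j]) / stds[j] for j in range(n_features)]
                for r in block]

    # The test set is transformed with training statistics, never its own.
    return scale(train), scale(test), means, stds


# Toy data: 9 samples, 2 features (assumed values, for illustration only).
rows = [[float(i), float(2 * i)] for i in range(9)]
train, test, means, stds = split_then_standardize(rows)
```

Computing the means and standard deviations over all rows before splitting would let information about the test samples influence the training features, which is the kind of leakage the statement refers to.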