2018
DOI: 10.48550/arxiv.1810.09753
Preprint

Goodness-of-Fit Tests for Large Datasets

Abstract: Nowadays, data analysis in the world of Big Data is typically associated with data mining and descriptive or exploratory statistics, e.g. cluster analysis, classification, or regression analysis. Aside from these techniques, there is a huge area of methods from inferential statistics that are rarely considered in connection with Big Data. Nevertheless, inferential methods are also of use for Big Data analysis, especially for quantifying uncertainty. The article at hand will provide some insights into methodological and techn…

Cited by 5 publications (5 citation statements)
References 23 publications

“…As thoroughly discussed in Lazariv & Lehmann (2018), as the sample size increases, the discerning power of KS tests increases. However, KS tests cannot take into account error in the measurements, and so it is possible for a test to become overpowered.…”
Section: Appendix: KS and AD Tests
Citation type: mentioning (confidence: 91%)
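To make this effect concrete, here is a minimal Python sketch (not from the cited works; the shift size, sample sizes, and seed are arbitrary illustrative choices) showing how the two-sample KS p-value collapses as the sample size grows, even for a practically negligible shift:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
shift = 0.01  # practically negligible difference between the two populations

# The same tiny shift typically goes from "undetectable" to "highly
# significant" purely because the sample size grows.
for n in (100, 10_000, 1_000_000):
    a = rng.normal(loc=0.0, scale=1.0, size=n)
    b = rng.normal(loc=shift, scale=1.0, size=n)
    result = ks_2samp(a, b)
    print(f"n = {n:>9,}  D = {result.statistic:.4f}  p = {result.pvalue:.3g}")
```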
“…As in Paper I, we quantify these differences in property distributions using bootstrapped two-sample Kolmogorov-Smirnov (KS) tests and Anderson-Darling (AD) tests. Using the full sample size of the galaxies results in an overpowered statistic as described in Lazariv & Lehmann (2018), where the tests detected every small variation in the distribution, even those that were well below the measurement errors. Since these overpowered tests are unreliable, we instead perform a bootstrapped version, taking 1000 random subsamples and reporting the average p-value of the tests performed on the subsamples.…”
Section: Virialization
Citation type: mentioning (confidence: 99%)
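A rough sketch of such a subsample-based test is shown below; the subsample size, number of draws, and the helper name subsampled_ks_pvalue are illustrative assumptions rather than the exact settings of the cited paper:

```python
import numpy as np
from scipy.stats import ks_2samp

def subsampled_ks_pvalue(x, y, subsample_size=500, n_draws=1000, seed=0):
    """Average the two-sample KS p-value over many random subsamples.

    Keeping each subsample small prevents the test from flagging
    differences far below the measurement uncertainty, which a
    full-sample test on very large inputs would detect.
    """
    rng = np.random.default_rng(seed)
    pvals = np.empty(n_draws)
    for i in range(n_draws):
        xs = rng.choice(x, size=min(subsample_size, len(x)), replace=False)
        ys = rng.choice(y, size=min(subsample_size, len(y)), replace=False)
        pvals[i] = ks_2samp(xs, ys).pvalue
    return pvals.mean()

# Two large samples whose difference is negligible in practice.
rng = np.random.default_rng(1)
x = rng.normal(0.00, 1.0, size=200_000)
y = rng.normal(0.02, 1.0, size=200_000)
print("full-sample p-value :", ks_2samp(x, y).pvalue)
print("subsampled p-value  :", subsampled_ks_pvalue(x, y))
```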
“…This is likely an overrepresentation of how different the distributions truly are though because the sample sizes of the clouds and clusters for the galaxies are so large. As discussed in Lazariv & Lehmann (2018), as the sample size becomes larger, K-S and AD tests have increasingly higher power to discern small differences in the distributions. However, these tests do not take into account the errors in the parameters, and so at large sample sizes, these tests can discern differences that are smaller than the errors in the measurements, which we consider unreliable.…”
Section: Property Distribution Comparisons
Citation type: mentioning (confidence: 99%)
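In the same spirit, here is a small sketch of the k-sample Anderson-Darling test mentioned here, using scipy.stats.anderson_ksamp; the simulated offset and measurement error are made-up illustrative values, and note that SciPy caps the returned significance level to the range [0.001, 0.25]:

```python
import numpy as np
from scipy.stats import anderson_ksamp

rng = np.random.default_rng(2)

true_offset = 0.02      # true shift between the two populations
measurement_err = 0.10  # per-point measurement uncertainty (much larger)

n = 100_000
a = rng.normal(0.0, 1.0, n) + rng.normal(0.0, measurement_err, n)
b = rng.normal(true_offset, 1.0, n) + rng.normal(0.0, measurement_err, n)

res = anderson_ksamp([a, b])
# With this many points the AD test tends to flag the offset as significant
# even though it is far smaller than the per-point measurement error.
print("AD statistic              :", res.statistic)
print("approx. significance level:", res.significance_level)
```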
“…To assess whether our LC-MS1 dataset following the correction and z-transformation steps was normally distributed, we plotted the data as a histogram and boxplot. We further performed the nonparametric one-sample Kolmogorov-Smirnov (K-S) test (85), which is well suited to analysing big data (86). Both the histogram and boxplot of the corrected data were asymmetrical, with most values being in the low range (Supplementary Figure 6A-B), which revealed that this dataset was not normally distributed.…”
Section: Assessing the Normality of LC-MS1 Datasets
Citation type: mentioning (confidence: 99%)
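As a hedged sketch of such a one-sample K-S normality check (the simulated, right-skewed data stand in for the LC-MS1 intensities and are not the authors' data; note also that testing against a normal distribution whose parameters were estimated from the same data makes the nominal p-value only approximate):

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(3)

# Stand-in for the intensities: a log-normal gives the kind of asymmetric,
# low-value-heavy distribution described in the text.
raw = rng.lognormal(mean=0.0, sigma=1.0, size=50_000)
z = (raw - raw.mean()) / raw.std()  # z-transformation

# One-sample Kolmogorov-Smirnov test against the standard normal CDF.
stat, p = kstest(z, "norm")
print(f"KS statistic = {stat:.4f}, p-value = {p:.3g}")
# A small p-value indicates the z-transformed data deviate from normality.
```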