2020
DOI: 10.1186/s40537-020-00301-0

Investigating class rarity in big data

Abstract: Introduction. When called upon to define big data, researchers and practitioners in the field of data science frequently refer to the six V's: volume, variety, velocity, variability, value, and veracity [1]. Volume, most certainly the best-known property of big data, is associated with the profusion of data produced by an organization. Variety covers the handling of structured, unstructured, and semi-structured data. Velocity takes into account how quickly data is manufactured, issued, and dealt with. Variabilit…

Cited by 15 publications (10 citation statements)
References: 27 publications
“…The application of GBDT algorithms for classification and regression tasks to many types of Big Data is well studied [11–13]. To the best of our knowledge, this is the first survey specifically dedicated to the CatBoost implementation of GBDTs.…”
Section: Introduction (mentioning)
confidence: 99%
“…For example, Spark MLlib’s GradientBoostedTrees module [15] is one such implementation. For examples of GBDT applications in Spark please see [16] and [11]. However, as long as the distributed framework supports a language that the Gradient Boosted Decision Tree implementation has an application programming interface available for, it is possible to use that implementation in the framework, thus freeing the user to select the most appealing GBDT implementation available.…”
Section: Introduction (mentioning)
confidence: 99%
“…Future work with the CSE-CIC-IDS2018 dataset can investigate other families of attacks, individual web attack labels (as compared to the combined web attack labels used in this study), and the effects of rarity [50]. Other datasets could also be included for future work, as well as additional performance metrics, classifiers, and sampling techniques.…”
Section: Discussion (mentioning)
confidence: 99%
“…Future work can explore Naive Bayes and its noteworthy classification performance when no sampling is applied under conditions of severe class imbalance and rarity (as well as its insensitivity to improvements when applying RUS). Other datasets could also be included for future work, as well as additional performance metrics, families of attacks, classifiers, sampling techniques, and rarity levels [56].…”
Section: Discussion (mentioning)
confidence: 99%