2017
DOI: 10.1109/tpds.2016.2603511

A Parallel Random Forest Algorithm for Big Data in a Spark Cloud Computing Environment

Abstract: With the emergence of the big data age, the issue of how to obtain valuable knowledge from a dataset efficiently and accurately has attracted increasing attention from both academia and industry. This paper presents a Parallel Random Forest (PRF) algorithm for big data on the Apache Spark platform. The PRF algorithm is optimized based on a hybrid approach combining data-parallel and task-parallel optimization. From the perspective of data-parallel optimization, a vertical data-partitioning method is performe…

Cited by 369 publications (115 citation statements)
References 28 publications
“…2) Random Forest Algorithm: Random Forest algorithm is an ensemble classifier algorithm which uses 'bagging' to create multiple decision trees and classifies new incoming data instance to a class or group [11]. The trees are built not pruned [12].…”
Section: B Algorithms 1) K-nearest Neighbors Algorithm: the K-nearest (mentioning)
confidence: 99%
“…The imaging spectrometer acquiring the hyperspectral image data cannot be directly applied and classified analysis, which needs to be analyzed. Therefore, the preprocessing of hyperspectral remote sensing image in general includes atmospheric radiation correction, geometry correction, and noise removal [45, 55-57]. In the preprocessing of hyperspectral image, radiometric correction is the mainly steps.…”
Section: Methods (mentioning)
confidence: 99%
“…The bat algorithm combined the major advantages between particle swarm optimization and genetic algorithm and Harmony Search is applied to yield optimal parameters in the DBN. Second, random forest is suitable for handling large data due to its parallelization [28]. It has been combined with the Spark [28], heuristic bootstrap sampling method [29], kernel principal component analysis [30], and other technologies to perform fault diagnosis and regression tasks [31,32].…”
Section: Mathematical Problems In Engineering (mentioning)
confidence: 99%
“…Second, random forest is suitable for handling large data due to its parallelization [28]. It has been combined with the Spark [28], heuristic bootstrap sampling method [29], kernel principal component analysis [30], and other technologies to perform fault diagnosis and regression tasks [31,32]. Owing to the improvement of the forecasting accuracy for high-dimensional and large-scale wind turbine data, we propose an optimized random forest method which consists of a dimension reduction procedure and the weighted voting process for the short-term WPF.…”
Section: Mathematical Problems In Engineering (mentioning)
confidence: 99%