Tight Coupling of R and Distributed Linear Algebra for High-Level Programming with Big Data

Schmidt, Drew; Ostrouchov, George; Chen, Wei-Chen; Patel, Pragneshkumar

doi:10.1109/sc.companion.2012.113

Cited by 10 publications

(9 citation statements)

References 5 publications


“…Our results contrast with the recommendation given by Schmidt et al, 2012c, Schmidt et al, 2012a to partition data with square blocking factors. The reason for that is likely due to the fact that the column dimension of the blocking factors was chosen equal to the number of variables related to each variable Yj, thus facilitating the computation of the distributed matrix algebra operations considered in the PLS algorithm.…”

Section: Results (contrasting)

confidence: 99%

“…Therefore, to properly establish the communicator – that is the object “to define which collection of processes may communicate with each other” – is of paramount importance. Since pbdR is focused on the SPMD programming paradigm (Chen et al, 2012a, Schmidt et al, 2012c, Ostrouchov et al, 2013), users need to initialize the communicator(s) at the beginning of a script with the instruction init(). This enables the initialization of the processors (or task IDs) “to specify the source and destination of messages”.…”

Section: Introduction (mentioning)

confidence: 99%

“…On the other hand, pbdBASE presents the necessary wrappers or interfaces and routines for communication with low-level routines written in Fortran and available in ScaLAPACK (Schmidt et al, 2012b). All pbdR libraries “install and run on a single machine as well as on shared memory and distributed clusters” (Schmidt et al, 2012c, p. 811).…”

Section: Introduction (mentioning)

confidence: 99%


Big data in multi-block data analysis: An approach to parallelizing Partial Least Squares Mode B algorithm

Martínez-Ruiz

Montañola‐Sales

2019

Heliyon


Partial Least Squares (PLS) Mode B is a multi-block method and a tightly coupled algorithm for estimating structural equation models (SEMs). Describing key aspects of parallel computing, we approach the parallelization of the PLS Mode B algorithm to operate on large distributed data. We show the scalability and performance of the algorithm at a very fine-grained level thanks to the versatility of pbdR, a R-project library for parallel computing. We vary several factors under different data distribution schemes in a supercomputing environment. Shorter elapsed times are obtained for the square-blocking factor 16 × 16 using a grid of processors as square as possible and non-square blocking factors 1000 × 4 and 10000 × 4 using an one-column grid of processors. Depending on the configuration, distributing data in a larger number of cores allows reaching speedups of up to 121 over the CPU implementation. Moreover, we show that SEMs can be estimated with big data sets using current state-of-the-art algorithms for multi-block data analysis.
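The blocking factors and processor grids mentioned in this abstract can be illustrated with a minimal sketch of the ScaLAPACK-style 2D block-cyclic mapping. The sketch assumes zero-based indices and that the first block is placed on process (0, 0). The function names `owner` and `local_counts` and the matrix sizes are hypothetical and chosen only for illustration; they are not taken from the cited papers or from pbdR.

```python
from collections import Counter


def owner(i, j, block, grid):
    """Return the (process_row, process_col) that owns global entry (i, j).

    block = (mb, nb) is the blocking factor; grid = (pr, pc) is the process grid.
    Blocks are dealt out cyclically along each dimension.
    """
    mb, nb = block
    pr, pc = grid
    return ((i // mb) % pr, (j // nb) % pc)


def local_counts(shape, block, grid):
    """Count how many matrix entries each process holds for a given layout."""
    m, n = shape
    counts = Counter()
    for i in range(m):
        for j in range(n):
            counts[owner(i, j, block, grid)] += 1
    return counts


if __name__ == "__main__":
    shape = (64, 8)  # a tall, narrow matrix (illustrative size only)
    # Square blocks on a square grid: the 8 columns fit inside one block,
    # so only the first process column receives data.
    print(sorted(local_counts(shape, block=(16, 16), grid=(2, 2)).items()))
    # Small blocks on a one-column grid: rows are spread over all four processes.
    print(sorted(local_counts(shape, block=(4, 4), grid=(4, 1)).items()))
```

In this toy case the square layout leaves two of the four processes without data, while the one-column grid splits the rows evenly. This is only a load-balance illustration and says nothing about the elapsed times reported in the paper.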


“…Methods previously discussed can be adopted, such as cluster analysis and OLAP. Statistical computing packages like R (language) can be useful as well (Schmidt, Ostrouchov, Chen, & Patel, 2012). Some data mining tools and software providers include: enterprise miner from SAS, intelligent miner from IBM, setminer from SGI, clementine from SPSS, DB miner from DB Miner Technology Inc., PRW from Unica Technolgies Inc., Darwin from thinking machines, greenplum from EMC, etc.…”

Section: Expansion Of Current AIS (mentioning)

confidence: 99%

Integrating Data Mining Into Managerial Accounting System: Challenges and Opportunities

Wang

Wang

2016

CBR


Data mining involves extracting information from large data sets, discovering the hidden relationships and unknown dependencies, and supporting strategic decision-making tasks. The alignment of data mining and business would bring benefits to the organization's management. The study investigated the adoption of data mining technologies in managerial accounting system, concentrating on the challenges and opportunities. The research showed that with the technology adoption, managerial functions could be improved and current information system could be upgraded. Since the technical progresses are reshaping the world of business and accountancy, it is significant for accountants and finance professionals to exploit information technologies.


“…However, this approach does not exploit efficient memory sharing in the cloud. To solve the low programmability of traditional distributed computing environments, pbdR [9] tightly couples R with the MPI libraries, which enables developing high-level distributed data parallelism in R and also utilizing HPC platforms, but suffers the fault tolerance problems. SparkR [10] is an R package that provides a lightweight frontend to use Apache Spark from R. It exposes the low-level Spark API through the RDD class and allows users to interactively run jobs from the R shell on a cluster.…”

Section: Related Work (mentioning)

confidence: 99%

Unified Programming Model and Software Framework for Big Data Machine Learning and Data Analytics

Tang

Dong

et al.

2015

2015 IEEE 39th Annual Computer Software and Applications Conference


In a new era of Big Data, the rapid growth of the applications, such as social media and web-search, requires efficient and scalable machine learning and statistical analytical algorithms. However, there lacks easy-to-use and efficient software frameworks or systems that can support fast development of such big data analytical algorithms. To solve these problems, we propose Octopus, an easy-to-use and efficient analytical system for big data. Octopus allows data analysts conduct complex data analytics for big data with traditional programming languages and methods in an easy and efficient way. To achieve the goal of ease-to-use, we propose a matrix-based unified programming model, which is the core of many data-intensive statistical applications such as numerical analysis and data mining. Further, in order to improve the performance, the Octopus software framework adopts various distributed computing platforms, including Hadoop MapReduce, Spark and MPI. On these computing platforms, we design several parallel matrix computation algorithms, which are suitable for various scenarios. Finally, the features of Octopus are encapsulated into a library with matrix-based APIs and exposed to users as an R package. R is a widely used statistical programming language and supports diversified data analysis tasks through extension packages. Experimental results show that Octopus achieves efficient performance and near linear scalability.

