Reference architectures for big data and machine learning include not only interconnected building blocks but also important considerations for, among others, scalability, manageability, and usability. Even when leveraging such reference architectures, the automated deployment of distributed toolsets and frameworks on various clouds remains challenging due to the diversity of technologies and protocols. The paper focuses on the widespread Apache Spark cluster with Jupyter as the addressed framework, and on the Occopus cloud-agnostic orchestrator tool for automating its deployment and maintenance stages. The presented approach has been demonstrated and validated with a new, promising text classification application on the Hungarian academic research infrastructure, the OpenStack-based MTA Cloud. The paper explains the concept and the applied components, and illustrates their usage with real use-case measurements.

KEYWORDS: big data, cloud, machine learning, parallel and distributed execution, reference architectures, text classification

1 INTRODUCTION

Research in different scientific fields (e.g., natural and social sciences) often requires huge computational resources and storage capacity to handle Big Data problems. Traditional sequential data processing algorithms are not sufficient to analyze such large volumes of data. Efficient processing and analysis call for new approaches, techniques, and tools. Moreover, cloud infrastructures and services are becoming ever more popular and are now widely used to address the computation and storage requirements of many scientific and commercial Big Data applications. Their widespread usage is a consequence of the dynamic and scalable nature of the services maintained by cloud providers.

However, a data scientist faces several challenges when planning the use or deployment of any Big Data platform on cloud(s) [1]. The selection of the appropriate cloud provider(s) is always a tiresome process, since several factors have to be considered even when only a generic Infrastructure-as-a-Service (IaaS) provider is required: private (e.g., Agrodat Cloud [2]), federated (e.g., MTA Cloud [3] or the pan-European EGI FedCloud [4]), or public cloud (e.g., Amazon AWS [5]).

The Hungarian Academy of Sciences (MTA) provides free IaaS cloud services (MTA Cloud) for research communities, along with easy-to-use, dynamic infrastructures adapted to actual project requirements. MTA Cloud was established to accelerate research for the scientists of MTA. Nearly 100 projects have been deployed on MTA Cloud since its opening, and a growing number of projects need to use Big Data and machine learning applications.

However, the many artificial intelligence (AI) tools available for clouds are very complex, and their proper deployment and configuration
Reference architectures for Big Data, machine learning, and stream processing include not only recommended practices and interconnected building blocks but also considerations for scalability, availability, manageability, and security. However, the automated deployment of multi-VM platforms on various clouds based on such reference architectures may raise several issues. The paper focuses on the widespread Apache Spark Big Data platform as the baseline and on the Occopus cloud-agnostic orchestrator tool. The set of new-generation reference architectures is configurable through human-readable descriptors according to the available resources and cloud providers, and offers various components such as Jupyter Notebook, RStudio, HDFS, and Kafka. These pre-configured reference architectures can be deployed automatically and on demand, even by the data scientist, using a multi-cloud approach that covers a wide range of cloud systems such as Amazon AWS, Microsoft Azure, OpenStack, OpenNebula, and CloudSigma. Occopus enables the scaling of cluster-oriented components (such as Spark) of the instantiated reference architectures. The presented solution was successfully used in the Hungarian Comparative Agendas Project (CAP) by the Institute for Political Science to classify newspaper articles.
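To make the text classification task mentioned above more concrete, the following is a minimal sketch of a bag-of-words multinomial naive Bayes classifier in plain Python. It is an illustration only: the abstract does not state which algorithm the project used, the sketch does not use Spark, and the class name `NaiveBayesTextClassifier`, the topic labels, and the four training sentences are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents):
        # documents: iterable of (text, label) pairs
        self.class_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()
        for text, label in documents:
            self.class_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods of each word.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy corpus; these are NOT articles from the CAP project.
training = [
    ("parliament passed the new budget law", "economy"),
    ("central bank raised interest rates", "economy"),
    ("hospital waiting lists grow longer", "health"),
    ("new vaccine programme starts in hospitals", "health"),
]

model = NaiveBayesTextClassifier().fit(training)
prediction = model.predict("bank interest rates and the budget")
print(prediction)
```

On a Spark cluster the same idea would typically be expressed with a distributed machine learning library rather than hand-written counting; the sketch only shows the shape of the problem, mapping an article's words to a topic label.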
In this paper, we present a system for human physical Activity Recognition (AR) using a smartphone with embedded sensors. The paper addresses the question of whether there is a convenient way to predict human activities from data collected by the smartphone's embedded gyroscope and accelerometer. The computational background of this work is based on self-learning machine learning methods. To train the machine learning algorithms, the University of California, Irvine (UCI) dataset was used, and the different models were compared. After the best model was selected, further modifications were suggested to improve its accuracy. In the end, an accuracy of 96.88% was reached.
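The compare-then-select workflow described in this abstract can be sketched as follows. This is a minimal illustration under loud assumptions: the data are synthetic two-dimensional points standing in for sensor features (not the UCI dataset), the two classifiers are simple placeholders (not the models the authors compared), and every name in the code is hypothetical.

```python
import random
from collections import Counter


def make_toy_data(n_per_class, seed=0):
    """Generate synthetic 2-D features per activity (hypothetical, not UCI data)."""
    rng = random.Random(seed)
    # Assumed class centres in an invented feature space.
    centres = {"walking": (1.0, 0.5), "sitting": (0.0, 0.0), "standing": (0.1, 1.0)}
    data = []
    for label, (cx, cy) in centres.items():
        for _ in range(n_per_class):
            data.append(((rng.gauss(cx, 0.15), rng.gauss(cy, 0.15)), label))
    rng.shuffle(data)
    return data


class MajorityClassifier:
    """Baseline: always predicts the most frequent training label."""

    def fit(self, samples):
        self.label = Counter(lbl for _, lbl in samples).most_common(1)[0][0]
        return self

    def predict(self, features):
        return self.label


class NearestCentroidClassifier:
    """Predicts the label whose training-set mean is closest to the sample."""

    def fit(self, samples):
        sums = {}
        for (x, y), lbl in samples:
            sx, sy, n = sums.get(lbl, (0.0, 0.0, 0))
            sums[lbl] = (sx + x, sy + y, n + 1)
        self.centroids = {lbl: (sx / n, sy / n) for lbl, (sx, sy, n) in sums.items()}
        return self

    def predict(self, features):
        x, y = features
        return min(
            self.centroids,
            key=lambda lbl: (x - self.centroids[lbl][0]) ** 2
            + (y - self.centroids[lbl][1]) ** 2,
        )


def accuracy(model, samples):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for feats, lbl in samples if model.predict(feats) == lbl)
    return correct / len(samples)


data = make_toy_data(100)
split = int(0.7 * len(data))          # 70/30 train/test split
train, test = data[:split], data[split:]

models = {
    "majority": MajorityClassifier(),
    "nearest_centroid": NearestCentroidClassifier(),
}
# Train every candidate on the same split and score it on held-out data.
scores = {name: accuracy(m.fit(train), test) for name, m in models.items()}
best_name = max(scores, key=scores.get)
print(scores, best_name)
```

Scoring all candidates on the same held-out split keeps the comparison fair; any later modification of the selected model would be evaluated the same way.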