Ensemble Learning for Large Scale Virtual Screening on Apache Spark

Sid, Karima; Batouche, Mohamed

doi:10.1007/978-3-319-89743-1_22

Cited by 5 publications

(6 citation statements)

References 20 publications


“…Some papers [27,28] used deep learning to predict drug activity. This paper also investigates other works that used big data platforms to predict activity in virtual screening [31,32].…”

Section: 3 Results and Discussion (mentioning)

confidence: 99%

“…ET algorithm gives best accuracy results of 90% and precision of 0.86, but it takes 75 seconds. Apache Spark is used, as in [32], for different big data analytics methods (MLP, DT, NB, SVM and ET). ET gives the highest accuracy (94%) and precision (0.93), but it takes longer time than DT (Table 6).…”

Section: 3 Results and Discussion (mentioning)

confidence: 99%
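The statement above reports accuracy and precision without defining them. As a minimal sketch, assuming the standard binary-classification definitions (accuracy = correct predictions / all predictions, precision = true positives / predicted positives), the two metrics can be computed as follows. The function name and the toy labels are hypothetical and are not taken from the cited papers.

```python
def accuracy_and_precision(y_true, y_pred, positive=1):
    """Return (accuracy, precision) for binary labels.

    Assumes the usual definitions; `positive` marks the active class.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    total = len(pairs)
    accuracy = (tp + tn) / total if total else 0.0
    # Precision is undefined when nothing is predicted positive; report 0.0 here.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return accuracy, precision


# Hypothetical labels for eight ligands (1 = active, 0 = inactive).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
acc, prec = accuracy_and_precision(y_true, y_pred)
```

On imbalanced screening data the two numbers can diverge sharply, which is presumably why the citing paper reports both.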

“…Then, the authors selected three algorithms random forest, multilayer Perceptron, and naive based to develop the ensemble classifier and calculate the activity of ligand. In [32], authors offered a pretty new method that is based on Apache Spark and the ensemble learning model to upgrade the performance of largescale VS processes. Three classifiers are used in combination, which include SVM, multi-layer perceptron, and DT, to create the ensemble learning approach, in which the method of aggregation had the common vote.…”

Section: 2 Big Data Analytics Framework-based Solutions (mentioning)

confidence: 99%
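The aggregation step described in the statement above (three classifiers combined by a common vote) can be illustrated with a short sketch. This is not the authors' Spark implementation: it assumes each classifier has already produced one label per ligand, and the function name, the tie-breaking rule, and the toy predictions are all hypothetical.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per record.

    predictions_per_model: list of equal-length lists, one per classifier.
    Ties go to the label voted by the earliest model in the list
    (an arbitrary choice made for this sketch).
    """
    combined = []
    for votes in zip(*predictions_per_model):
        counts = Counter(votes)
        best = max(counts.values())
        # Pick the first label, in model order, that reaches the top count.
        combined.append(next(v for v in votes if counts[v] == best))
    return combined


# Hypothetical outputs of three classifiers (e.g. SVM, MLP, DT) on four ligands.
svm_pred = [1, 0, 1, 0]
mlp_pred = [1, 1, 0, 0]
dt_pred = [0, 1, 1, 0]
ensemble_pred = majority_vote([svm_pred, mlp_pred, dt_pred])
```

With an odd number of binary classifiers no ties occur, so the tie-breaking rule only matters for even-sized or multi-class ensembles.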


Traditional machine learning and big data analytics in virtual screening: a comparative study

Hussin, Omar, Abdel-Mageid, et al. 2020. IJACR


An unprecedented development in biomedical data has been observed in latest years. The capability to analyze a large portion of this data will offer many opportunities that will in turn affect the future of health care [1]. In this age, traditional storage and processing techniques are not sufficient to meet the demand and hence, computing techniques must scale to handle the huge volume of data. The main difficulty in managing these data is the speed at which they are generated, that is, data generation is much faster than the available computer resources for data analysis.


“…Let M be the set of remaining local models. For each data record in the input dataset D, it produces the predicted label by taking the majority voting [31] of the local models in M.…”

Section: Machine Learning Algorithms Under LOGO (mentioning)

confidence: 99%

MapReduce vs Non-MapReduce - Efficiency and Scalability in Big Data Computing

2023. Всемирный Конгресс (World Congress)


MapReduce is a popular distributed computing paradigm for processing big data in a massively parallel fashion. However, when it is used to implement and run highly iterative algorithms for analyzing distributedly stored big data, the MapReduce paradigm loses its computing efficiency and data scalability due to the communication costs occurring in iterations of the algorithm over the entire dataset. Non-MapReduce is an alternative computing paradigm that removes the communication costs when executing iterative algorithms on big data that is stored using the random sample partition data model. In the Non-MapReduce paradigm, a set of random sample data blocks are selected and loaded into the memory of computing nodes. An iterative algorithm is dispatched to each computing node and executed on local data set independently and in parallel without communications among the nodes. Afterwards, the local results are transferred to the master node for computing the final result. In this paper, we propose the LOGO computing framework, a core technology for Non-MapReduce paradigm, and demonstrate its computing performance with widely used supervised learning, unsupervised learning, and pattern mining algorithms. The experiment results show that LOGO outperforms the state-of-the-art Spark by orders of magnitude in terms of the running time. LOGO is scalable to terabyte-scale data sets with high-quality results.
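The local-then-aggregate pattern described in this abstract can be sketched in a few lines. This is only an illustration under stated assumptions and not the LOGO framework: the random sample blocks are approximated by shuffling and splitting an in-memory list, the iterative algorithm is a toy gradient descent for a one-parameter fit y ≈ w·x, every block is used rather than a selected subset, and the "master" simply averages the local results. All function names and parameters are hypothetical.

```python
import random


def random_sample_blocks(records, n_blocks, seed=0):
    """Shuffle the records and deal them into n_blocks disjoint blocks."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    return [shuffled[i::n_blocks] for i in range(n_blocks)]


def fit_slope_locally(block, steps=500, lr=0.1):
    """Iterative algorithm run on one block only: gradient descent on
    mean squared error for the model y = w * x."""
    w = 0.0
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in block) / len(block)
        w -= lr * grad
    return w


def run_without_communication(records, n_blocks=4):
    blocks = random_sample_blocks(records, n_blocks)
    # Each call touches only its own block; on a cluster these would run
    # in parallel on separate nodes with no exchange between iterations.
    local_results = [fit_slope_locally(block) for block in blocks]
    # The master combines the local results once, at the end.
    return sum(local_results) / len(local_results)


# Toy data following y = 3x exactly, with x in 0.1 .. 2.0.
data = [(i / 10, 3.0 * i / 10) for i in range(1, 21)]
estimate = run_without_communication(data)
```

The point of the sketch is the communication pattern: all iterations happen inside `fit_slope_locally`, and only one number per block travels to the aggregation step.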


“…Although resampling methods are usually used to solve problems with imbalances in the class, there is little defined strategy to identify the acceptable class distribution for a particular dataset [18]. As a result, the optimal class distribution differs from one dataset to another.…”

Section: Introduction (mentioning)

confidence: 99%

Handling Imbalance Classification Virtual Screening Big Data Using Machine Learning Algorithms

Hussin, Abdel-Mageid, Alkhalil, et al. 2021. Complexity


Virtual screening is the most critical process in drug discovery, and it relies on machine learning to facilitate the screening process. It enables the discovery of molecules that bind to a specific protein to form a drug. Despite its benefits, virtual screening generates enormous data and suffers from drawbacks such as high dimensions and imbalance. This paper tackles data imbalance and aims to improve virtual screening accuracy, especially for a minority dataset. For a dataset identified without considering the data’s imbalanced nature, most classification methods tend to have high predictive accuracy for the majority category. However, the accuracy was significantly poor for the minority category. The paper proposes a K-mean algorithm coupled with Synthetic Minority Oversampling Technique (SMOTE) to overcome the problem of imbalanced datasets. The proposed algorithm is named as KSMOTE. Using KSMOTE, minority data can be identified at high accuracy and can be detected at high precision. A large set of experiments were implemented on Apache Spark using numeric PaDEL and fingerprint descriptors. The proposed solution was compared to both no-sampling method and SMOTE on the same datasets. Experimental results showed that the proposed solution outperformed other methods.
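For readers unfamiliar with SMOTE, the oversampling idea mentioned in this abstract can be sketched as follows. This is a minimal illustration of the basic SMOTE interpolation step only, assuming Euclidean distance on numeric descriptors; it omits the K-means clustering stage of KSMOTE and is not the authors' Spark implementation. The function name, parameters, and toy points are hypothetical.

```python
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points by interpolating between a
    randomly chosen minority point and one of its k nearest minority
    neighbours. Requires at least two minority points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        # Nearest neighbours by squared Euclidean distance to the base point.
        neighbours = sorted(
            others,
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Hypothetical minority-class descriptors (two features per molecule).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like_oversample(minority, n_new=6)
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region the minority class already occupies.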

