Background
The accurate prediction of biological features from genomic data is paramount for precision medicine and sustainable agriculture. For decades, neural network models have been widely used in fields such as computer vision, astrophysics and targeted marketing, given their prediction accuracy and robust performance in big data settings. Yet neural network models have not made a successful transition into the medical and biological world because of ubiquitous characteristics of biological data such as modest sample sizes, sparsity, and extreme heterogeneity.
Results
Here, we investigate the robustness, generalization potential and prediction accuracy of widely used convolutional neural network and natural language processing models on a variety of heterogeneous genomic datasets. Overall, recurrent neural network models outperform convolutional neural network models in prediction accuracy, resistance to overfitting and transferability across the datasets under study.
Conclusions
While the prospect of a robust out-of-the-box neural network model remains out of reach, we identify certain model characteristics that translate well across datasets and could serve as a baseline model for translational researchers.
The emergence of cloud computing in the big data era has had a substantial impact on our daily lives. Conventional reliability-aware workflow scheduling (RWS) can improve or maintain system reliability through fault-tolerance techniques such as replication and checkpointing-based recovery. However, the fault-tolerance techniques used in RWS inevitably result in higher system energy consumption, longer execution time, and worse thermal profiles, which in turn shorten hardware lifespan. To mitigate the lifetime-energy-makespan issues of RWS in cloud computing systems for big data, we propose a novel methodology that decomposes this complex problem. In this methodology, we provide three procedures that address the energy consumption, execution makespan, and hardware lifespan issues in cloud systems executing real-time workflow applications. We run numerous simulation experiments to validate the proposed methodology for RWS. Simulation results show that the proposed RWS strategies outperform the compared approaches in reducing energy consumption, shortening execution makespan, and prolonging system lifespan while maintaining high reliability. The improvements in energy saving, makespan reduction, and lifespan increase can be up to 23.8%, 18.6%, and 69.2%, respectively. The results also show the potential of the proposed method for developing a distributed analysis system for big data that serves applications such as satellite signal processing and earthquake early warning.