Abstract. Large-scale feature selection is an important area of big data research that addresses real-world problems, such as those in bioinformatics, where huge amounts of data must be processed. Existing feature selection algorithms degrade significantly in efficiency, or become inapplicable altogether, when the data size exceeds hundreds of gigabytes, because most of them are designed for a centralized computing architecture. Distributed computing techniques such as MapReduce can be applied to handle such very large data. Our approach scales an existing feature selection method, which combines Kmeans clustering and Signal to Noise Ratio (SNR) ranking with Binary Particle Swarm Optimization (BPSO) as the optimization technique. The proposed method is divided into two stages. In the first stage, we use parallel Kmeans on MapReduce to cluster the features, and then apply an iterative MapReduce job that implements parallel SNR ranking for each cluster. The top-ranked feature from each cluster is then selected, and these features are gathered into a new feature subset. In the second stage, the new feature subset is used as input to the proposed MapReduce-based BPSO, which produces an optimized feature subset. The proposed method is implemented in a distributed environment, and its efficiency is illustrated by analyzing practical problems such as biomarker discovery.
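The per-cluster SNR ranking step of the first stage can be illustrated with a minimal single-machine sketch. It is not the authors' implementation: the abstract does not define SNR, so the common two-class form |mu0 - mu1| / (sd0 + sd1) is assumed here, the cluster assignment `cluster_of` is taken as already produced by the Kmeans step, and `map_phase`, `reduce_phase`, and the toy data are hypothetical names and values that only mimic the MapReduce pattern in plain Python. The BPSO stage is not shown.

```python
from statistics import mean, pstdev


def snr(values, labels):
    """Assumed two-class signal-to-noise ratio: |mu0 - mu1| / (sd0 + sd1)."""
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    denom = pstdev(a) + pstdev(b)
    # Guard against constant features, whose spread is zero in both classes.
    return abs(mean(a) - mean(b)) / denom if denom > 0 else 0.0


def map_phase(features, labels, cluster_of):
    """Emit one (cluster_id, (feature_name, snr_score)) pair per feature."""
    for name, values in features.items():
        yield cluster_of[name], (name, snr(values, labels))


def reduce_phase(pairs):
    """Keep the highest-scoring feature seen for each cluster."""
    best = {}
    for cluster_id, (name, score) in pairs:
        if cluster_id not in best or score > best[cluster_id][1]:
            best[cluster_id] = (name, score)
    return {cluster_id: name for cluster_id, (name, _) in best.items()}


# Toy data (hypothetical): four features measured on four samples.
features = {
    "g1": [1.0, 1.2, 5.0, 5.2],
    "g2": [1.0, 3.0, 2.0, 4.0],
    "g3": [0.5, 0.4, 0.6, 0.5],
    "g4": [2.0, 2.1, 0.1, 0.0],
}
labels = [0, 0, 1, 1]                              # class label per sample
cluster_of = {"g1": 0, "g2": 0, "g3": 1, "g4": 1}  # assumed Kmeans output

selected = reduce_phase(map_phase(features, labels, cluster_of))
print(selected)
```

In a real MapReduce job the pairs emitted by the map step would be shuffled by cluster id before reaching the reducers; here the generator is simply consumed in order, which is enough to show how one top-ranked feature per cluster forms the reduced subset passed on to the second stage.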
Big data brings new challenges in security, involving the three aspects of security (confidentiality, availability, integrity) as well as privacy. These challenges stem from the 5V characteristics of big data: velocity, variety, volume, value, and veracity. They also depend on several levels of security: network, data, application, and authentication. At the same time, big data holds promise for security: the huge amount of data provides more security information, such as data logs, and big data analysis can be applied to security. Many theories for big data security have been proposed in the literature, covering the different aspects of security and privacy. Recently, different schemes and frameworks based on different security theories have been introduced to reach a high level of security in big data. In this paper, we discuss different challenges in big data security and privacy, and we introduce recent security theories and works in this field. A comparative study of the latest advances in big data security and privacy is presented.