The term 'Big Data' has spread rapidly in the framework of Data Mining and Business Intelligence. This new scenario can be defined by means of those problems that cannot be effectively or efficiently addressed using the standard computing resources that we currently have. We must emphasize that Big Data does not just imply large volumes of data but also the necessity for scalability, i.e., to ensure a response in an acceptable elapsed time. When scalability is considered, traditional parallel-type solutions are usually contemplated, such as the Message Passing Interface or high-performance and distributed Database Management Systems. Nowadays there is a new paradigm that has gained popularity over the latter due to the number of benefits it offers. This model is Cloud Computing, and among its main features we must stress its elasticity in the use of computing resources and space, less management effort, and flexible costs. In this article, we provide an overview of the topic of Big Data, and how the current problem can be addressed from the perspective of Cloud Computing and its programming frameworks. In particular, we focus on those systems for large-scale analytics based on the MapReduce scheme ... of data are recorded every day, resulting in a large volume of information; this incoming information arrives at a high rate and its processing involves real-time requirements, implying a high velocity; we may find a wide variety of structured, semi-structured, and unstructured data; and data have to be cleaned before the integration into the system in order to maintain veracity. 1 This 4V property is one of the most widespread definitions of what is known as the Big Data problem, 2,3 which has become a hot topic of interest within academia and corporations.
The current explosion of data that is being generated is due to three main reasons 4: (1) hundreds of applications such as mobile sensors, social media services, and other related devices are collecting information continuously; (2) storage capacity has improved so much that collecting data is cheaper than ever, making it preferable to buy more storage space rather than deciding what to delete; (3) Machine Learning and information retrieval approaches have improved significantly in recent years, thus enabling the acquisition of a higher degree of knowledge from data. 5,6 Corporations are aware of these developments. Gaining critical business insights by querying and analyzing such massive amounts of data is becoming a necessity. This issue is known as Business Intelligence (BI), 7,8 which refers to decision support systems that combine data gathering, data storage, and knowledge management with analysis to provide input to the decision process. 9 Regarding the former issues, a new concept appears as a more general field, integrating data warehousing, Data Mining (DM), and data visualization for Business Analytics. This topic is known as Data Science. 10,11 The data management and analytics carried out in conventional database systems (and other rela...
Big Data applications have been emerging in recent years, and researchers from many disciplines are aware of the considerable advantages of extracting knowledge from this type of problem. However, traditional learning approaches cannot be directly applied due to scalability issues. To overcome this issue, the MapReduce framework has arisen as a "de facto" solution. Basically, it carries out a "divide-and-conquer" distributed procedure in a fault-tolerant way that is suited to commodity hardware. As this is still a recent discipline, little research has been conducted on imbalanced classification for Big Data. The reasons behind this are mainly the difficulties in adapting standard techniques to the MapReduce programming style. Additionally, inner problems of imbalanced data, namely lack of data and small disjuncts, are accentuated during the data partitioning to fit the MapReduce programming style. This paper is designed under three main pillars. First, to present the first outcomes for imbalanced classification in Big Data problems, introducing the current
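The divide-and-conquer procedure described above, and the way partitioning can accentuate the lack of minority-class data, can be illustrated with a minimal single-process Python sketch. This is not Hadoop code: the function names (`split_into_partitions`, `map_partition`, `reduce_counts`, `class_counts`) and the toy records are assumptions made purely for illustration.

```python
from collections import Counter
from functools import reduce


def split_into_partitions(records, num_partitions):
    """Divide step: deal records round-robin into num_partitions chunks."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def map_partition(partition):
    """Map step: count class labels inside one partition, independently."""
    return Counter(label for _, label in partition)


def reduce_counts(left, right):
    """Reduce step: merge two partial label counts into one."""
    return left + right


def class_counts(records, num_partitions):
    """Run the toy map and reduce phases; return partial and merged counts."""
    partitions = split_into_partitions(records, num_partitions)
    partial = [map_partition(p) for p in partitions]
    total = reduce(reduce_counts, partial, Counter())
    return partial, total


# Hypothetical toy data: 12 records, only 2 of them in the positive class.
records = [(i, "pos" if i in (0, 5) else "neg") for i in range(12)]
partial, total = class_counts(records, num_partitions=4)

for index, counts in enumerate(partial):
    print(f"partition {index}: {dict(counts)}")
print(f"merged: {dict(total)}")
```

In this toy run, two of the four partitions contain no positive example at all, so a learner trained on one of those partitions alone would never see the minority class.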
The application of data mining and machine learning techniques to biological and biomedical data continues to be a ubiquitous research theme in current bioinformatics. The rapid advances in biotechnology are allowing us to obtain and store large quantities of data about cells, proteins, genes, etc., that should be processed. Moreover, in many of these problems, such as contact map prediction, the problem tackled in this paper, it is difficult to collect representative positive examples. Learning under these circumstances, known as imbalanced big data classification, may not be straightforward for most of the standard machine learning methods. In this work we describe the methodology that won the ECBDL'14 big data challenge for a bioinformatics big data problem. This algorithm, named ROSEFW-RF, is based on several MapReduce approaches to (1) balance the class distribution through random oversampling, (2) detect the most relevant features via an evolutionary feature weighting process and a thresh-
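The first step listed above, balancing the class distribution through random oversampling, can be sketched in plain Python as follows. This is a single-machine illustration and not the ROSEFW-RF implementation, which the abstract describes as MapReduce-based. The function name `random_oversample`, the fixed seed, and the toy dataset are assumptions.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest one."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatable runs
    counts = Counter(labels)
    target = max(counts.values())

    out_samples, out_labels = list(samples), list(labels)
    for cls, count in counts.items():
        # Pool of original samples belonging to this class.
        pool = [s for s, y in zip(samples, labels) if y == cls]
        # Draw with replacement to fill the gap up to the majority size.
        extra = [rng.choice(pool) for _ in range(target - count)]
        out_samples.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_samples, out_labels


# Hypothetical toy dataset: four negative examples and one positive example.
samples = [[0.1], [0.2], [0.3], [0.4], [0.9]]
labels = [0, 0, 0, 0, 1]

balanced_samples, balanced_labels = random_oversample(samples, labels)
print(Counter(balanced_labels))
```

Random oversampling only replicates existing minority examples, so it evens out the class counts without adding new information about the minority class.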