In the era of the Internet of Things and big data, we face the management of a flood of information. The complexity and volume of data presented to the decision-maker are enormous, and existing methods often fail to derive nonredundant information quickly. Selecting the most satisfactory set of solutions is therefore often a struggle. This article investigates the possibility of using the entropy measure as an indicator of data difficulty. To do so, we focus on real-world data covering various fields: markets (the real estate market and financial markets), sports data, fake news data, and more. The problem is twofold. First, since we deal with unprocessed, inconsistent data, additional preprocessing is necessary. Second, we use the entropy-based measure to capture the nonredundant, noncorrelated core information from the data. The research uses well-known classification algorithms to investigate the quality of solutions derived from the initial preprocessing and from the information indicated by the entropy measure. Finally, the top 25% of attributes, ranked by the entropy measure, are selected, the whole classification procedure is run again on them, and the results are compared.
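The selection step described above can be illustrated with a minimal sketch. The abstract does not say how the entropy measure ranks attributes, so this example assumes information gain (the reduction in class-label entropy after splitting on an attribute) as the ranking criterion, and it assumes attributes are already discrete. The function names, the toy table, and the labels are hypothetical and are not taken from the article.

```python
import math
from collections import Counter, defaultdict


def shannon_entropy(values):
    """Shannon entropy, in bits, of a sequence of discrete values."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def information_gain(column, labels):
    """Label entropy minus the label entropy conditioned on the attribute."""
    groups = defaultdict(list)
    for value, label in zip(column, labels):
        groups[value].append(label)
    n = len(labels)
    conditional = sum(len(g) / n * shannon_entropy(g) for g in groups.values())
    return shannon_entropy(labels) - conditional


def select_top_fraction(table, labels, fraction=0.25):
    """Names of the top `fraction` of attributes, ranked by information gain.

    Assumption: information gain stands in for the article's entropy measure.
    """
    scores = {name: information_gain(col, labels) for name, col in table.items()}
    k = max(1, math.ceil(fraction * len(scores)))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


# Hypothetical toy data: four discrete attributes, four rows, binary labels.
toy_labels = [0, 0, 1, 1]
toy_table = {
    "a": [0, 0, 1, 1],  # determines the label exactly
    "b": [0, 1, 0, 1],  # independent of the label
    "c": [0, 0, 0, 0],  # constant
    "d": [0, 0, 0, 1],  # partially informative
}
selected = select_top_fraction(toy_table, toy_labels, fraction=0.25)
```

With four attributes and a fraction of 0.25, one attribute is kept. Continuous attributes would need to be discretized before a ranking of this kind could be applied, and a different entropy-based criterion would change only the scoring function.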
The analysis of sports data and the use of machine learning to predict sports results are increasingly popular topics of research and application. The main problem, apart from choosing the right algorithm, is obtaining data that allow for effective prediction. The article presents a comprehensive KDD (Knowledge Discovery in Databases) approach to preparing data for sports prediction. The first part of the article covers the subject of KDD and sports data. The next section presents an approach to developing a dataset on top football leagues. The developed datasets are the main contribution of the article and have been made publicly available to the research community. The latter part of the article presents experimental results obtained with heterogeneous groups of classifiers on the developed datasets.