The most common machine-learning methods solve supervised and unsupervised problems based on datasets where the problem’s features belong to a numerical space. However, many problems often include data where numerical and categorical data coexist, which represents a challenge to manage them. To transform categorical data into a numeric form, preprocessing tasks are compulsory. Methods such as one-hot and feature-hashing have been the most widely used encoding approaches at the expense of a significant increase in the dimensionality of the dataset. This effect introduces unexpected challenges to deal with the overabundance of variables and/or noisy data. In this regard, in this paper we propose a novel encoding approach that maps mixed-type data into an information space using Shannon’s Theory to model the amount of information contained in the original data. We evaluated our proposal with ten mixed-type datasets from the UCI repository and two datasets representing real-world problems obtaining promising results. For demonstrating the performance of our proposal, this was applied for preparing these datasets for classification, regression, and clustering tasks. We demonstrate that our encoding proposal is remarkably superior to one-hot and feature-hashing encoding in terms of memory efficiency. Our proposal can preserve the information conveyed by the original data.
The Vector Control Program in Mexico has developed operational research strategies to identify entomological and sociodemographic parameters associated with dengue transmission in order to direct targeted actions and reduce transmission. However, these strategies have limitations in establishing their relationship with landscape analysis and dengue transmission. This study provides a proof of concept of the use of unmanned aerial vehicle technology as a possible way to collect spatial information of the landscape in real time through multispectral images for the generation of a multivariate predictive model that allows for the establishment of a risk index relating sociodemographic variables with the presence of the vector in its different larval, pupal, and adult stages. With flight times of less than 30 min, RGB orthomosaics were built, where houses, roads, highways, rivers, and trails are observed in detail, as well as in areas with a strong influence of vegetation, detailing the location of the roofs or the infrastructure of the house, grass, bushes, and trees of different dimensions, with a pixel resolution level of 5 centimeters. For the risk index, we developed a methodology based on partial least squares (PLS), which takes into account the different type of variables are involved and the geographic distribution of the houses as well. Results show the spatial pattern of downtown low-risk housing, which increases as we approach the outskirts of the town. The predictive model of dengue transmission risk developed through orthomosaics can help decision makers to plan control and public health activities.
The automatic extraction of geospatial information is an important aspect of data mining. Computer systems capable of discovering geographic information from natural language involve a complex process called geoparsing, which includes two important tasks: geographic entity recognition and toponym resolution. The first task could be approached through a machine learning approach, in which case a model is trained to recognize a sequence of characters (words) corresponding to geographic entities. The second task consists of assigning such entities to their most likely coordinates. Frequently, the latter process involves solving referential ambiguities. In this paper, we propose an extensible geoparsing approach including geographic entity recognition based on a neural network model and disambiguation based on what we have called dynamic context disambiguation. Once place names are recognized in an input text, they are solved using a grammar, in which a set of rules specifies how ambiguities could be solved, in a similar way to that which a person would utilize, considering the context. As a result, we have an assignment of the most likely geographic properties of the recognized places. We propose an assessment measure based on a ranking of closeness relative to the predicted and actual locations of a place name. Regarding this measure, our method outperforms OpenStreetMap Nominatim. We include other assessment measures to assess the recognition ability of place names and the prediction of what we called geographic levels (administrative jurisdiction of places).
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.