1997
DOI: 10.1016/s0004-3702(97)00063-5

Selection of relevant features and examples in machine learning

Abstract: In this survey, we review work in machine learning on methods for handling data sets containing large amounts of irrelevant information. We focus on two key issues: the problem of selecting relevant features, and the problem of selecting relevant examples. We describe the advances that have been made on these topics in both empirical and theoretical work in machine learning, and we present a general framework that we use to compare different methods. We close with some challenges for future work in this area.

Cited by 2,649 publications (1,409 citation statements)
References 67 publications
“…However, such search is exponential in the number of radio sources and therefore intractable. Instead, we used a greedy feature selection technique [4] to select a subset of highly relevant radio sources to be used in the Euclidean distance calculation. This greedy technique, albeit not optimal, has been shown to work well in practice [4].…”
Section: Localization Algorithms (mentioning, confidence: 99%)
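The greedy technique this citing paper describes can be read as a forward-selection loop: starting from an empty set of radio sources, repeatedly add the source that most reduces the localization error of a nearest-fingerprint (Euclidean distance) matcher. The sketch below is a minimal illustration under that reading; the array names, the nearest-neighbour evaluator, and the fixed budget `k` are assumptions, not the cited system's actual code.

```python
import numpy as np

def nn_localization_error(train_rssi, train_pos, test_rssi, test_pos, sources):
    """Mean position error of nearest-fingerprint matching restricted to `sources`."""
    errors = []
    for rssi, pos in zip(test_rssi, test_pos):
        # Euclidean distance computed only over the selected radio sources
        dists = np.linalg.norm(train_rssi[:, sources] - rssi[sources], axis=1)
        errors.append(np.linalg.norm(train_pos[np.argmin(dists)] - pos))
    return float(np.mean(errors))

def greedy_source_selection(train_rssi, train_pos, test_rssi, test_pos, k):
    """Greedy forward selection: add the radio source that most reduces error."""
    selected = []
    remaining = list(range(train_rssi.shape[1]))
    while len(selected) < k and remaining:
        best_err, best_src = min(
            (nn_localization_error(train_rssi, train_pos,
                                   test_rssi, test_pos, selected + [s]), s)
            for s in remaining)
        selected.append(best_src)
        remaining.remove(best_src)
    return selected
```

As the quote notes, this greedy procedure is not guaranteed to find the optimal subset, but it avoids the exponential search over all source combinations.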
“…Table 3 summarizes the number of fingerprints collected per floor for each of the buildings. The different number of fingerprints collected per floor is the result of us increasing the number of training and testing fingerprints collected with every new building in the hope of achieving even better localization results. Ironically, as we show in Section 6.3.2, the number of training fingerprints has little bearing on the localization accuracy.…”
Section: Data Collection (mentioning, confidence: 99%)
“…In addition, those selected features are very important, since they can provide novel biological knowledge and insights for biologists to further investigate how they are related to the disease phenotypes. Clearly, there are some standard feature selection techniques (Blum and Langley, 1997; Kohavi and John, 1997; Guyon and Elisseeff, 2003) and classification techniques which can automatically select important features from a large number of input features. For example, some simple and well-known filter-based feature selection methods select features based on the relationship between two random variables.…”
Section: Introduction (mentioning, confidence: 99%)
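One concrete example of such a filter (an assumed illustration, not taken from the citing paper) is to score each feature by a simple statistic relating it to the class label, such as the absolute Pearson correlation, and keep the highest-scoring features.

```python
import numpy as np

def filter_rank(X, y, top_k):
    """Filter-style feature selection: score each column of X by the absolute
    Pearson correlation with the label vector y (labels assumed numeric),
    then return the indices of the top_k highest-scoring features."""
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:top_k].tolist()
```

Because each feature is scored independently of the learning algorithm, this kind of filter is cheap but ignores feature interactions.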
“…Most of the feature selection algorithms approach the task as a search problem, where each state in the search specifies a distinct subset of the possible attributes (Blum and Langley, 1997). The search procedure is combined with a criterion in order to evaluate the merit of each candidate subset of attributes.…”
Section: Introduction (mentioning, confidence: 99%)
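Read this way, a feature selector needs only two components: a search procedure over subset-states and an evaluation criterion for each candidate subset. A minimal sketch, assuming greedy hill-climbing as the search procedure and a caller-supplied merit function as the criterion, could look like this:

```python
def subset_search(n_features, evaluate, max_size=None):
    """Search over feature subsets: each state is a frozenset of feature
    indices, successor states add one unused feature, and evaluate(subset)
    is the merit criterion (higher is better; it must accept the empty set).
    Greedy hill-climbing is used here purely as one possible search procedure."""
    max_size = max_size if max_size is not None else n_features
    current, current_score = frozenset(), evaluate(frozenset())
    while len(current) < max_size:
        successors = [current | {f} for f in range(n_features) if f not in current]
        if not successors:
            break
        best = max(successors, key=evaluate)
        best_score = evaluate(best)
        if best_score <= current_score:   # no successor improves the criterion: stop
            break
        current, current_score = best, best_score
    return current, current_score
```

Swapping in a different successor rule or evaluation function yields other members of the same family (backward elimination, best-first search, wrapper versus filter criteria) without changing the overall framing.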
“…It searches for features better suited to the mining algorithm, aiming to improve mining performance, but it is also more computationally expensive (Langley, 1994; Kohavi and John, 1997) than filter models. Feature ranking (FR), also called feature weighting (Blum and Langley, 1997; Guyon and Elisseeff, 2003), assesses individual features and assigns them weights according to their degrees of relevance, while feature subset selection (FSS) evaluates the goodness of each found feature subset. (Unusually, some search strategies in combination with subset evaluation can provide a ranked list.)…”
Section: Introduction (mentioning, confidence: 99%)
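The FR/FSS distinction drawn here corresponds to two different interfaces, sketched below under assumed names: a ranking function scores each feature in isolation and returns per-feature weights, while a subset selector evaluates each candidate subset as a whole with a merit function.

```python
import numpy as np

def feature_weights(X, y, relevance):
    """Feature ranking (FR): score every feature independently with a
    relevance(feature_column, y) function and return per-feature weights."""
    return np.array([relevance(X[:, j], y) for j in range(X.shape[1])])

def best_subset(candidate_subsets, merit):
    """Feature subset selection (FSS): evaluate each whole candidate subset
    with a merit(subset) function and return the best-scoring one."""
    return max(candidate_subsets, key=merit)

# Hypothetical usage, assuming X is an (n_samples, n_features) array and y numeric labels:
# weights = feature_weights(X, y, lambda col, t: abs(np.corrcoef(col, t)[0, 1]))
# ranking = np.argsort(weights)[::-1]                    # FR output: ordered feature list
# chosen = best_subset([{0, 2}, {1, 3}],
#                      lambda s: weights[list(s)].mean())  # FSS output: one subset
```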