A new classification and regression tool, Random Forest, is introduced and investigated for predicting a compound's quantitative or categorical biological activity based on a quantitative description of the compound's molecular structure. Random Forest is an ensemble of unpruned classification or regression trees created by using bootstrap samples of the training data and random feature selection in tree induction. Prediction is made by aggregating (majority vote or averaging) the predictions of the ensemble. We built predictive models for six cheminformatics data sets. Our analysis demonstrates that Random Forest is a powerful tool capable of delivering performance that is among the most accurate methods to date. We also present three additional features of Random Forest: built-in performance assessment, a measure of relative importance of descriptors, and a measure of compound similarity that is weighted by the relative importance of descriptors. It is the combination of relatively high prediction accuracy and its collection of desired features that makes Random Forest uniquely suited for modeling in cheminformatics.
In this paper we initiate a new area of study dealing with the best way to search a possibly unbounded region for an object. The model for our search algorithms is that we must pay costs proportional to the distance of the next probe position relative to our current position. This model is meant to give a realistic cost measure for a robot moving in the plane. We also examine the e ect of decreasing the amount of a priori information given to search problems. Problems of this type are very simple analogues of non-trivial problems on searching an unbounded region, processing digitized images, and robot navigation. We show that for some simple search problems, the relative information of knowing the general direction of the goal is much higher than knowing the distance to the goal.
The efficiency of A* searching depends on the quality of the lower bound estimates of the solution cost. Pattern databases enumerate all possible subgoals required by any solution, subject to constraints on the subgoal size. Each subgoal in the database provides a tight lower bound on the cost of achieving it. For a given state in the search space, all possible subgoals are looked up in the pattern database, with the maximum cost over all lookups being the lower bound. For sliding tile puzzles, the database enumerates all possible patterns containing N tiles and, for each one, contains a lower bound on the distance to correctly move all N tiles into their correct final location. For the 15-Puzzle, iterative-deepening A* with pattern databases (N = 8) reduces the total number of nodes searched on a standard problem set of 100 positions by over 1000-fold.
Many heuristic algorithms have been proposed for graph coloring. The simplest is perhaps the greedy algorithm. Many variations have been proposed for this algorithm at various levels of sophistication, but it is generally assumed that the coloring will occur in a single attempt. We note that if a new permutation of the vertices is chosen which respects the independent sets of a previous coloring, then applying the greedy algorithm will result in a new coloring in which the number of colors used does not increase, yet may decrease. We introduce several heuristics for generating new permutations that are fast when implemented and e ective in reducing the coloring number. The resulting Iterated Greedy algorithm(IG) can obtain colorings in the range 100 to 103 on graphs in G 1000; 1 2. More interestingly, it can optimally color k-colorable graphs with k up to 60 and n = 1000. We couple this algorithm with several other coloring algorithms, including a modi ed TABU search, and one that tries to nd large independent sets using a pruned backtrack. With these combined algorithms we nd 86 and 87 colorings for G 1000; 1 2 , but nd paradoxical results on other classes. Finally, we explore the areas of di culty in probabilistic graph space under several parameterizations. We check our system on k-colorable graphs in G 300;p;k for 0:05 p 0:95 and 2 k 105. We perform many experiments on G 1000; 1 2 ;k. We nd a narrow ridge where the algorithms fail to nd the speci ed coloring, but easy success everywhere else. We test our algorithm on graphs that follow the ridge boundary for p = 1 2 for n up to 10000 vertices.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.