High-throughput technologies facilitate the measurement of vast numbers of biological variables, thereby providing enormous amounts of multivariate data with which to model biological processes [1]. In translational genomics, phenotype classification via gene expression promises highly discriminatory molecular-based diagnosis, and regulatory-network modeling offers the potential to develop therapeutic strategies based on genomic decision making using classical engineering disciplines such as control theory [2]. Yet one must recognize the obstacles inherent in dealing with extremely large numbers of interacting variables in a nonlinear, stochastic, and redundant system that reacts aggressively to any attempt to probe it: a living system. In particular, large data sets may have the perverse effect of limiting the amount of scientific information that can be extracted, because the ability to build models with scientific validity is negatively impacted by an increasing ratio between the number of variables and the sample size. Our specific interest is in how this dimensionality problem creates the need for feature selection while making feature-selection algorithms less reliable with small samples.
Two well-appreciated issues tend to confound feature selection: redundancy and multivariate prediction. Both can be illustrated by a naïve approach to feature selection: consider all features in isolation, rank them on the basis of their individual predictive capabilities, select some of the features with the highest individual performances, and then apply a standard classification rule to these features, the reasoning being that these are the best predictors of the class. Redundancy arises because the top-performing features might be strongly related (say, because they share a similar regulatory pathway), and using more than one or two of them may provide little added benefit.
The issue of multivariate prediction arises because top-performing single features may not be significantly more beneficial when used in combination with other features, whereas features that perform poorly when used alone may provide outstanding classifi-
Data preprocessing is an indispensable step in effective data analysis. It prepares data for data mining and machine learning, which aim to turn data into business intelligence or knowledge. Feature selection is a preprocessing technique commonly used on high-dimensional data. Feature selection studies how to select a subset or list of attributes or variables that are used to construct models describing data. Its purposes include reducing dimensionality, removing irrelevant and redundant features, reducing the amount of data needed for learning, improving algorithms' predictive accuracy, and increasing the constructed models' comprehensibility.
Feature selection is different from feature extraction (for example, principal component analysis, singular-value decomposition, manifold learning, and factor analysis), which creates new (extracted) features that are combinations of th...
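The two confounding issues described above, redundancy and multivariate prediction, can be made concrete with a small sketch. Everything in it is an illustrative assumption rather than anything taken from the cited articles: the toy data set, the function names `build_toy_data` and `table_accuracy`, and the choice of a lookup-table classifier scored by resubstitution accuracy. Features 0 and 1 determine the class only jointly (XOR), feature 2 agrees with the class 75% of the time, and feature 3 is an exact copy of feature 2.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_toy_data():
    """Hypothetical data: the class is x1 XOR x2; x3 is a noisy copy of the class; x4 duplicates x3."""
    rows = []
    for x1 in (0, 1):
        for x2 in (0, 1):
            y = x1 ^ x2
            for copy in range(4):
                x3 = y if copy < 3 else 1 - y  # agrees with the class in 3 of 4 copies
                x4 = x3                        # fully redundant with x3
                rows.append(((x1, x2, x3, x4), y))
    return rows


def table_accuracy(rows, subset):
    """Resubstitution accuracy of a lookup-table classifier restricted to the features in `subset`."""
    counts = defaultdict(Counter)
    for x, y in rows:
        counts[tuple(x[i] for i in subset)][y] += 1
    # For each observed feature pattern, predict its majority class.
    correct = sum(max(c.values()) for c in counts.values())
    return correct / len(rows)


rows = build_toy_data()
n_features = 4

# Naive approach: score each feature alone, then keep the two best.
single = [table_accuracy(rows, (i,)) for i in range(n_features)]
naive_pair = tuple(sorted(sorted(range(n_features), key=lambda i: -single[i])[:2]))
naive_score = table_accuracy(rows, naive_pair)

# Exhaustive search over all pairs, feasible only because the example is tiny.
best_pair = max(combinations(range(n_features), 2),
                key=lambda s: table_accuracy(rows, s))
best_score = table_accuracy(rows, best_pair)

print("single-feature accuracies:", single)
print("naive top-2 pair:", naive_pair, "accuracy:", naive_score)
print("best pair:", best_pair, "accuracy:", best_score)
```

In this toy setting the naive ranking picks the two redundant features and gains nothing from the second one, while the two features that look useless in isolation classify perfectly together. Resubstitution accuracy is used only to keep the sketch short; with small real samples it would be an optimistic estimate.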
Sleeping sickness is a fatal disease caused by the protozoan parasite Trypanosoma brucei (Tb). Inosine-5'-monophosphate dehydrogenase (IMPDH) has been proposed as a potential drug target, since it maintains the balance between guanylate deoxynucleotide and ribonucleotide levels that is pivotal for the parasite. Here we report the structure of TbIMPDH at room temperature utilizing free-electron laser radiation on crystals grown in living insect cells. The 2.80 Å resolution structure reveals the presence of ATP and GMP at the canonical sites of the Bateman domains, the latter in a previously unknown coordination mode. Consistent with previously reported IMPDH complexes harboring guanosine nucleotides at the second canonical site, TbIMPDH forms a compact oligomer structure, supporting a nucleotide-controlled conformational switch that allosterically modulates the catalytic activity. The oligomeric TbIMPDH structure we present here reveals the potential of in cellulo crystallization to identify genuine allosteric co-factors from a natural reservoir of specific compounds.
Clustering is an important data exploration task, and its use in data mining is growing very fast. Traditional clustering algorithms that no longer meet data mining requirements are increasingly being modified. The numerous clustering algorithms can be divided into several categories, two prominent ones being distance-based and density-based (e.g., K-means and DBSCAN, respectively). K-means is fast, easy to implement, and converges to local optima almost surely, but it is also easily affected by noise. Density-based clustering, on the other hand, can find arbitrarily shaped clusters and handles noise well, but it is slow in comparison because of the neighborhood search for each data point, and its density threshold is difficult to set properly. In this paper, we propose BRIDGE, which efficiently merges the two by exploiting the advantages of one to counter the limitations of the other and vice versa. BRIDGE enables DBSCAN to handle very large data efficiently and improves the quality of K-means clusters by removing the noisy points. It also helps the user set the density threshold parameter properly. We further show that other clustering algorithms can be merged using a similar strategy. An example given in the paper merges BIRCH clustering with DBSCAN.
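The claim that removing noisy points improves a centroid-based cluster can be illustrated with a minimal sketch. It is not the BRIDGE algorithm, whose steps the abstract does not give. It only assumes a simplified density test in the spirit of DBSCAN's core-point criterion (DBSCAN itself would also keep border points), and the names `neighbor_count`, `drop_sparse_points`, and `centroid`, the sample points, and the values of `eps` and `min_pts` are all hypothetical.

```python
import math


def neighbor_count(points, i, eps):
    """Count the other points lying within distance eps of points[i]."""
    return sum(
        1
        for j, q in enumerate(points)
        if j != i and math.dist(points[i], q) <= eps
    )


def drop_sparse_points(points, eps, min_pts):
    """Keep only points with at least min_pts other points within eps (assumed density test)."""
    return [
        p for i, p in enumerate(points)
        if neighbor_count(points, i, eps) >= min_pts
    ]


def centroid(points):
    """Coordinate-wise mean, i.e. the cluster center that K-means would compute."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


# One tight cluster near the origin plus a single far-away noise point.
cluster = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05)]
noise = [(5.0, 5.0)]
points = cluster + noise

raw_center = centroid(points)  # pulled toward the noise point
clean_points = drop_sparse_points(points, eps=0.5, min_pts=2)
clean_center = centroid(clean_points)  # computed after the density filter

print("center with noise:   ", raw_center)
print("center without noise:", clean_center)
```

The neighborhood count compares every point with every other point, which is the quadratic cost the abstract attributes to density-based methods; restricting that search to one K-means cluster at a time is one plausible reading of how the two approaches could help each other, stated here as an assumption.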