Proteins are generally classified into the following 12 subcellular locations: 1) chloroplast, 2) cytoplasm, 3) cytoskeleton, 4) endoplasmic reticulum, 5) extracellular, 6) Golgi apparatus, 7) lysosome, 8) mitochondria, 9) nucleus, 10) peroxisome, 11) plasma membrane, and 12) vacuole. Because the function of a protein is closely correlated with its subcellular location, and because new protein sequences are entering databanks at a rapidly increasing rate, it is vitally important for both basic research and the pharmaceutical industry to establish a high-throughput tool for predicting protein subcellular location. In this paper, a new concept, the so-called "functional domain composition", is introduced. Based on this concept, a protein can be represented as a vector in a high-dimensional space, where each of the clustered functional domains derived from the protein universe serves as a vector base. With such a novel representation for a protein, the support vector machine ...
According to their localization or compartment in a cell, proteins are generally classified into the following 12 categories: 1) chloroplast, 2) cytoplasm, 3) cytoskeleton, 4) endoplasmic reticulum, 5) extracellular, 6) Golgi apparatus, 7) lysosome, 8) mitochondria, 9) nucleus, 10) peroxisome, 11) plasma membrane, and 12) vacuole. Given the sequence of a protein, how can we predict which category or subcellular location it belongs to? This is certainly a very important problem because the subcellular location of a protein is closely correlated with its biological function. Although protein subcellular location can be determined experimentally, doing so is both time-consuming and costly. Because the number of sequences entering databanks has been increasing rapidly (e.g., SWISS-PROT (1) contained only 3,939 sequence entries in 1986 but 80,000 in 1999), the problem has become an urgent challenge.
In particular, many more new protein sequences are expected soon because of the recent success of the human genome project, which has provided an enormous amount of genomic information in the form of 3 billion base pairs assembled into tens of thousands of genes. The challenge will therefore become even more urgent and critical. Indeed, many efforts have been made to develop computational methods for quickly predicting the subcellular locations of proteins (2-13). It is instructive to point out that most of these algorithms are based on the amino acid composition alone, without any sequence-order effects, and some (9, 12, 13) are based on the pseudo amino acid composition, which incorporates partial sequence-order effects. To further improve the prediction quality, a logical and key step would be to find an effective way to incorporate the sequence-order effects. The present study was initiated in an attempt to explore a different approach to incorporating these kinds of effects. The core of the new approach is ...
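The vector representation described in the abstract above can be illustrated with a minimal sketch. It assumes a binary encoding (1 if a domain is found in the protein, 0 otherwise), which the text here does not specify; the function name `domain_composition` and the domain identifiers are hypothetical placeholders, not entries from any real domain database.

```python
from typing import Iterable, List, Sequence


def domain_composition(
    domain_basis: Sequence[str],
    protein_domains: Iterable[str],
) -> List[int]:
    """Encode a protein as a vector with one component per basis domain.

    Assumption: a component is 1 when the domain occurs in the protein
    and 0 otherwise. The source only says each clustered functional
    domain "serves as a vector base".
    """
    present = set(protein_domains)
    return [1 if domain in present else 0 for domain in domain_basis]


# Hypothetical four-domain basis; a real basis would have one entry per
# clustered functional domain in the protein universe.
basis = ["DOM_A", "DOM_B", "DOM_C", "DOM_D"]
vec = domain_composition(basis, ["DOM_C", "DOM_A"])
print(vec)
```

A vector of this kind has a fixed length regardless of sequence length, which is what allows it to be passed to a classifier such as a support vector machine.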
Identifying essential genes in a given organism is important for research on their fundamental roles in organism survival. Furthermore, uncovering the links between core functions or pathways and these essential genes would give deeper insight into the key roles of these genes. In this study, we investigated the essential and non-essential genes reported in a previous study and extracted gene ontology (GO) terms and biological pathways that are important for the determination of essential genes. Through the enrichment theory of GO and KEGG pathways, we encoded each essential/non-essential gene into a vector in which each component represented the relationship between the gene and one GO term or KEGG pathway. To analyze these relationships, the maximum relevance minimum redundancy (mRMR) method was adopted. Then, incremental feature selection (IFS) and a support vector machine (SVM) were employed to extract important GO terms and KEGG pathways. A prediction model was built simultaneously using the extracted GO terms and KEGG pathways, which yielded nearly perfect performance, with a Matthews correlation coefficient of 0.951, for distinguishing essential and non-essential genes. To fully investigate the key factors influencing the fundamental roles of essential genes, the 21 most important GO terms and three KEGG pathways were analyzed in detail. In addition, several genes predicted to be essential by our prediction model are provided in this study. We suggest that this study provides more functional and pathway information on the essential genes and provides a new way to investigate related problems.
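As a rough illustration of the selection step, the sketch below shows incremental feature selection over an already ranked feature list, together with the Matthews correlation coefficient used as the score. It is a sketch under assumptions rather than the authors' pipeline: the ranking would come from mRMR and the `evaluate` callback would train and cross-validate an SVM, but both are left as hypothetical inputs here.

```python
from math import sqrt
from typing import Callable, List, Sequence, Tuple


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the coefficient is taken as 0 when the denominator is 0.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def incremental_feature_selection(
    ranked_features: Sequence[str],
    evaluate: Callable[[Sequence[str]], float],
) -> Tuple[List[str], float]:
    """Score each top-k prefix of a ranked feature list; return the best one.

    `evaluate` is a hypothetical callback that returns a quality score
    (for example, a cross-validated MCC) for a given feature subset.
    """
    best_subset: List[str] = []
    best_score = float("-inf")
    for k in range(1, len(ranked_features) + 1):
        subset = list(ranked_features[:k])
        score = evaluate(subset)
        if score > best_score:  # on ties, the smaller subset is kept
            best_subset, best_score = subset, score
    return best_subset, best_score
```

Keeping the smaller subset on ties is a design choice of this sketch; it favors a shorter list of GO terms and pathways when adding more features does not improve the score.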
Paramutation involves homologous sequence communication that leads to meiotically heritable transcriptional silencing. We demonstrate that mop2 (mediator of paramutation2), which alters paramutation at multiple loci, encodes a gene similar to Arabidopsis NRPD2/E2, the second-largest subunit of plant-specific RNA polymerases IV and V. In Arabidopsis, Pol-IV and Pol-V play major roles in RNA–mediated silencing and a single second-largest subunit is shared between Pol-IV and Pol-V. Maize encodes three second-largest subunit genes: all three genes potentially encode full-length proteins with highly conserved polymerase domains, and each is expressed in multiple overlapping tissues. The isolation of a recessive paramutation mutation in mop2 from a forward genetic screen suggests limited or no functional redundancy of these three genes. Potential alternative Pol-IV/Pol-V–like complexes could provide maize with a greater diversification of RNA–mediated transcriptional silencing machinery relative to Arabidopsis. Mop2-1 disrupts paramutation at multiple loci when heterozygous, whereas previously silenced alleles are only up-regulated when Mop2-1 is homozygous. The dramatic reduction in b1 tandem repeat siRNAs, but no disruption of silencing in Mop2-1 heterozygotes, suggests the major role for tandem repeat siRNAs is not to maintain silencing. Instead, we hypothesize the tandem repeat siRNAs mediate the establishment of the heritable silent state, a process fully disrupted in Mop2-1 heterozygotes. The dominant Mop2-1 mutation, which has a single nucleotide change in a domain highly conserved among all polymerases (E. coli to eukaryotes), disrupts both siRNA biogenesis (Pol-IV–like) and potentially processes downstream (Pol-V–like). These results suggest either the wild-type protein is a subunit in both complexes or the dominant mutant protein disrupts both complexes. Dominant mutations in the same domain in E. coli RNA polymerase suggest a model for Mop2-1 dominance: complexes containing Mop2-1 subunits are non-functional and compete with wild-type complexes.
The Anatomical Therapeutic Chemical (ATC) classification system, recommended by the World Health Organization, categorizes drugs into different classes according to their therapeutic and chemical characteristics. For a set of query compounds, how can we identify which ATC-class (or classes) they belong to? It is an important and challenging problem because the information thus obtained would be quite useful for drug development and utilization. By hybridizing the information on chemical-chemical interactions and chemical-chemical similarities, a novel method was developed for this purpose. The jackknife test on a benchmark dataset of 3,883 drug compounds showed that the overall success rate achieved by the prediction method was about 73% in identifying the drugs among the following 14 main ATC-classes: (1) alimentary tract and metabolism; (2) blood and blood forming organs; (3) cardiovascular system; (4) dermatologicals; (5) genitourinary system and sex hormones; (6) systemic hormonal preparations, excluding sex hormones and insulins; (7) anti-infectives for systemic use; (8) antineoplastic and immunomodulating agents; (9) musculoskeletal system; (10) nervous system; (11) antiparasitic products, insecticides and repellents; (12) respiratory system; (13) sensory organs; (14) various. Such a success rate is substantially higher than the roughly 7% expected from random guessing. It has not escaped our notice that the current method can be straightforwardly extended to identify the drugs for their 2nd-level, 3rd-level, 4th-level, and 5th-level ATC-classifications once statistically significant benchmark data are available for these lower levels.
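The jackknife (leave-one-out) evaluation and the random-guess baseline mentioned above can be sketched as follows. This is a simplified illustration, not the published method: it assumes each compound has a single class label, whereas the abstract notes that a drug may belong to more than one ATC-class, and the `predict` callback stands in for the interaction- and similarity-based predictor, which is not described here.

```python
from typing import Callable, Sequence, Tuple

Sample = Tuple[str, int]  # (hypothetical compound identifier, class label)


def jackknife_success_rate(
    samples: Sequence[Sample],
    predict: Callable[[Sequence[Sample], str], int],
) -> float:
    """Leave each sample out in turn, predict it from the rest, and
    return the fraction of samples predicted correctly."""
    correct = 0
    for i, (query, label) in enumerate(samples):
        # Training set is every sample except the one being queried.
        training = list(samples[:i]) + list(samples[i + 1:])
        if predict(training, query) == label:
            correct += 1
    return correct / len(samples)


# With 14 equally likely classes, a uniform random guess is right
# 1/14 of the time, i.e. roughly 7%.
random_guess_rate = 1 / 14
```

Under this single-label simplification, a success rate of about 73% is roughly ten times the uniform random-guess rate.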