Clinically, cancer is a complex family of diseases. From the viewpoint of molecular biology, cancer is a genetic disease resulting from abnormal gene expression. This alteration of gene expression can result from DNA instability, such as translocation, amplification, deletion, or point mutation. A large amplification or deletion of a chromosome region can be readily detected by two methods: loss of heterozygosity (LOH) analysis and comparative genomic hybridization (CGH). Differences in gene expression patterns can be monitored by high-throughput microarray analysis. These technologies have produced an enormous amount of data, and the data pool continues to grow at a remarkable rate. To help investigators mine useful information from these data deposits, new data storage and analysis tools must be developed. Two value-added databases are constructed for this purpose. The first contains information on genes located in the unstable regions of cancer cell genomes, based on data accumulated from LOH and CGH experiments; the second contains gene expression profiles of cancer cells obtained from microarray analysis. An automated system that retrieves information on genes of interest, compares it with known databases, analyzes and predicts protein functions, and groups genes with the same function will be integrated into the database workflow. An automatic update system will be installed and run once the two databases are set up. The system also retains the flexibility to be modified and to accept new data obtained from new techniques. Our goal is to help biologists find the needles in the haystack, that is, to identify the genuine cancer-related genes (oncogenes or tumor suppressor genes) for further research.
Discovering frequent subsequences as patterns in large sequence databases is an important task in a variety of applications, such as biological sequence analysis. In general, the patterns to be discovered may occur only partially and asynchronously in the sequences, and may even contain gaps. In addition, the locations and frequencies of the patterns may be of interest for subsequent analysis. A further concern is how to enumerate candidate patterns for evaluation without an exponential increase in computation time. A modified periodicity transform is proposed to meet these requirements. For a synthetic sequence of length 300K, mining all partial periodic patterns of length 5 takes 4 seconds. With minor modification, the approach can handle asynchronous partial periodic patterns of arbitrary length. The approach is by nature suited to distributed environments, and a prototype system has been developed in Java for distributed computing. The system can be regarded as a feature extractor for an early stage of sequence analysis.
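To make the notion of a partial periodic pattern concrete, the following is a minimal sketch in Python. It is not the modified periodicity transform described above, whose details are not given here; it is a simple counting approach chosen for illustration. The function name `partial_periodic_symbols`, its parameters, and the example sequence are all hypothetical. For a fixed period, the sketch cuts the sequence into consecutive segments and reports every symbol that recurs at the same offset in at least a given fraction of the segments, leaving the other offsets unconstrained.

```python
from collections import Counter


def partial_periodic_symbols(seq, period, min_conf):
    """Illustrative only: find symbols that recur at a fixed offset.

    Returns a dict mapping (offset, symbol) to its confidence, i.e. the
    fraction of complete period segments in which `symbol` appears at
    `offset`. Only pairs with confidence >= min_conf are kept.
    """
    n_segments = len(seq) // period  # ignore a trailing incomplete segment
    if n_segments == 0:
        return {}

    # Count how often each symbol appears at each offset within a segment.
    counts = Counter()
    for i in range(n_segments * period):
        counts[(i % period, seq[i])] += 1

    # Keep the (offset, symbol) pairs that are frequent enough.
    return {
        key: count / n_segments
        for key, count in counts.items()
        if count / n_segments >= min_conf
    }


if __name__ == "__main__":
    # Three segments of length 5: ACGTA, ACGTT, ACGTG.
    # Offsets 0..3 are stable, offset 4 varies, so the pattern is partial.
    hits = partial_periodic_symbols("ACGTAACGTTACGTG", period=5, min_conf=0.8)
    for (offset, symbol), conf in sorted(hits.items()):
        print(offset, symbol, conf)
```

Because each segment contributes to the counts independently, the counting step could be split across workers and the partial counts summed afterwards, which is one way to read the remark that this kind of mining suits distributed environments. Handling asynchronous patterns (shifted starting positions) or gaps would need more than this sketch provides.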