Analysis procedures are needed to extract useful information from the large amount of gene expression data that is becoming available. This work describes a set of analytical tools and their application to yeast cell cycle data. The components of our approach are (1) a similarity measure that reduces the number of false positives, (2) a new clustering algorithm designed specifically for grouping gene expression patterns, and (3) an interactive graphical cluster analysis tool that allows user feedback and validation. We use the clusters generated by our algorithm to summarize genome-wide expression and to initiate supervised clustering of genes into biologically meaningful groups.

The advent of oligonucleotide arrays and cDNA microarrays (Fodor et al. 1993; Schena et al. 1995; Lockhart et al. 1996) has enabled biologists to measure the expression levels of thousands of genes in parallel. These technologies have raised many exciting questions in experimental design and data analysis. One type of experiment involves monitoring gene expression while a cell undergoes some biological process. The yeast Saccharomyces cerevisiae makes an excellent organism for this type of experiment because its genome has been sequenced and all of the ORFs have been determined. Some of the processes in yeast that have recently been explored are the diauxic shift (DeRisi et al. 1997), sporulation (Chu et al. 1998) and the cell cycle (Cho et al. 1998; Spellman et al. 1998). Each study determines the expression level of every ORF at a series of time points. The resulting data set must be analyzed to determine the roles of specific genes in the process of interest.

Once the expression levels have been determined by experimental means, it is important to find genes with similar expression patterns (coexpressed genes). There are two reasons for interest in coexpressed genes. First, there is evidence that many functionally related genes are coexpressed (Spellman et al. 1998). For example, genes coding for elements of a protein complex are likely to have similar expression patterns. Figure 1 illustrates one such case. Hence, grouping ORFs with similar expression levels can reveal the function of previously uncharacterized genes. The second reason for interest in coexpressed genes is that coexpression may reveal much about the genes' regulatory systems. For example, if a single regulatory system controls two genes, then we might expect the genes to be coexpressed. In general, there is likely to be a relationship between coexpression and coregulation.

In this work, we present a systematic analysis procedure to identify, group, and analyze coexpressed genes. The procedure is applied to the seventeen time-point mitotic cell cycle data (Cho et al. 1998) available at http://genomics.stanford.edu/yeast/cellcycle.html.
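
As a rough illustration of what finding coexpressed genes can look like in practice, the sketch below scores pairs of expression profiles with the ordinary Pearson correlation and reports pairs above a cutoff. Pearson correlation is used here only as a familiar stand-in; it is an assumption of this example and not necessarily the similarity measure developed in this work. The function names, the 0.9 threshold, and the toy profiles are likewise hypothetical and are not taken from the cell cycle data set.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    norm = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if norm == 0.0:
        # A flat profile has no defined correlation; treat it as "not similar".
        return 0.0
    return sum(a * b for a, b in zip(dx, dy)) / norm


def coexpressed_pairs(profiles, threshold=0.9):
    """Return gene-name pairs whose profiles correlate at or above threshold.

    `profiles` maps a gene name to its list of expression levels over time.
    The threshold is an arbitrary illustrative value.
    """
    return [
        (g1, g2)
        for g1, g2 in combinations(sorted(profiles), 2)
        if pearson(profiles[g1], profiles[g2]) >= threshold
    ]


# Hypothetical toy profiles (made-up numbers for illustration only).
toy = {
    "geneA": [1.0, 2.0, 4.0, 2.0, 1.0],
    "geneB": [2.0, 4.0, 8.0, 4.0, 2.0],  # same shape as geneA, twice the scale
    "geneC": [4.0, 2.0, 1.0, 2.0, 4.0],  # peaks where geneA dips
}

print(coexpressed_pairs(toy))
```

In this toy input only geneA and geneB are reported, because correlation ignores the difference in scale and responds to the shared shape of the two profiles. A plain correlation cutoff of this kind can pair genes that agree only because of a single outlying time point, which is one way false positives arise.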
Processing the Data

A brief description of the cell cycle experiment is necessary to understand the data set. The detailed experimental protocol is given in the original work (Cho et al. 1998). Cells in a yeast culture were synchronized, and c...
Improvement of variant calling in next-generation sequence data requires a comprehensive, genome-wide catalog of high-confidence variants called in a set of genomes for use as a benchmark. We generated deep, whole-genome sequence data of 17 individuals in a three-generation pedigree and called variants in each genome using a range of currently available algorithms. We used haplotype transmission information to create a phased "Platinum" variant catalog of 4.7 million single-nucleotide variants (SNVs) plus 0.7 million small (1-50 bp) insertions and deletions (indels) that are consistent with the pattern of inheritance in the parents and 11 children of this pedigree. Platinum genotypes are highly concordant with the current catalog of the National Institute of Standards and Technology for both SNVs (>99.99%) and indels (99.92%) and add a validated truth catalog that has 26% more SNVs and 45% more indels. Analysis of 334,652 SNVs that were consistent between informatics pipelines yet inconsistent with haplotype transmission ("nonplatinum") revealed that the majority of these variants are de novo and cell-line mutations or reside within previously unidentified duplications and deletions. The reference materials from this study are a resource for objective assessment of the accuracy of variant calls throughout genomes.