Gene expression microarrays are the most commonly available source of high-throughput biological data. They are widely employed for studying many different aspects of gene regulation and function, ranging from the global cell-cycle control of microorganisms to cancer in humans. Gene expression microarray experiments often generate data sets with multiple missing values. Many algorithms for gene expression data analysis require a complete data matrix, and therefore the accurate estimation of missing entries is crucial for their optimal usage. This requirement has driven the development of various microarray imputation methods. However, most of these approaches are not particularly suitable for time series expression profiles. Moreover, their performance is not satisfactory for data sets with high rates of missing data or small numbers of samples. Another drawback of these methods is that they base their estimates solely on a single expression matrix and use no additional data sources to impute the missing entries. Motivated by these limitations, we propose an imputation algorithm that is particularly suited to estimating missing values in gene expression time series data using information contained in multiple related data sets. The proposed algorithm first identifies an appropriate set of estimation matrices by using the Dynamic Time Warping (DTW) distance to measure similarities between gene expression matrices. Next, it employs the same distance measure to evaluate the similarity between gene expression profiles and applies a hybrid aggregation algorithm to combine the inter-gene similarities across the selected matrices in order to identify estimation genes. Finally, the expression profiles of those estimation genes are used to obtain the final imputation.
The estimation accuracy of the proposed algorithm, called Integrative DTW-based Imputation (IDTWimpute), is benchmarked against that of two other imputation methods (KNNimpute and DTWimpute) in terms of root mean squared difference. In addition, the impact of the three methods on the quality of gene clustering is evaluated by using the k-means and k-medoids clustering algorithms and two different cluster validation measures.
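To make the ingredients above concrete, the following is a minimal sketch of a DTW distance between two profiles, a nearest-profile imputation step, and a root mean squared difference. It is an illustration under stated assumptions, not the IDTWimpute algorithm itself: the function names (`dtw_distance`, `impute_gene`, `rms_difference`), the absolute-difference local cost, the plain averaging over the `k` nearest complete profiles, and the unnormalized RMS form are all assumptions, since the abstract does not specify these details.

```python
import math


def dtw_distance(a, b):
    """Textbook dynamic-programming DTW distance.

    Assumption: the local cost is the absolute difference between values.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] is matched again with b[j-1]
                cost[i][j - 1],      # b[j-1] is matched again with a[i-1]
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


def impute_gene(target, candidates, k=2):
    """Fill None entries of `target` from its k DTW-nearest candidates.

    Assumptions (hypothetical simplifications): candidates are complete
    profiles on the same time grid as `target`, similarity is computed on
    the observed values of `target` only, and the estimate is a plain mean.
    """
    observed = [v for v in target if v is not None]
    ranked = sorted(candidates, key=lambda c: dtw_distance(observed, c))
    neighbours = ranked[:k]
    return [
        v if v is not None else sum(c[t] for c in neighbours) / len(neighbours)
        for t, v in enumerate(target)
    ]


def rms_difference(true_values, estimates):
    """Unnormalized root mean squared difference between two value lists."""
    squared = [(t - e) ** 2 for t, e in zip(true_values, estimates)]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    profile = [1.0, None, 3.0]
    pool = [[1.0, 2.0, 3.0], [1.0, 2.2, 3.0], [9.0, 9.0, 9.0]]
    filled = impute_gene(profile, pool, k=2)
    print(filled)
    print(rms_difference([2.0], [filled[1]]))
```

Because DTW aligns sequences of unequal length, the observed part of an incomplete profile can be compared directly with complete profiles, which is the property this sketch relies on.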
Gene expression microarrays are the most commonly available source of high-throughput biological data. Each microarray experiment measures the expression levels of a set of genes under a number of different experimental conditions or at a number of time points. Integrating results from different microarray experiments into a specific analysis is an important yet challenging problem. Direct integration of microarrays is often ineffective because of the diverse types of experiment-specific variation. In this paper, we propose a new hybrid method that is specially suited to the integration analysis of time series expression data across different experiments. The proposed algorithm uses the Dynamic Time Warping (DTW) distance to measure the similarity between time series expression profiles. First, for each considered time series data set, a square distance matrix is built that contains the DTW distances calculated between the expression profiles of each gene pair. Then, using a hybrid aggregation algorithm, the obtained DTW distance matrices are transformed into a single matrix consisting of one overall DTW distance per gene pair. The values of the resulting matrix can be interpreted as the consensus DTW distances supported by all the experiments. These may be analyzed further to help find relationships among the genes. The proposed method is validated on gene expression time series data from two independent studies examining the global cell-cycle control of gene expression in the fission yeast Schizosaccharomyces pombe.
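The two steps described above, one gene-by-gene DTW distance matrix per experiment followed by aggregation into a single consensus matrix, can be sketched as follows. This is a hedged illustration only: the abstract does not define the hybrid aggregation algorithm, so an element-wise mean (`aggregate_mean`, a hypothetical name) is swapped in as a simple stand-in, and the absolute-difference local cost in `dtw_distance` is likewise an assumption.

```python
import math


def dtw_distance(a, b):
    """Textbook DTW distance with an absolute-difference local cost (assumed)."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


def dtw_distance_matrix(profiles):
    """Square, symmetric matrix of DTW distances between all gene pairs."""
    n = len(profiles)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = dtw_distance(profiles[i], profiles[j])
            dist[i][j] = dist[j][i] = d
    return dist


def aggregate_mean(matrices):
    """Stand-in aggregation: element-wise mean across experiments.

    NOT the hybrid aggregation algorithm from the text, whose details are
    not given here. All matrices must cover the same genes in the same order.
    """
    n = len(matrices[0])
    count = len(matrices)
    return [
        [sum(m[i][j] for m in matrices) / count for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    # Two hypothetical experiments over the same three genes; the number of
    # time points differs between experiments, which DTW tolerates.
    experiment_a = [[0, 1, 2], [0, 1, 2], [5, 5, 5]]
    experiment_b = [[0, 2, 4, 6], [0, 2, 4, 6], [1, 1, 1, 1]]
    per_experiment = [dtw_distance_matrix(e) for e in (experiment_a, experiment_b)]
    consensus = aggregate_mean(per_experiment)
    for row in consensus:
        print(row)
```

Working on per-experiment distance matrices rather than on the raw expression values means the experiments only need to share the same genes, not the same sampling times or measurement scale within a profile comparison.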