A New Algorithm to Optimize Maximal Information Coefficient

Chen, Yuan; Zeng, Ying; Luo, Feng; Yuan, Zheming

doi:10.1371/journal.pone.0157567

Cited by 27 publications

(30 citation statements)

References 25 publications

“…MIC can capture dependence between pairs of variables, including both functional and nonfunctional relationships. However, the ApproxMaxMI method provided by Reshef et al (2011) results in a larger MIC score for paired variables under finite-sample conditions (Chen et al, 2016). Here, we use the improved algorithm ChiMIC to calculate the MIC value (Chen et al, 2016).…”

Section: Methods (mentioning)

confidence: 99%

A Novel Method to Efficiently Highlight Nonlinearly Expressed Genes

et al. 2020

Self Cite

For precision medicine, there is a need to identify genes that accurately distinguish the physiological state or response to a particular therapy, but this can be challenging. Many methods of analyzing differential expression have been established and applied to this problem, such as t-test, edgeR, and DESeq2. A common feature of these methods is their focus on a linear relationship (differential expression) between gene expression and phenotype. However, they may overlook nonlinear relationships due to various factors, such as the degree of disease progression, sex, age, ethnicity, and environmental factors. Maximal information coefficient (MIC) was proposed to capture a wide range of associations of two variables in both linear and nonlinear relationships. However, with MIC it is difficult to highlight genes with nonlinear expression patterns as the genes giving the most strongly supported hits are linearly expressed, especially for noisy data. It is thus important to also efficiently identify nonlinearly expressed genes in order to unravel the molecular basis of disease and to reveal new therapeutic targets. We propose a novel nonlinearity measure called normalized differential correlation (NDC) to efficiently highlight nonlinearly expressed genes in transcriptome datasets. Validation using six real-world cancer datasets revealed that the NDC method could highlight nonlinearly expressed genes that could not be highlighted by t-test, MIC, edgeR, and DESeq2, although MIC could capture nonlinear correlations. The classification accuracy indicated that analysis of these genes could adequately distinguish cancer and paracarcinoma tissue samples. Furthermore, the results of biological interpretation of the identified genes suggested that some of them were involved in key functional pathways associated with cancer progression and metastasis. All of this evidence suggests that these nonlinearly expressed genes may play a central role in regulating cancer progression.

“…However, the ApproxMaxMI method provided by Reshef et al (2011) results in a larger MIC score for paired variables under finite-sample conditions (Chen et al, 2016). Here, we use the improved algorithm ChiMIC to calculate the MIC value (Chen et al, 2016). The NDC score for a pair of data series x (gene) and y (phenotype) is defined as follows:…”

Section: Methods (mentioning)

confidence: 99%

“…Given n = 100, the MIC score for independent paired variables should be zero, and the corresponding partition should be a 2 × 2 grid. However, the ApproxMaxMI algorithm tends to fall into the maximal grid size (100^0.6 ≈ 16), the corresponding partition is a 2 × 8 grid and the corresponding MIC score is 0.24, which leads to a nontrivial MIC score for independent paired variables under finite samples [40]. Recently, Chen et al [40] presented the ChiMIC algorithm, which can control the excessive grid partitions of the ApproxMaxMI algorithm.…”

Section: Datasets and Methods (mentioning)

confidence: 99%
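
The grid-size arithmetic in the statement above can be checked with a short sketch. This is only an illustration of the quoted numbers, not the ApproxMaxMI implementation: the function names are hypothetical, and rounding n^0.6 to the nearest integer is an assumption chosen to match the quoted "≈ 16" (an implementation may truncate instead).

```python
def grid_budget(n, alpha=0.6):
    """Maximal number of grid cells for n samples, B(n) = n**alpha.

    Rounding to the nearest integer is an assumption made to match the
    quoted value (100**0.6 is about 15.85, quoted as 16).
    """
    return round(n ** alpha)


def admissible_grids(n, alpha=0.6):
    """List every (rows, columns) grid, each side >= 2, with rows * columns <= B(n)."""
    budget = grid_budget(n, alpha)
    largest_side = budget // 2  # the other side is at least 2
    return [
        (rows, cols)
        for rows in range(2, largest_side + 1)
        for cols in range(2, largest_side + 1)
        if rows * cols <= budget
    ]


if __name__ == "__main__":
    n = 100
    print("budget:", grid_budget(n))
    print("2 x 8 admissible:", (2, 8) in admissible_grids(n))
    print("2 x 9 admissible:", (2, 9) in admissible_grids(n))
```

Under this assumption a 2 × 8 grid uses the whole budget of 16 cells for n = 100, which is the partition the statement says ApproxMaxMI tends to reach even for independent variables.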

“…However, the ApproxMaxMI algorithm tends to fall into the maximal grid size (100^0.6 ≈ 16), the corresponding partition is a 2 × 8 grid and the corresponding MIC score is 0.24, which leads to a nontrivial MIC score for independent paired variables under finite samples [40]. Recently, Chen et al [40] presented the ChiMIC algorithm, which can control the excessive grid partitions of the ApproxMaxMI algorithm. Removing the maximal grid size limitation in ApproxMaxMI, ChiMIC uses a chi-square test based on a local r × 2 grid to determine whether the new endpoint should be introduced.…”

Section: Datasets and Methods (mentioning)

confidence: 99%
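
The idea of testing a local r × 2 grid with a chi-square test can be sketched as follows. This is a generic Pearson chi-square test of independence on an r × 2 table of counts, not the ChiMIC code itself: the function names, the 0.05 significance level, and the accept/reject rule are assumptions for illustration, since the statement above does not specify them.

```python
import math


def chi2_sf(x, dof):
    """Upper-tail probability of a chi-square variable with integer dof."""
    z = x / 2.0
    if dof % 2 == 0:
        a, q = 1.0, math.exp(-z)             # Q(1, z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))  # Q(1/2, z)
    # Recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
    while a < dof / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


def chi_square_r_by_2(table):
    """Pearson chi-square statistic and p-value for an r x 2 table of counts.

    Assumes every row total and both column totals are non-zero.
    """
    row_totals = [left + right for left, right in table]
    col_totals = [sum(left for left, _ in table), sum(right for _, right in table)]
    total = sum(row_totals)
    stat = 0.0
    for counts, row_total in zip(table, row_totals):
        for observed, col_total in zip(counts, col_totals):
            expected = row_total * col_total / total
            stat += (observed - expected) ** 2 / expected
    return stat, chi2_sf(stat, len(table) - 1)


def accept_new_endpoint(table, alpha=0.05):
    """Hypothetical rule: keep the split only if the two columns differ significantly."""
    _, p_value = chi_square_r_by_2(table)
    return p_value < alpha


if __name__ == "__main__":
    print(accept_new_endpoint([[10, 10], [10, 10]]))  # identical columns
    print(accept_new_endpoint([[20, 0], [0, 20]]))    # strongly different columns
```

In this sketch the two columns stand for the cells on either side of a candidate endpoint and the r rows for the partition of the other variable; a split that leaves both columns with the same row distribution adds no information and would be rejected.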

A high-performance approach for predicting donor splice sites based on short window size and imbalanced large samples

Zeng, Yuan, et al. 2019

Biol Direct

Self Cite

Background: Splice site prediction has been a long-standing problem in bioinformatics. Although many computational approaches developed for splice site prediction have achieved satisfactory accuracy, further improvement in predictive accuracy remains valuable because it contributes to more accurate prediction of gene structure. Determining a proper window size before prediction is necessary. An overly long window size may introduce irrelevant features, which would reduce predictive accuracy, while a short window size with maximum information may perform better in terms of predictive accuracy and time cost. Furthermore, the number of false splice sites following the GT–AG rule far exceeds that of true splice sites, so accurate and rapid prediction of splice sites using imbalanced large samples has always been a challenge. Therefore, based on the short window size and imbalanced large samples, we developed a new computational method named chi-square decision table (χ²-DT) for donor splice site prediction.

Results: Using a short window size of 11 bp, χ²-DT extracts the improved positional features and compositional features based on chi-square test, then introduces features one by one based on information gain, and constructs a balanced decision table aimed at implementing imbalanced pattern classification. With a 2000:271,132 (true sites:false sites) training set, χ²-DT achieves the highest independent test accuracy (93.34%) when compared with three classifiers (random forest, artificial neural network, and relaxed variable kernel density estimator) and takes a short computation time (89 s). χ²-DT also exhibits good independent test accuracy (92.40%) when validated with BG-570 mutated sequences with frameshift errors (nucleotide insertions and deletions). Moreover, χ²-DT is compared with the long-window size-based methods and the short-window size-based methods, and is found to perform better than all of them in terms of predictive accuracy.

Conclusions: Based on short window size and imbalanced large samples, the proposed method not only achieves higher predictive accuracy than some existing methods, but also has high computational speed and good robustness against nucleotide insertions and deletions.

Reviewers: This article was reviewed by Ryan McGinty, Ph.D. and Dirk Walther.

Electronic supplementary material: The online version of this article (10.1186/s13062-019-0236-y) contains supplementary material, which is available to authorized users.

“…Although MIC has gained considerable attention (Nature, 2012; Speed, 2011; Zhang et al, 2014), there were also several discussions about some of its properties (N. Simon, 2011; Kinney and Atwal, 2014; Reshef et al, 2014). One of the main issues resides in the computational cost of MIC's original implementation: a dynamic programming algorithm called ApproxMaxMI that several studies in the literature tried to optimize (Albanese et al, 2013; Zhang et al, 2014; Tang et al, 2014; Chen et al, 2016). Apart from these issues, all the mentioned methods need categorical data to be converted to numerical in order to be applied, which cannot be done in many cases with non-ordinal variables.…”

Section: Introduction (mentioning)

confidence: 99%

Clustermatch: discovering hidden relations in highly diverse kinds of qualitative and quantitative data without standardization

et al. 2018

Motivation: Heterogeneous and voluminous data sources are common in modern datasets, particularly in systems biology studies. For instance, in multi-holistic approaches in the fruit biology field, data sources can include a mix of measurements such as morpho-agronomic traits, different kinds of molecules (nucleic acids and metabolites) and consumer preferences. These sources not only have different types of data (quantitative and qualitative), but also large numbers of variables with possibly non-linear relationships among them. An integrative analysis is usually hard to conduct, since it requires several manual standardization steps, with a direct and critical impact on the results obtained. These are important issues in clustering applications, which highlight the need for new methods for uncovering complex relationships in such diverse repositories. Results: We designed a new method named Clustermatch to easily and efficiently perform data mining tasks on large and highly heterogeneous datasets. Our approach can derive a similarity measure between any quantitative or qualitative variables by looking at how they influence the clustering of the biological materials under study. Comparisons with other methods in both simulated and real datasets show that Clustermatch is better suited for finding meaningful relationships in complex datasets.
