Associations between high-dimensional datasets, each comprising many features, can be discovered through multivariate statistical methods such as Canonical Correlation Analysis (CCA) or Partial Least Squares (PLS). CCA and PLS are widely used methods that reveal which features carry the association. Despite the longevity and popularity of CCA/PLS approaches, their application to high-dimensional datasets raises critical questions about the reliability of their solutions. In particular, overfitting can produce solutions that are not stable across datasets, which severely hinders their interpretability and generalizability. To study these issues, we developed a generative model to simulate synthetic datasets with multivariate associations, parameterized by feature dimensionality, data variance structure, and assumed latent association strength. We found that the resulting CCA/PLS associations could be highly inaccurate when the number of samples per feature is relatively small. For PLS, the profiles of feature weights exhibited a detrimental bias toward leading principal component axes. We confirmed these model trends in state-of-the-art datasets containing neuroimaging and behavioral measurements in large numbers of subjects, namely the Human Connectome Project (n ≈ 1000) and UK Biobank (n = 20000), and found that only the latter comprised enough samples to obtain stable estimates. An analysis of the neuroimaging literature using CCA to map brain-behavior relationships revealed that commonly employed sample sizes yield unstable CCA solutions. Our generative modeling framework provides a calculator of the dataset properties required for stable estimates. Collectively, our study characterizes the dataset properties needed to limit the potentially detrimental effects of overfitting on the stability of CCA/PLS solutions, and provides practical recommendations for future studies.

Significance Statement

Scientific studies often begin with an observed association between different types of measures. When datasets comprise large numbers of features, multivariate approaches such as canonical correlation analysis (CCA) and partial least squares (PLS) are often used. These methods can reveal the profiles of features that carry the optimal association. We developed a generative model to simulate data and characterized how the obtained feature profiles can be unstable, which hinders interpretability and generalizability, unless a sufficient number of samples is available to estimate them. We determined sufficient sample sizes as a function of dataset properties. We also show that these issues arise in neuroimaging studies of brain-behavior relationships. We provide practical guidelines and computational tools for future CCA and PLS studies.
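
As a rough illustration of the kind of simulation framework described above, the following sketch shows how the stability of estimated CCA feature weights can depend on sample size. It assumes Python with NumPy and scikit-learn's CCA; the single-latent-variable model, parameter names, and default values used here are simplified placeholders for illustration only, not the paper's full generative model.

import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)

def simulate(n, px=50, py=50, r_true=0.3, decay=1.0):
    # One shared latent variable with between-set correlation r_true
    # (hypothetical, simplified parameterization).
    z = rng.standard_normal(n)
    z_y = r_true * z + np.sqrt(1.0 - r_true**2) * rng.standard_normal(n)
    wx = rng.standard_normal(px)
    wx /= np.linalg.norm(wx)  # true signal profile in X
    wy = rng.standard_normal(py)
    wy /= np.linalg.norm(wy)  # true signal profile in Y
    sx = np.arange(1, px + 1) ** -decay  # decaying within-set variance spectrum
    sy = np.arange(1, py + 1) ** -decay
    X = np.outer(z, wx) + rng.standard_normal((n, px)) * sx
    Y = np.outer(z_y, wy) + rng.standard_normal((n, py)) * sy
    return X, Y, wx

def weight_similarity(n, reps=20):
    # Average |cosine| between the estimated first CCA weight vector for X
    # and the true signal profile, across independently simulated datasets.
    sims = []
    for _ in range(reps):
        X, Y, wx = simulate(n)
        w_hat = CCA(n_components=1).fit(X, Y).x_weights_[:, 0]
        sims.append(abs(w_hat @ wx) / np.linalg.norm(w_hat))
    return float(np.mean(sims))

for n in (100, 1000, 10000):
    print(f"n = {n:5d}: mean weight similarity ~ {weight_similarity(n):.2f}")

In this toy setting, the similarity between estimated and true weight profiles typically increases with the number of samples per feature, mirroring the sample-size dependence characterized in the study.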