Statistical issues in the comparison of quantitative imaging biomarker algorithms using pulmonary nodule volume as an example

Obuchowski, Nancy A.; Barnhart, Huiman X.; Buckler, Andrew J.; Pennello, Gene; Wang, Xiaofeng; Kalpathy-Cramer, Jayashree; Kim, Hyun J. Grace; Reeves, Anthony P.

doi:10.1177/0962280214537392

Cited by 57 publications

(70 citation statements)

References 22 publications


“…However, when normality is assumed, the CCC equals the ICC [34,36,37]. Although we report the CCC here, this measure does suffer from notable deficiencies common to many of these correlation measures [39][40][41] in that they are very sensitive to sample heterogeneity and that they are aggregate measures (thus making it difficult to separate systematic bias from issues in precision or large random errors). Thus, it would not be valid to compare our CCC measures to those measured on a different set of nodules with a different range of volumes.…”

Section: Precision (mentioning)

confidence: 97%


A Comparison of Lung Nodule Segmentation Algorithms: Methods and Results from a Multi-institutional Study

et al. 2016

Self Cite


Tumor volume estimation, as well as accurate and reproducible borders segmentation in medical images, are important in the diagnosis, staging, and assessment of response to cancer therapy. The goal of this study was to demonstrate the feasibility of a multi-institutional effort to assess the repeatability and reproducibility of nodule borders and volume estimate bias of computerized segmentation algorithms in CT images of lung cancer, and to provide results from such a study. The dataset used for this evaluation consisted of 52 tumors in 41 CT volumes (40 patient datasets and 1 dataset containing scans of 12 phantom nodules of known volume) from five collections available in The Cancer Imaging Archive. Three academic institutions developing lung nodule segmentation algorithms submitted results for three repeat runs for each of the nodules. We compared the performance of lung nodule segmentation algorithms by assessing several measurements of spatial overlap and volume measurement. Nodule sizes varied from 29 μl to 66 ml and demonstrated a diversity of shapes. Agreement in spatial overlap of segmentations was significantly higher for multiple runs of the same algorithm than between segmentations generated by different algorithms (p < 0.05) and was significantly higher on the phantom dataset compared to the other datasets (p < 0.05). Algorithms differed significantly in the bias of the measured volumes of the phantom nodules (p < 0.05) underscoring the need for assessing performance on clinical data in addition to phantoms. Algorithms that most accurately estimated nodule volumes were not the most repeatable, emphasizing the need to evaluate both their accuracy and precision. There were considerable differences between algorithms, especially in a subset of heterogeneous nodules, underscoring the recommendation that the same software be used at all time points in longitudinal studies.
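The first citation statement above notes that the CCC is very sensitive to sample heterogeneity, so values computed on nodule sets with different volume ranges are not comparable. The sketch below illustrates that point under stated assumptions: it uses Lin's standard definition of the concordance correlation coefficient, and the volume ranges, noise level, and the helper names `concordance_correlation` and `simulate_ccc` are hypothetical choices for illustration, not values or code from the cited studies.

```python
import numpy as np


def concordance_correlation(x, y):
    """Lin's concordance correlation coefficient using population (1/n) moments."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mean_x, mean_y = x.mean(), y.mean()
    covariance = np.mean((x - mean_x) * (y - mean_y))
    return 2.0 * covariance / (x.var() + y.var() + (mean_x - mean_y) ** 2)


def simulate_ccc(true_volumes, noise_sd, rng):
    """Two hypothetical algorithms measure the same nodules with independent errors."""
    a = true_volumes + rng.normal(0.0, noise_sd, true_volumes.size)
    b = true_volumes + rng.normal(0.0, noise_sd, true_volumes.size)
    return concordance_correlation(a, b)


rng = np.random.default_rng(0)
noise_sd = 0.3  # assumed measurement error (mL), identical in both samples

# Assumed volume ranges (mL): a homogeneous sample and a heterogeneous one.
narrow_sample = rng.uniform(1.0, 2.0, size=200)
wide_sample = rng.uniform(0.1, 60.0, size=200)

ccc_narrow = simulate_ccc(narrow_sample, noise_sd, rng)
ccc_wide = simulate_ccc(wide_sample, noise_sd, rng)

print(f"CCC, narrow volume range: {ccc_narrow:.3f}")
print(f"CCC, wide volume range:   {ccc_wide:.3f}")
```

Both samples have the same measurement error, yet the heterogeneous sample yields a much higher CCC because the between-nodule variance dominates the denominator. This is why the statement cautions against comparing CCC values across nodule sets.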


“…Table 2 summarizes the metrics used in this study. For more details, readers are referred to a series of papers published on statistical methods for quantitative imaging biomarkers [31][32][33][41].…”

Section: Spatial Overlap (mentioning)

confidence: 99%

A Comparison of Lung Nodule Segmentation Algorithms: Methods and Results from a Multi-institutional Study

et al. 2016

Self Cite


“…More detail is provided by Raunig et al (4). Investigators often want to compare the technical performance of two or more competing imaging procedures to assess the typical performance of the procedures, to identify the best procedure, to test the noninferiority of a procedure relative to a standard procedure, or to identify procedures that provide similar measurements (6). Table 7 summarizes some common research questions asked in QIB procedure comparison studies and possible study designs used with each.…”

Section: Metric Comment (mentioning)

confidence: 99%

“…Comparison of intraclass correlation coefficients estimated from groups of subjects sampled from different populations can be misleading because intraclass correlation coefficients are scaled relative to the subjects in the study sample; thus, comparisons based on different populations can be invalid (5,6). Within-subject coefficient of variance: The within-subject coefficient of variance is the standard deviation of the replicate measures (within-subject standard deviation) divided by the mean.…”

Section: Intraclass Correlation Coefficient (mentioning)

confidence: 99%
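The statement above defines the within-subject measure as the standard deviation of the replicate measures divided by the mean. A minimal sketch of one common way to estimate it follows. The estimator used here is an assumption, since the statement does not specify one: it pools the per-subject sample variances and divides the resulting standard deviation by the grand mean. The function name `within_subject_cv` and the replicate volumes are hypothetical.

```python
import numpy as np


def within_subject_cv(replicates):
    """Return (wSD, wCV) for a 2-D array: one row per subject, one column per replicate.

    wSD is the square root of the mean per-subject sample variance;
    wCV is wSD divided by the grand mean of all measurements.
    """
    data = np.asarray(replicates, dtype=float)
    per_subject_variance = data.var(axis=1, ddof=1)  # unbiased variance per subject
    wsd = np.sqrt(per_subject_variance.mean())
    return wsd, wsd / data.mean()


# Hypothetical replicate volume measurements (mL) for three nodules.
volumes = np.array([
    [9.0, 11.0],
    [19.0, 21.0],
    [29.0, 31.0],
])

wsd, wcv = within_subject_cv(volumes)
print(f"within-subject SD: {wsd:.4f} mL")
print(f"within-subject CV: {wcv:.2%}")
```

Unlike the intraclass correlation coefficient, the within-subject standard deviation does not depend on the between-subject variance, so it is less affected by how heterogeneous the study sample is.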

“…Otherwise, comparisons of the measurement values with the reference values will produce significantly different results from comparisons with the true values. Simulation studies from Obuchowski et al (6) indicate that the intraclass correlation coefficient that reflects the concordance between the reference values and the … but sources for laboratory assays recommend at least five to seven levels (17). In phantom studies, these should be appropriately spaced over the measuring interval to adequately characterize the bias.…”

Section: Bias (mentioning)

confidence: 99%


Metrology Standards for Quantitative Imaging Biomarkers

et al. 2015


The members of the RSNA-QIBA Metrology Working Group are listed in the Acknowledgments. Although investigators in the imaging community have been active in developing and evaluating quantitative imaging biomarkers (QIBs), the development and implementation of QIBs have been hampered by the inconsistent or incorrect use of terminology or methods for technical performance and statistical concepts. Technical performance is an assessment of how a test performs in reference objects or subjects under controlled conditions. In this article, some of the relevant statistical concepts are reviewed, methods that can be used for evaluating and comparing QIBs are described, and some of the technical performance issues related to imaging biomarkers are discussed. More consistent and correct use of terminology and study design principles will improve clinical research, advance regulatory science, and foster better care for patients who undergo imaging studies. © RSNA, 2015


A Review on Assessing Agreement

Barnhart

2018

Wiley StatsRef: Statistics Reference Online


Measurements serve as the basis for evaluation in almost all scientific disciplines, especially in physical sciences, medical studies, and health care. Issues related to reliable and accurate measurement have evolved over many decades. Requiring a measurement to be identical to the truth is sometimes impractical or impossible either because (i) the truth is simply not available or is measured with some error or (ii) some tolerable error is acceptable. Concepts of agreement, including reproducibility or reliability, are often used to determine whether the measurements can be used or not for evaluation. There has been substantial statistical literature in the last several decades on assessing agreement. This article provides a critical and comprehensive review on agreement concepts and their corresponding agreement indices developed for assessing agreement among measurements made on the same subject or experimental unit. The emphasis is on the intuitive understanding of concepts and on insights into both controversies and appropriate applications for continuous and categorical data. Four examples with either continuous or categorical measurements are used for illustration and discussion.

