We consider how to combine several independent studies of the same diagnostic test, where each study reports an estimated false positive rate (FPR) and an estimated true positive rate (TPR). We propose constructing a summary receiver operating characteristic (ROC) curve by the following steps. (i) Convert each FPR to its logistic transform U and each TPR to its logistic transform V after increasing each observed frequency by adding 1/2. (ii) For each study calculate D = V - U, which is the log odds ratio of TPR and FPR, and S = V + U, an implied function of the test threshold; then plot each study's point (S_i, D_i). (iii) Fit a robust-resistant regression line to these points (or an equally weighted least-squares regression line), with D = V - U as the dependent variable. (iv) Back-transform the fitted line to ROC space. To avoid model-dependent extrapolation from irrelevant regions of ROC space, we propose defining a priori a value of FPR so large that the test simply would not be used at that FPR, and a value of TPR so low that the test would not be used at that TPR. Then (a) only data points lying in the north-west rectangle of the unit square thus defined are used in the analysis, and (b) the estimated summary ROC curve is depicted only within that subregion of the unit square. We illustrate the methods using simulated and real data sets, and we point to ways of comparing different tests and of taking the effects of covariates into account.
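The following is a minimal sketch of steps (i)-(iv) in Python/NumPy. The four 2x2 tables (TP, FP, FN, TN counts) are invented for illustration; the sketch uses the equally weighted least-squares option rather than a robust-resistant fit, and the plotted FPR range is restricted by an assumed a priori cutoff in the spirit of the north-west rectangle.

```python
import numpy as np

# Hypothetical 2x2 counts per study: (TP, FP, FN, TN); invented for illustration.
studies = np.array([
    [45,  8, 15, 82],
    [60, 20, 10, 70],
    [30,  5, 25, 95],
    [75, 30,  5, 60],
])
tp, fp, fn, tn = studies.T

# Step (i): add 1/2 to each observed frequency, then take logits.
tpr = (tp + 0.5) / (tp + fn + 1.0)   # true positive rate per study
fpr = (fp + 0.5) / (fp + tn + 1.0)   # false positive rate per study
V = np.log(tpr / (1 - tpr))          # logit(TPR)
U = np.log(fpr / (1 - fpr))          # logit(FPR)

# Step (ii): D is the log odds ratio; S is an implied function of threshold.
D, S = V - U, V + U

# Step (iii): equally weighted least-squares line D = a + b*S
# (the robust-resistant alternative is not shown here).
b, a = np.polyfit(S, D, 1)

# Step (iv): back-transform to ROC space. Substituting D = V - U and
# S = V + U into D = a + b*S gives V = a/(1-b) + U*(1+b)/(1-b).
fpr_grid = np.linspace(0.01, 0.5, 100)   # assumed a priori maximum FPR of 0.5
u_grid = np.log(fpr_grid / (1 - fpr_grid))
v_grid = a / (1 - b) + u_grid * (1 + b) / (1 - b)
sroc_tpr = 1 / (1 + np.exp(-v_grid))     # summary ROC curve over fpr_grid

print(f"intercept a = {a:.3f}, slope b = {b:.3f}")
```

Note that when b = 0 the back-transformed curve reduces to a symmetric ROC curve with constant log odds ratio a, which is one reason the slope is of interest when comparing tests.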
Reports of diagnostic accuracy often differ. The authors present a method for summarizing disparate reports that uses a logistic transformation and linear regression to produce a summary receiver operating characteristic curve. The curve is useful for summarizing a body of diagnostic accuracy literature, comparing technologies, detecting outliers, and finding the optimum operating point of the test. Examples from clinical chemistry and diagnostic radiology are provided. By extending the logic of meta-analysis to diagnostic testing, the method provides a new tool for technology assessment.
Clinical investigations often involve data in the form of ordered categories (e.g., "worse," "unchanged," "improved," "much improved"). Comparison of two groups when the data are of this kind should not be done by the chi-square test, which wastes information and is insensitive in this context. The Wilcoxon-Mann-Whitney test provides a proper analysis. Alternatively, scores may be assigned to the categories in order, and the t-test applied. We demonstrate both approaches here. Sometimes data in ordered categories are reduced to a two-by-two table by collapsing the high categories into one category and the low categories into another. This practice is inefficient; moreover, it entails avoidable subjectivity in the choice of the cutting point that defines the two super-categories. The Wilcoxon-Mann-Whitney procedure (or the t-test applied to ordered scores) is preferable. A survey of research articles in Volume 306 of the New England Journal of Medicine shows many instances of ordered-category data (about 20 per cent of the articles had such data) and no instance of analysis by the preferred methods presented here. We suggest that investigators who are unfamiliar with these methods should seek the assistance of a professional statistician when they must deal with such data.
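Both recommended analyses are illustrated below in a short Python sketch, assuming SciPy is available; the per-category counts and the 1-4 integer scoring are invented for illustration and are not from the survey described above.

```python
# Sketch: comparing two groups on ordered-category outcomes
# ("worse" < "unchanged" < "improved" < "much improved").
import numpy as np
from scipy import stats

# An assumed 1..4 coding of the ordered categories.
scores = {"worse": 1, "unchanged": 2, "improved": 3, "much improved": 4}

# Hypothetical per-category counts for two groups.
treatment_counts = {"worse": 3, "unchanged": 10, "improved": 18, "much improved": 9}
control_counts   = {"worse": 8, "unchanged": 15, "improved": 12, "much improved": 5}

def expand(counts):
    """Turn {category: count} into a flat array of numeric scores."""
    return np.repeat([scores[c] for c in counts], [counts[c] for c in counts])

x, y = expand(treatment_counts), expand(control_counts)

# Wilcoxon-Mann-Whitney test; SciPy's normal approximation handles
# the heavy ties that ordered-category data produce.
u_stat, p_wmw = stats.mannwhitneyu(x, y, alternative="two-sided")

# The alternative analysis: a t-test on the assigned ordered scores.
t_stat, p_t = stats.ttest_ind(x, y)

print(f"Wilcoxon-Mann-Whitney: U = {u_stat:.1f}, p = {p_wmw:.4f}")
print(f"t-test on scores:      t = {t_stat:.2f}, p = {p_t:.4f}")
```

Either analysis uses the full ordering of the categories, in contrast to a chi-square test on the collapsed two-by-two table, which discards that information.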