There is a hierarchy of predictive value that can be extracted from data. At the top of the hierarchy are causal relationships confirmed by a randomized controlled experiment or a natural experiment. Next best is to specify known or hypothesized relationships ahead of time, then test them and estimate their relative importance. One notch lower are associations found in historical data that are vetted for plausibility and then tested on fresh data. At the bottom of the hierarchy, with little or no value, are associations found in historical data that are neither confirmed by expert opinion nor tested with fresh data. Data scientists who take a "correlations are enough" approach should remember that the more data and the more searches, the more likely it is that a discovered statistical relationship is coincidental and useless.
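A quick simulation makes that last point concrete. The sketch below is illustrative only: everything is random noise generated with numpy, so any correlation it finds is coincidental by construction. Searching an ever larger pool of random "predictors" for the one that best correlates with a random outcome, the best coincidental correlation grows steadily with the number of variables searched.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100  # observations per variable

# One random "outcome" and ever larger pools of random "predictors".
# Everything is pure noise, so any correlation found is coincidental.
y = rng.standard_normal(n)
for k in (10, 100, 1_000, 10_000):
    X = rng.standard_normal((k, n))
    corrs = np.array([np.corrcoef(x, y)[0, 1] for x in X])
    print(f"searched {k:>6} predictors -> best |r| = {np.abs(corrs).max():.2f}")
```

More searches reliably produce a more impressive-looking best correlation, even though none of the relationships is real.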
Researchers seeking fame and funding may be tempted to go on fishing expeditions (p-hacking) or to torture the data until they yield novel, provocative results that will be picked up by the popular media. Provocative findings are provocative because they are novel and unexpected, and they are often novel and unexpected because they are simply not true. Publication bias (the file drawer effect) keeps the failures hidden and has helped create a replication crisis. Research that gets reported in the popular media is often wrong, which fools people and undermines the credibility of scientific research.
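The mechanics of a fishing expedition are equally easy to simulate. In the hypothetical sketch below, "treatment" and "control" groups are drawn from the same distribution, so every null hypothesis is true; run enough tests and conventional significance thresholds still hand back dozens of publishable "discoveries".

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n_tests, n = 1_000, 50

# Both groups are drawn from the same distribution, so every null
# hypothesis is true and every rejection is a false positive.
false_positives = 0
for _ in range(n_tests):
    treatment = rng.standard_normal(n)
    control = rng.standard_normal(n)
    _, p = ttest_ind(treatment, control)
    if p < 0.05:
        false_positives += 1

print(f"{false_positives} of {n_tests} tests were 'significant' at p < 0.05")
# Expect roughly 50; report only those and hide the rest in the file drawer.
```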
Pattern recognition prowess served our ancestors well. Today, however, we are confronted by a deluge of data far more abstract, complicated, and difficult to interpret than annual seasons and the sounds of predators. The number of possible patterns that can be identified has grown exponentially relative to the number that are genuinely useful, which means that the chance that a discovered pattern is useful is rapidly approaching zero. Coincidental streaks, clusters, and correlations are the norm, not the exception. Computer algorithms can easily identify an essentially unlimited number of phantom patterns and relationships that vanish when confronted with fresh data. The paradox of big data is that the more data we ransack for patterns, the more likely it is that what we find will be worthless. Our challenge is to overcome our inherited inclination to think that all patterns are meaningful.
Data are undeniably useful for answering many interesting and important questions, but data alone are not enough. Data without theory have been the source of a large (and growing) number of data miscues, missteps, and mishaps. We should resist the temptation to believe that data can answer all questions and that more data means more reliable answers. Data can have errors and omissions or be irrelevant. In addition, patterns discovered in the past will vanish in the future unless there is an underlying reason for the pattern. Backtesting models in the stock market is particularly pernicious because it is so easy to find coincidental patterns that turn out to be expensive mistakes. This endemic problem has now spread far and wide because vast amounts of data are available to academic, business, and government researchers, ready to be mined for phantom patterns.
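The backtesting trap is easy to reproduce. In the hypothetical sketch below, daily returns are pure noise, so no trading rule can have real predictive power; yet searching a few dozen momentum rules always turns up an in-sample "winner" that collapses out of sample.

```python
import numpy as np

rng = np.random.default_rng(2)

# Daily returns of a hypothetical stock with no predictable structure:
# by construction, there is nothing for a trading rule to exploit.
returns = rng.normal(0.0, 0.01, size=2_000)
train, test = returns[:1_000], returns[1_000:]

def strategy_return(rets, lookback):
    """Total return of a momentum rule: be long the day after the
    trailing `lookback`-day mean return is positive."""
    means = np.convolve(rets, np.ones(lookback) / lookback, mode="valid")
    signal = means[:-1] > 0            # decision uses only past returns
    return rets[lookback:][signal].sum()

# "Backtest" dozens of lookbacks and keep the in-sample winner
best = max(range(2, 60), key=lambda L: strategy_return(train, L))
print(f"best lookback in sample : {best}")
print(f"in-sample total return  : {strategy_return(train, best):+.3f}")
print(f"out-of-sample return    : {strategy_return(test, best):+.3f}")
# The in-sample "winner" was selected from many coincidences; out of
# sample it does about as well as a coin flip.
```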
Attempts to replicate reported studies often fail because the research relied on data mining: searching through data for patterns without any pre-specified, coherent theories. The perils of data mining are exacerbated by data torturing: slicing, dicing, and otherwise mangling data to create patterns. If there is no underlying reason for a pattern, it is likely to disappear when someone attempts to replicate the study. Big data and powerful computers are part of the problem, not the solution, because they make such phantom patterns so easy to find. If a researcher will benefit from a claim, the claim is likely to be biased. If a claim sounds implausible, it is probably misleading. If the statistical evidence sounds too good to be true, it probably is.
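A failed replication can likewise be simulated in a few lines. This sketch (pure noise again, numpy only) "discovers" the best of 5,000 candidate predictors in one sample and then re-measures it on fresh data, where the correlation collapses toward zero.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 200, 5_000

# Discovery sample: k candidate "predictors" of a random outcome, all noise
X = rng.standard_normal((k, n))
y = rng.standard_normal(n)

# Data mining: keep whichever predictor correlates best in this sample
corrs = np.array([np.corrcoef(x, y)[0, 1] for x in X])
best = int(np.abs(corrs).argmax())
print(f"discovery sample : |r| = {abs(corrs[best]):.2f}")

# Replication: the chosen predictor is noise, so fresh measurements of it
# against a fresh outcome show the relationship was never there
r_new = np.corrcoef(rng.standard_normal(n), rng.standard_normal(n))[0, 1]
print(f"fresh sample     : |r| = {abs(r_new):.2f}")  # collapses toward zero
```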