Collaboration between humans and machines does not necessarily lead to better outcomes.
Many important uses of AI involve augmenting humans, not replacing them. But there is not yet a widely used and broadly comparable test for evaluating the performance of these human-AI systems relative to humans alone, AI alone, or other baselines. Here we describe such a test and demonstrate its use in three ways. First, in an analysis of 79 recently published results, we find that, surprisingly, the median performance improvement ratio corresponds to no improvement at all, and the maximum improvement is only 36%. Second, we experimentally find a 27% performance improvement when 100 human programmers develop software using GPT-3, a modern, generative AI system. Finally, we find that 50 human non-programmers using GPT-3 perform the task about as well as –- and less expensively than –- the human programmers. Since neither the non-programmers nor the computer could perform the task alone, this illustrates a strong form of human-AI synergy.
The Turing test for comparing computer performance to that of humans is well known, but, surprisingly, there is no widely used test for comparing how much better human-computer systems perform relative to humans alone, computers alone, or other baselines. Here, we show how to perform such a test using the ratio of means as a measure of effect size. Then we demonstrate the use of this test in three ways. First, in an analysis of 79 recently published experimental results, we find that, surprisingly, over half of the studies find a decrease in performance, the mean and median ratios of performance improvement are both approximately 1 (corresponding to no improvement at all), and the maximum ratio is 1.36 (a 36% improvement). Second, we experimentally investigate whether a higher performance improvement ratio is obtained when 100 human programmers generate software using GPT-3, a massive, state-of-the-art AI system. In this case, we find a speed improvement ratio of 1.27 (a 27% improvement). Finally, we find that 50 human non-programmers using GPT-3 can perform the task about as well as-and less expensively than-the human programmers. In this case, neither the non-programmers nor the computer would have been able to perform the task alone, so this is an example of a very strong form of human-computer synergy. SIGNIFICANCE STATEMENTThe Turing test inspired generations of computer scientists to try to develop software as intelligent as humans. But we also need, not just more intelligent software, but more intelligent human-computer systems. The test proposed here can help scientifically evaluate and, thus, spur progress in developing such systems. For instance, contests like those that helped develop autonomous vehicles might, in this case, stimulate competition among teams of computer and social scientists developing humancomputer systems that perform far better than either people or computers alone. We believe this could benefit our economy and society by fostering the development of software to augment humans rather than just replace them. And in the long run, it might also lead to creating "superintelligent" human-computer systems.
Based on the theoretical findings from the existing literature, some policymakers and software engineers contend that algorithmic risk assessments such as the COMPAS software can alleviate the incarceration epidemic and the occurrence of violent crimes by informing and improving decisions about policing, treatment, and sentencing. Considered in tandem, these findings indicate that collaboration between humans and machines does not necessarily lead to better outcomes, and human supervision does not sufficiently address problems when algorithms err or demonstrate concerning biases. If machines are to improve outcomes in the criminal justice system and beyond, future research must further investigate their practical role: an input to human decision makers.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2025 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.