Data collection from online platforms such as Amazon's Mechanical Turk (MTurk) has become popular in clinical research. However, concerns remain about the representativeness and quality of these data for clinical studies. The present work explores these issues in the specific case of major depression. Analyses of two large data sets gathered from MTurk (Sample 1: N = 2,692; Sample 2: N = 2,354) revealed two major findings. First, failing to screen out inattentive and fake respondents artificially and significantly inflates the rates of major depression (by 18.5%–27.5%). Second, even after cleaning the data sets, depression rates in MTurk remain 1.6 to 3.6 times higher than general-population estimates. Approximately half of this difference can be attributed to differences in composition between MTurk samples and the general population (i.e., sociodemographics, health, and physical-activity lifestyle). Several explanations for the other half are proposed, and practical data-quality tools are provided.
Clinical psychological research often requires individuals with specific characteristics. The Internet enables broad recruitment, including of rare groups such as people with specific psychological disorders. However, Internet-based research relies on participant self-report to determine eligibility, so data quality depends on participant honesty. For rare groups, even low levels of participant dishonesty can yield a substantial proportion of fraudulent survey responses, and every study will include careless respondents who do not pay attention to questions, do not understand them, or provide intentionally wrong responses. Poor-quality responses should be treated as categorically different from high-quality responses: including them leads to overestimated prevalence of rare groups and incorrect estimates of scale reliability, means, and correlations between constructs. We demonstrate that, for these reasons, including poor-quality responses (which are usually positively skewed) produces several data-quality problems, including spurious associations between measures. We provide recommendations for detecting fraudulent participants and excluding them from self-report research studies.
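The screening logic recommended above can be sketched minimally. The records and field names below are illustrative assumptions, not data or variables from the original studies; the point is only that a prevalence estimate shifts once respondents who fail embedded attention checks are excluded:

```python
# Hypothetical respondent records (toy data, NOT from the studies above).
# "attention_pass" marks respondents who answered embedded attention-check
# items correctly; "screen_positive" marks a positive depression screen.
respondents = [
    {"attention_pass": True,  "screen_positive": True},
    {"attention_pass": True,  "screen_positive": False},
    {"attention_pass": False, "screen_positive": True},
    {"attention_pass": True,  "screen_positive": False},
    {"attention_pass": False, "screen_positive": True},
    {"attention_pass": True,  "screen_positive": False},
    {"attention_pass": True,  "screen_positive": True},
    {"attention_pass": False, "screen_positive": True},
]

def prevalence(rows):
    """Share of rows with a positive screen."""
    return sum(r["screen_positive"] for r in rows) / len(rows)

raw_rate = prevalence(respondents)                      # all respondents: 5/8
clean_rate = prevalence(
    [r for r in respondents if r["attention_pass"]]     # screened only: 2/5
)

print(f"Unscreened prevalence: {raw_rate:.1%}")   # 62.5%
print(f"Screened prevalence:   {clean_rate:.1%}") # 40.0%
```

In this toy example the unscreened estimate is inflated because careless and fraudulent respondents disproportionately endorse symptom items, which is the mechanism both abstracts describe.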
Detection of suicide risk is a highly prioritized, yet complicated task. Five decades of research have produced predictions only slightly better than chance (AUCs = 0.56–0.58). In this study, Artificial Neural Network (ANN) models were constructed to predict suicide risk from the everyday language of social media users. The dataset included 83,292 postings authored by 1,002 authenticated Facebook users, alongside valid psychosocial information about the users. Using Deep Contextualized Word Embeddings for text representation, two models were constructed: a Single Task Model (STM), to predict suicide risk from Facebook postings directly (Facebook texts → suicide), and a Multi-Task Model (MTM), which included hierarchical, multilayered sets of theory-driven risk factors (Facebook texts → personality traits → psychosocial risks → psychiatric disorders → suicide). Compared with the STM predictions (0.621 ≤ AUC ≤ 0.629), the MTM produced significantly improved prediction accuracy (0.697 ≤ AUC ≤ 0.746), with substantially larger effect sizes (0.729 ≤ d ≤ 0.936). Subsequent content analyses suggested that predictions did not rely on explicit suicide-related themes but on a range of text features. The findings suggest that machine-learning-based analyses of everyday social media activity can improve suicide risk predictions and contribute to the development of practical detection tools.
Background: Detection of suicide risk is a highly prioritized, yet complicated task. In fact, five decades of suicide research produced predictions that were only marginally better than chance (AUCs = 0.56–0.58). Advanced machine learning methods open up new opportunities for progress in mental health research. In the present study, Artificial Neural Network (ANN) models were constructed to predict externally valid suicide risk from the everyday language of social media users. Method: The dataset included 83,292 postings authored by 1,002 authenticated, active Facebook users, alongside clinically valid psychosocial information about the users. Results: Using Deep Contextualized Word Embeddings (CWEs) for text representation, two models were constructed: a Single Task Model (STM), to predict suicide risk from Facebook postings directly (Facebook texts → suicide), and a Multi-Task Model (MTM), which included hierarchical, multilayered sets of theory-driven risk factors (Facebook texts → personality traits → psychosocial risks → psychiatric disorders → suicide). Compared with the STM predictions (.606 ≤ AUC ≤ .608), the MTM produced improved prediction accuracy (.690 ≤ AUC ≤ .759), with substantially larger effect sizes (.701 ≤ d ≤ .994). Subsequent content analyses suggest that predictions did not rely on explicit suicide-related themes but on a wide range of content. Conclusions: Advanced machine learning methods can improve our ability to predict suicide risk from everyday social media activity. The knowledge generated by this research may eventually lead to the development of more accurate and objective detection tools that help at-risk individuals receive care in time.
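For readers unfamiliar with the AUC metric that both abstracts report: it equals the probability that a randomly chosen at-risk individual receives a higher model score than a randomly chosen not-at-risk individual, with ties counted as half. A minimal pure-Python illustration with made-up toy scores (not values from the study):

```python
def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative,
    ties counted as 0.5 (equivalent to the area under the ROC curve)."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))

# Toy model outputs: higher score = higher predicted risk.
at_risk = [0.9, 0.7, 0.6]       # hypothetical scores for at-risk users
not_at_risk = [0.8, 0.4, 0.3]   # hypothetical scores for other users

print(auc(at_risk, not_at_risk))  # 7/9 ≈ 0.778
```

On this scale, 0.5 is chance, the historical baseline cited above sits at 0.56–0.58, and the MTM results reach roughly 0.69–0.76, which is why the improvement is described as substantial.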