Applying machine learning algorithms to automatically infer relationships between concepts from large-scale collections of documents presents a unique opportunity to investigate at scale how human semantic knowledge is organized, how people use it to make fundamental judgments ("How similar are cats and bears?"), and how these judgments depend on the features that describe concepts (e.g., size, furriness). However, efforts to date have exhibited a substantial discrepancy between algorithm predictions and human empirical judgments. Here, we introduce a novel approach to generating embeddings for this purpose, motivated by the idea that semantic context plays a critical role in human judgment. We leverage this idea by constraining the topic or domain from which the documents used to generate embeddings are drawn (e.g., referring to the natural world vs. transportation apparatus). Specifically, we trained state-of-the-art machine learning algorithms on contextually constrained text corpora (domain-specific subsets of Wikipedia articles, 50+ million words each) and showed that this procedure greatly improved predictions of empirical similarity judgments and feature ratings for contextually relevant concepts. Furthermore, we describe a novel, computationally tractable method for improving the predictions of contextually unconstrained embedding models, based on dimensionality reduction of their internal representation to a small number of contextually relevant semantic features. By improving the correspondence between predictions derived automatically by machine learning methods from large-scale text corpora and direct empirical measurements of human judgments, this approach makes such models more useful for studying how human semantic knowledge is organized.
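As a rough illustration of the overall procedure described above (training embeddings on a domain-constrained corpus and comparing predicted similarities against human judgments), the following sketch uses gensim's Word2Vec; the corpus file, concept pairs, and ratings are hypothetical placeholders rather than the study's actual materials.

```python
# Hypothetical sketch: train embeddings on a domain-constrained corpus and
# correlate predicted similarities with human judgments. File names, concept
# pairs, and ratings below are placeholders, not the study's data.
from gensim.models import Word2Vec
from scipy.stats import spearmanr

# One tokenized sentence per line, drawn only from "natural world" articles.
with open("wikipedia_natural_world.txt") as f:
    sentences = [line.lower().split() for line in f]

model = Word2Vec(sentences, vector_size=300, window=5, min_count=10, workers=4)

# Hypothetical human similarity ratings for contextually relevant concept pairs.
human_ratings = {("cat", "bear"): 4.2, ("cat", "dog"): 6.1, ("bear", "wolf"): 5.3}

pairs = list(human_ratings)
predicted = [model.wv.similarity(a, b) for a, b in pairs]
observed = [human_ratings[p] for p in pairs]

rho, _ = spearmanr(predicted, observed)
print(f"Spearman correlation with human judgments: {rho:.2f}")
```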
Because modern smartphones are ubiquitous and equipped with a wide range of sensors, they pose a potential security risk: malicious actors could use these sensors to detect private information such as the keystrokes a user enters on a nearby keyboard. Existing studies have examined the ability of phones to predict typing on a nearby keyboard, but they are limited by the realism of the collected typing data and the expressiveness of the prediction models employed, and they are typically conducted in relatively noise-free environments. We investigate the capability of mobile phone sensor arrays (using audio and motion sensor data) to classify keystrokes that occur on a keyboard in proximity to phones placed around a table, as would be common in a meeting. We develop a system of mixed convolutional and recurrent neural networks and deploy it in a human subjects experiment with 20 users typing naturally while talking. Using leave-one-user-out cross-validation, we find that mobile phone arrays can detect 41.8% of keystrokes and 27% of typed words correctly in such a noisy environment, even without user-specific training. To investigate the potential threat of this attack, we further developed the machine learning models into a real-time system capable of discerning keystrokes from an array of mobile phones and evaluated the system's ability with a single user typing under varying conditions. We conclude that, in order to launch a successful attack, an attacker would need advance knowledge of the table at which a user types and the style of keyboard on which a user types. These constraints greatly limit the feasibility of such an attack to highly capable attackers, and we therefore judge the threat level of this attack to be low, but non-zero.
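The classifier described above can be sketched, in broad strokes, as a convolutional front end feeding a recurrent layer over windows of sensor data; the input shape, layer sizes, and number of key classes below are assumptions for illustration, not the architecture used in the study.

```python
# Illustrative sketch of a mixed convolutional/recurrent keystroke classifier
# over sensor windows (audio + motion). Shapes, layer sizes, and the number of
# key classes are assumptions, not the study's architecture.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_KEYS = 27        # assumed: 26 letters + space
WINDOW_LEN = 200     # assumed samples per sensor window
NUM_CHANNELS = 7     # assumed: 1 audio + 3 accelerometer + 3 gyroscope

model = models.Sequential([
    layers.Input(shape=(WINDOW_LEN, NUM_CHANNELS)),
    # Convolutional layers extract local temporal features from the raw signals.
    layers.Conv1D(64, kernel_size=5, activation="relu"),
    layers.MaxPooling1D(2),
    layers.Conv1D(128, kernel_size=5, activation="relu"),
    layers.MaxPooling1D(2),
    # Recurrent layer models the temporal structure of a keystroke event.
    layers.LSTM(128),
    layers.Dense(NUM_KEYS, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```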
Early detection of suicidal ideation in depressed individuals can allow for adequate medical attention and support, which in many cases is life-saving. Recent NLP research focuses on classifying, from a given piece of text, whether an individual is suicidal or clinically healthy. However, there have been no major attempts to differentiate between depression and suicidal ideation, which is an important clinical challenge. Due to the scarce availability of EHR data, suicide notes, and other similar verified sources, web query data has emerged as a promising alternative. Online sources such as Reddit allow for anonymity that prompts honest disclosure of symptoms, making them a plausible source even in a clinical setting. However, these online datasets also result in lower performance, which can be attributed to the inherent noise in web-scraped labels and necessitates a noise-removal process. Thus, we propose SDCNL, a deep learning method for suicide versus depression classification. We utilize online content from Reddit to train our algorithm, and to verify and correct noisy labels, we propose a novel unsupervised label correction method which, unlike previous work, does not require prior information about the noise distribution. Our extensive experimentation with multiple deep word embedding models and classifiers demonstrates the strong performance of our method on a new, challenging classification task. We make our code and dataset available at https://github.com/ayaanzhaque/SDCNL.
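The overall pipeline (embedding posts, correcting noisy labels without a noise prior, then classifying) might be sketched as follows; the encoder, the clustering step, and the toy data are illustrative assumptions, not the released SDCNL implementation.

```python
# Minimal sketch of the pipeline's shape: embed posts, correct noisy labels
# with unsupervised clustering of the embeddings, then train a classifier on
# the corrected labels. Encoder, clustering choice, and data are assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

posts = [
    "placeholder post expressing low mood ...",
    "placeholder post expressing hopelessness ...",
    "placeholder post about daily struggles ...",
    "placeholder post mentioning self-harm ...",
]
noisy_labels = [0, 0, 1, 1]  # 0 = depression, 1 = suicidal ideation (from subreddit of origin)

encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(posts)

# Unsupervised label correction: cluster the embeddings into two groups.
# A full implementation would align cluster ids with the noisy labels
# (e.g., by majority vote) before treating them as corrected labels.
corrected_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)

clf = LogisticRegression(max_iter=1000).fit(embeddings, corrected_labels)
print(clf.predict(encoder.encode(["a new post to screen ..."])))
```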