Ilia Sucholutsky scite author profile

Narayan

et al. 2019

Most real-world datasets, and particularly those collected from physical systems, are full of noise, packet loss, and other imperfections. However, most specification mining, anomaly detection and other such algorithms assume, or even require, perfect data quality to function properly. Such algorithms may work in lab conditions when given clean, controlled data, but will fail in the field when given imperfect data. We propose a method for accurately reconstructing discrete temporal or sequential system traces affected by data loss, using Long Short-Term Memory Networks (LSTMs). The model works by learning to predict the next event in a sequence of events, and uses its own output as an input to continue predicting future events. As a result, this method can be used for data restoration even with streamed data. Such a method can reconstruct even long sequences of missing events, and can also help validate and improve data quality for noisy data. The output of the model will be a close reconstruction of the true data, and can be fed to algorithms that rely on clean data. We demonstrate our method by reconstructing automotive CAN traces consisting of long sequences of discrete events. We show that given even small parts of a CAN trace, our LSTM model can predict future events with an accuracy of almost 90%, and can successfully reconstruct large portions of the original trace, greatly outperforming a Markov Model benchmark. We separately feed the original, lossy, and reconstructed traces into a specification mining framework to perform downstream analysis of the effect of our method on state-of-the-art models that use these traces for understanding the behavior of complex systems.
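The self-feeding prediction loop described in this abstract can be illustrated with a minimal sketch. It is not the authors' LSTM: it assumes a first-order Markov predictor (the kind of model the abstract names as its benchmark) as a stand-in, and the event names and the helpers `fit_markov`, `predict_next`, and `fill_gap` are hypothetical.

```python
from collections import Counter, defaultdict


def fit_markov(traces):
    """Count first-order transitions: event -> Counter of next events."""
    transitions = defaultdict(Counter)
    for trace in traces:
        for current, following in zip(trace, trace[1:]):
            transitions[current][following] += 1
    return transitions


def predict_next(transitions, event):
    """Return the most frequent successor of `event`, or None if unseen."""
    successors = transitions.get(event)
    if not successors:
        return None
    return successors.most_common(1)[0][0]


def fill_gap(transitions, prefix, gap_length):
    """Extend a non-empty `prefix` by up to `gap_length` events,
    feeding each prediction back in as the next input."""
    restored = list(prefix)
    for _ in range(gap_length):
        next_event = predict_next(transitions, restored[-1])
        if next_event is None:
            break  # no known successor, so stop reconstructing
        restored.append(next_event)
    return restored


# Hypothetical event names, not real CAN identifiers.
training_traces = [
    ["ignition", "rpm", "speed", "brake", "rpm", "speed", "brake"],
    ["ignition", "rpm", "speed", "brake"],
]
model = fit_markov(training_traces)
reconstructed = fill_gap(model, ["ignition", "rpm"], gap_length=3)
print(reconstructed)
```

A sequence model such as an LSTM would replace `predict_next` with a predictor conditioned on a longer history; the loop in `fill_gap` would keep the same shape.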

Soft-Label Dataset Distillation and Text Dataset Distillation

2021

GPT is an effective tool for multilingual psychological text analysis

Rathje

Mirea

Sucholutsky

et al. 2023

Preprint

The social and behavioral sciences have been increasingly using automated text analysis to measure psychological constructs in text. We explore whether GPT, the large-language model underlying the artificial intelligence chatbot ChatGPT, can be used as a tool for automated psychological text analysis in various languages. Across 15 datasets (n = 31,789 manually annotated tweets and news headlines), we tested whether GPT-3.5 and GPT-4 can accurately detect psychological constructs (sentiment, discrete emotions, and offensiveness) across 12 languages (English, Arabic, Indonesian, and Turkish, as well as eight African languages including Swahili, Amharic, Yoruba and Kinyarwanda). We found that GPT performs much better than English-language dictionary-based text analysis (r = 0.66-0.75 for correlations between manual annotations and GPT-4, as opposed to r = 0.20-0.30 for correlations between manual annotations and dictionary methods). Further, GPT performs nearly as well as or better than several fine-tuned machine learning models, though GPT had poorer performance in African languages and in comparison to more recent fine-tuned models. Overall, GPT may be superior to many existing methods of automated text analysis, since it achieves relatively high accuracy across many languages, requires no training data, and is easy to use with simple prompts (e.g., “is this text negative?”) and little coding experience. We provide sample code for analyzing text with the GPT application programming interface. GPT and other large-language models may be the future of psychological text analysis, and may help facilitate more cross-linguistic research with understudied languages.
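The prompt-and-correlate workflow in this abstract can be sketched as follows. This is not the authors' sample code. The 1-to-7 rating scale, the function names, the toy texts, and the manual scores are all assumptions invented for illustration, and `fake_model` is a keyword rule standing in for a real API call so that the sketch runs offline.

```python
from math import sqrt


def build_prompt(text):
    """Wrap a text in a simple rating question (the scale is an assumption)."""
    return (
        "Is this text negative? Answer with a number from 1 "
        "(not at all) to 7 (very negative).\n\nText: " + text
    )


def rate_texts(texts, ask_model):
    """Rate each text; `ask_model` is any callable mapping a prompt to a reply string."""
    return [float(ask_model(build_prompt(t)).strip()) for t in texts]


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def fake_model(prompt):
    """Stand-in for a language model call: a toy keyword rule, not a real API."""
    text = prompt.split("Text: ", 1)[1].lower()
    return "6" if "terrible" in text else "2"


# Invented toy data, not taken from the paper's datasets.
texts = [
    "This is terrible news",
    "What a lovely day",
    "A terrible, awful result",
    "The meeting is at noon",
]
manual_scores = [6.5, 1.5, 6.0, 2.0]

model_scores = rate_texts(texts, fake_model)
r = pearson_r(manual_scores, model_scores)
print(round(r, 2))
```

In practice `fake_model` would be replaced by a function that sends the prompt to a hosted model and returns its reply; the evaluation step stays the same.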

Text Mining with n-gram Variables

Guenther

The Stata Journal: Promoting communications on statistics and S

2017

Text mining is the process of turning free text into numerical variables and then analyzing them with statistical techniques. We introduce the command ngram, which implements the most common approach to text mining, the “bag of words”. An n-gram is a contiguous sequence of words in a text. Broadly speaking, ngram creates hundreds or thousands of variables, each recording how often the corresponding n-gram occurs in a given text. This is more useful than it sounds. We illustrate ngram with the categorization of text answers from two open-ended questions.
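The bag-of-words idea in this abstract, one count variable per n-gram, can be sketched in a few lines. This is an illustrative Python sketch, not the Stata `ngram` command: the function names and the example answers are hypothetical, and tokenization here is plain whitespace splitting on lowercased text.

```python
from collections import Counter


def ngrams(text, n):
    """Return the contiguous word n-grams of `text`, lowercased."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_variables(texts, max_n=2):
    """Build one count column per n-gram (n = 1..max_n) seen in any text."""
    per_text = [
        Counter(g for n in range(1, max_n + 1) for g in ngrams(t, n))
        for t in texts
    ]
    vocabulary = sorted(set().union(*per_text))
    rows = [[counts[g] for g in vocabulary] for counts in per_text]
    return vocabulary, rows


# Hypothetical open-ended answers.
answers = ["the service was slow", "slow service slow reply"]
vocabulary, rows = ngram_variables(answers, max_n=2)

for name, column in zip(vocabulary, zip(*rows)):
    print(name, column)
```

Each row of `rows` is one text and each column is one n-gram, so the result can be passed to a statistical model as an ordinary numeric matrix.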

'Less Than One'-Shot Learning: Learning N Classes From M < N Samples

2021

AAAI

Deep neural networks require large training sets but suffer from high computational cost and long training times. Training on much smaller training sets while maintaining nearly the same accuracy would be very beneficial. In the few-shot learning setting, a model must learn a new class given only a small number of samples from that class. One-shot learning is an extreme form of few-shot learning where the model must learn a new class from a single example. We propose the 'less than one'-shot learning task where models must learn N new classes given only M