Classification models of the fragrance properties of chemical compounds were built using linear and non-linear methods. The dataset was divided into three classes on the basis of fragrance: apple, pineapple and rose. The three-class problem was first explored with a linear classifier, linear discriminant analysis (LDA). A non-linear machine-learning technique, the support vector machine (SVM), was subsequently investigated to obtain a more accurate prediction model. Descriptors calculated from the molecular structures alone were used to represent the characteristics of the compounds. The four-descriptor model found by SVM showed better predictive ability than LDA. The prediction accuracy of SVM for the three datasets was 96.6%, 80.0% and 100%, respectively. The results indicate that SVM can be used as a powerful modelling tool for QSAR studies and that the selected descriptors can represent the fragrances of these chemical compounds.
The emergence of large-scale pre-trained language models (PLMs), such as ChatGPT, creates opportunities for malicious actors to disseminate disinformation, necessitating the development of automated techniques for detecting machine-generated content. However, current approaches, which predominantly rely on fine-tuning a PLM, face difficulties in identifying text beyond the scope of the detector's training corpus. This is a typical situation in practical applications, as it is impossible for the training corpus to encompass every conceivable disinformation domain. To overcome these limitations, we introduce STADEE, a STAtistics-based DEEp detection method that integrates essential statistical features of text with a sequence-based deep classifier. We utilize various statistical features, such as the probability, rank, and cumulative probability of each token, as well as the information entropy of the distribution at each position. Cumulative probability is especially significant, as it is explicitly designed for nucleus sampling, currently the most prevalent text generation algorithm. To assess the efficacy of our proposed technique, we employ and develop three distinct datasets covering various domains and models: HC3-Chinese, ChatGPT-CNews, and CPM-CNews. Based on these datasets, we establish three separate experimental configurations (in-domain, out-of-domain, and in-the-wild) to evaluate the generalizability of our detectors. Experimental outcomes reveal that STADEE achieves an F1 score of 87.05% in the in-domain setting, a 9.28% improvement over conventional statistical methods. Furthermore, in both the out-of-domain and in-the-wild settings, STADEE not only surpasses traditional statistical methods but also demonstrates a 5.5% enhancement compared to fine-tuned PLMs. These findings underscore the generalizability of STADEE in detecting machine-generated text.
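The four per-token statistics named in the abstract can be illustrated with a minimal sketch. It assumes the language model's next-token distribution at a position is already available as a plain list of probabilities. The function name `token_features`, the tie-handling, and the exact definition of cumulative probability (the mass of all tokens at least as likely as the observed one) are assumptions made for illustration, not details taken from the STADEE paper.

```python
import math

def token_features(probs, token_id):
    """Return (probability, rank, cumulative probability, entropy) for one position.

    probs: next-token probabilities summing to 1 (hypothetical LM output).
    token_id: index of the token actually observed at this position.
    """
    p = probs[token_id]
    # Rank: 1 for the most likely token; counts tokens strictly more likely.
    rank = 1 + sum(1 for q in probs if q > p)
    # Cumulative probability (assumed definition): total mass of every token
    # that is at least as likely as the observed one.
    cumulative = sum(q for q in probs if q >= p)
    # Shannon entropy (in nats) of the whole distribution at this position.
    entropy = -sum(q * math.log(q) for q in probs if q > 0)
    return p, rank, cumulative, entropy

# Toy distribution over a four-token vocabulary; the observed token is index 1.
features = token_features([0.5, 0.25, 0.125, 0.125], 1)
print(features)
```

Under this definition, a token far out in the tail of the distribution gets a cumulative value close to 1, which is the kind of token nucleus sampling with a sufficiently small top-p would never emit. In a full detector, one such feature tuple per position would form the input sequence for the classifier.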