Abstract. One-class classification has important applications such as outlier and novelty detection. It is commonly tackled using density estimation techniques or by adapting a standard classification algorithm to the problem of carving out a decision boundary that describes the location of the target data. In this paper we investigate a simple method for one-class classification that combines the application of a density estimator, used to form a reference distribution, with the induction of a standard model for class probability estimation. In this method, the reference distribution is used to generate artificial data that is employed to form a second, artificial class. In conjunction with the target class, this artificial class is the basis for a standard two-class learning problem. We explain how the density function of the reference distribution can be combined with the class probability estimates obtained in this way to form an adjusted estimate of the density function of the target class. Using UCI datasets, and data from a typist recognition problem, we show that the combined model, consisting of both a density estimator and a class probability estimator, can improve on using either component technique alone when used for one-class classification. We also compare the method to one-class classification using support vector machines.
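The abstract does not state how the reference density and the class probability estimates are combined. One plausible reading follows directly from Bayes' rule: if P(T|X) is the estimated probability that X belongs to the target class, P(T) the prior of the target class in the two-class problem, and P(X|A) the reference density, then P(X|T) = ((1 - P(T)) / P(T)) * (P(T|X) / (1 - P(T|X))) * P(X|A). The sketch below illustrates that identity only; the function names, the Gaussian reference distribution, and the clipping constant are assumptions made for illustration and are not taken from the paper.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian, used here as a stand-in
    reference distribution (an assumption for illustration)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def adjusted_target_density(p_target_given_x, reference_density,
                            prior_target=0.5, eps=1e-9):
    """Combine a class probability estimate with the reference density.

    Implements P(X|T) = ((1 - P(T)) / P(T))
                        * (P(T|X) / (1 - P(T|X)))
                        * P(X|A),
    which follows from Bayes' rule for a target class T and an
    artificial class A drawn from the reference distribution.
    """
    # Clip the probability so the odds stay finite (hypothetical safeguard).
    p = min(max(p_target_given_x, eps), 1.0 - eps)
    prior_odds_against = (1.0 - prior_target) / prior_target
    posterior_odds_for = p / (1.0 - p)
    return prior_odds_against * posterior_odds_for * reference_density
```

When the classifier cannot separate the two classes and the classes are balanced (P(T|X) = 0.5, P(T) = 0.5), the adjusted estimate reduces to the reference density itself, so the class probability estimator acts as a correction on top of the density estimator.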
The ability to accurately predict the conception outcome for a future mating would be of considerable benefit to producers in deciding what mating plan (i.e., expensive semen or less expensive semen) to implement for a given cow. The objective of the present study was to use herd- and cow-level factors to predict the likelihood of conception success to a given insemination (i.e., conception outcome not including embryo loss); of particular interest in the present study was the usefulness of milk mid-infrared (MIR) spectral data in augmenting the accuracy of the prediction model. A total of 4,341 insemination records with conception outcome information from 2,874 lactations on 1,789 cows from 7 research herds for the years 2009 to 2014 were available. The data set was separated into a calibration data set and a validation data set using either of 2 approaches: (1) the calibration data set contained records from all 7 farms for the years 2009 to 2011, inclusive, and the validation data set included data from the 7 farms for the years 2012 to 2014, inclusive, or (2) the calibration data set contained records from 5 farms for all 6 yr and the validation data set contained information from the other 2 farms for all 6 yr. The prediction models were developed with 8 different machine learning algorithms in the calibration data set using standard 10-times 10-fold cross-validation and were also evaluated in the validation data set. The area under the receiver operating characteristic curve (AUC) varied from 0.487 to 0.675 across the different algorithms and scenarios investigated. Logistic regression was generally the best-performing algorithm. The AUC was generally inferior for the external validation data sets compared with the calibration data sets. The inclusion of milk MIR in the prediction model generally did not improve the accuracy of prediction. Despite the fair AUC for predicting conception outcome under the different scenarios investigated, the model provided a reasonable prediction of the likelihood of conception success when the high predicted probability instances were considered; a conception rate of 85% was evident in the top 10% of inseminations ranked on predicted probability of conception success in the validation data set.
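The conception rate in the top 10% of inseminations ranked on predicted probability can be computed along the following lines. This is a minimal sketch rather than the authors' procedure: the function name, the rounding rule for the cut-off, and the toy values are all assumptions, and the numbers are synthetic, not study data.

```python
def rate_in_top_fraction(probabilities, outcomes, fraction=0.10):
    """Observed success rate among the records with the highest
    predicted probabilities.

    probabilities: predicted probability of conception per insemination
    outcomes:      observed outcome per insemination (1 = conceived, 0 = not)
    fraction:      share of top-ranked records to keep (0.10 = top 10%)
    """
    if len(probabilities) != len(outcomes) or not probabilities:
        raise ValueError("inputs must be non-empty and of equal length")
    # Rank records from highest to lowest predicted probability.
    ranked = sorted(zip(probabilities, outcomes),
                    key=lambda pair: pair[0], reverse=True)
    # Keep at least one record (hypothetical rounding rule).
    k = max(1, round(len(ranked) * fraction))
    return sum(outcome for _, outcome in ranked[:k]) / k


# Synthetic toy values for illustration only.
toy_probabilities = [0.91, 0.15, 0.78, 0.40, 0.66, 0.32, 0.85, 0.22, 0.59, 0.47]
toy_outcomes = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
print(rate_in_top_fraction(toy_probabilities, toy_outcomes, fraction=0.5))
```

A ranking-based summary of this kind can look favourable even when the overall AUC is modest, because it only evaluates the records the model is most confident about.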
Protocol tunneling is widely used to add security and/or privacy to Internet applications. Recent research has exposed side channel vulnerabilities that leak information about tunneled protocols. We first discuss the timing side channels that have been found in protocol tunneling tools. We then show how to infer hidden Markov models (HMMs) of network protocols from timing data and use the HMMs to detect when protocols are active. Unlike previous work, the HMM approach we present requires no a priori knowledge of the protocol. To illustrate the utility of this approach, we detect the use of English or Italian in interactive SSH sessions. For this example application, keystroke-timing data associates inter-packet delays with keystrokes. We first use clustering to extract discrete information from continuous timing data. We then use the discrete symbols to infer an HMM, and finally use statistical tests to determine whether the observed timing is consistent with the language typing statistics. In our tests, if the correct window size is used, fewer than 2% of data windows are incorrectly identified. Experimental verification shows that on-line detection of language use in interactive encrypted protocol tunnels is reliable. We compare maximum likelihood and statistical hypothesis testing for detecting protocol tunneling. We also discuss how this approach is useful in monitoring mix networks like The Onion Router (Tor).
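The first step described above, turning continuous inter-packet delays into discrete symbols by clustering, might look like the sketch below. The abstract does not name the clustering algorithm; one-dimensional k-means, the quantile-based initialisation, the function names, and the delay values are all assumptions chosen for illustration.

```python
def kmeans_1d(values, k, iterations=50):
    """Cluster one-dimensional values with plain k-means (an assumed
    choice of algorithm) and return the cluster centres in ascending order."""
    ordered = sorted(values)
    n = len(ordered)
    # Deterministic start: centres at evenly spaced quantiles.
    centres = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centres[j]))
            groups[nearest].append(v)
        # Keep the old centre if a cluster ends up empty.
        updated = [sum(g) / len(g) if g else centres[j]
                   for j, g in enumerate(groups)]
        if updated == centres:
            break
        centres = updated
    return sorted(centres)


def to_symbols(values, centres):
    """Map each delay to the index of its nearest centre."""
    return [min(range(len(centres)), key=lambda j: abs(v - centres[j]))
            for v in values]


# Synthetic inter-packet delays in seconds, for illustration only.
toy_delays = [0.08, 0.10, 0.12, 0.09, 0.45, 0.50, 0.48, 1.20, 1.10, 1.25]
toy_centres = kmeans_1d(toy_delays, k=3)
print(to_symbols(toy_delays, toy_centres))
```

The resulting symbol sequence is the kind of discrete input from which a model such as an HMM could then be inferred; that inference step is not shown here.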