Improving survey specifications are causing an exponential rise in pulsar candidate numbers and data volumes. We study the candidate filters used to mitigate these problems during the past fifty years. We find that some existing methods such as applying constraints on the total number of candidates collected per observation, may have detrimental effects on the success of pulsar searches. Those methods immune to such effects are found to be ill-equipped to deal with the problems associated with increasing data volumes and candidate numbers, motivating the development of new approaches. We therefore present a new method designed for on-line operation. It selects promising candidates using a purpose-built tree-based machine learning classifier, the Gaussian Hellinger Very Fast Decision Tree (GH-VFDT), and a new set of features for describing candidates. The features have been chosen so as to i) maximise the separation between candidates arising from noise and those of probable astrophysical origin, and ii) be as survey-independent as possible. Using these features our new approach can process millions of candidates in seconds (∼1 million every 15 seconds), with high levels of pulsar recall (90%+). This technique is therefore applicable to the large volumes of data expected to be produced by the Square Kilometre Array (SKA). Use of this approach has assisted in the discovery of 20 new pulsars in data obtained during the LOFAR Tied-Array All-Sky Survey (LOTAAS).
One of the biggest challenges arising from modern large-scale pulsar surveys is the number of candidates generated. Here, we implemented several improvements to the machine learning classifier previously used by the LOFAR Tied-Array All-Sky Survey (LOTAAS) to look for new pulsars via filtering the candidates obtained during periodicity searches. To assist the machine learning algorithm, we have introduced new features which capture the frequency and time evolution of the signal and improved the signal-to-noise calculation accounting for broad profiles. We enhanced the machine learning classifier by including a third class characterising RFI instances, allowing candidates arising from RFI to be isolated, reducing the false positive return rate. We also introduced a new training dataset used by the machine learning algorithm that includes a large sample of pulsars misclassified by the previous classifier. Lastly, we developed an ensemble classifier comprised of five different Decision Trees. Taken together these updates improve the pulsar recall rate by 2.5 per cent, while also improving the ability to identify pulsars with wide pulse profiles, often misclassified by the previous classifier. The new ensemble classifier is also able to reduce the percentage of false positive candidates identified from each LOTAAS pointing from 2.5 per cent (~500 candidates) to 1.1 per cent (~220 candidates).
Searches for millisecond-duration, dispersed single pulses have become a standard tool used during radio pulsar surveys in the last decade. They have enabled the discovery of two new classes of sources: rotating radio transients and fast radio bursts. However, we are now in a regime where the sensitivity to single pulses in radio surveys is often limited more by the strong background of radio frequency interference (RFI, which can greatly increase the false-positive rate) than by the sensitivity of the telescope itself. To mitigate this problem, we introduce the Single-pulse Searcher (SpS). This is a new machine-learning classifier designed to identify astrophysical signals in a strong RFI environment, and optimized to process the large data volumes produced by the new generation of aperture array telescopes. It has been specifically developed for the LOFAR Tied-Array All-Sky Survey (LOTAAS), an ongoing survey for pulsars and fast radio transients in the northern hemisphere. During its development, SpS discovered 7 new pulsars and blindly identified ∼80 known sources. The modular design of the software offers the possibility to easily adapt it to other studies with different instruments and characteristics. Indeed, SpS has already been used in other projects, e.g. to identify pulses from the fast radio burst source FRB 121102. The software development is complete and SpS is now being used to re-process all LOTAAS data collected to date.
Abstract-Classifiers trained on data sets possessing an imbalanced class distribution are known to exhibit poor generalisation performance. This is known as the imbalanced learning problem. The problem becomes particularly acute when we consider incremental classifiers operating on imbalanced data streams, especially when the learning objective is rare class identification. As accuracy may provide a misleading impression of performance on imbalanced data, existing stream classifiers based on accuracy can suffer poor minority class performance on imbalanced streams, with the result being low minority class recall rates. In this paper we address this deficiency by proposing the use of the Hellinger distance measure, as a very fast decision tree split criterion. We demonstrate that by using Hellinger a statistically significant improvement in recall rates on imbalanced data streams can be achieved, with an acceptable increase in the false positive rate.
The accurate automated classification of variable stars into their respective subtypes is difficult. Machine learning–based solutions often fall foul of the imbalanced learning problem, which causes poor generalization performance in practice, especially on rare variable star subtypes. In previous work, we attempted to overcome such deficiencies via the development of a hierarchical machine learning classifier. This ‘algorithm-level’ approach to tackling imbalance yielded promising results on Catalina Real-Time Survey (CRTS) data, outperforming the binary and multiclass classification schemes previously applied in this area. In this work, we attempt to further improve hierarchical classification performance by applying ‘data-level’ approaches to directly augment the training data so that they better describe underrepresented classes. We apply and report results for three data augmentation methods in particular: Randomly Augmented Sampled Light curves from magnitude Error (RASLE), augmenting light curves with Gaussian Process modelling (GpFit) and the Synthetic Minority Oversampling Technique (SMOTE). When combining the ‘algorithm-level’ (i.e. the hierarchical scheme) together with the ‘data-level’ approach, we further improve variable star classification accuracy by 1–4 per cent. We found that a higher classification rate is obtained when using GpFit in the hierarchical model. Further improvement of the metric scores requires a better standard set of correctly identified variable stars, and perhaps enhanced features are needed.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.