Hellinger Distance Trees for Imbalanced Streams

Lyon, Robert; Brooke, John; Knowles, Joshua; Stappers, B. W.

doi:10.1109/icpr.2014.344

Cited by 27 publications

(32 citation statements)

References 24 publications

“…Data streams are quasi-infinite sequences of information, which are temporally ordered and indeterminable in size (Gaber et al 2005; Lyon et al 2013, 2014). Data streams are produced by many modern computer systems (Gaber et al 2005) and are likely to arise from the increasing volumes of data output by modern radio telescopes, especially the SKA.…”

Section: Stream Classification

mentioning

confidence: 99%

“…Data streams are produced by many modern computer systems (Gaber et al 2005) and are likely to arise from the increasing volumes of data output by modern radio telescopes, especially the SKA. However many of the effective supervised machine learning techniques used for candidate selection do not work with streams (Lyon et al 2014). Adapting existing methods for use with streams is challenging, it remains an active goal of data mining research (Yang & Wu 2006; Gaber et al 2007).…”

Section: Stream Classification

mentioning

confidence: 99%

“…It is designed to maximise classification performance on candidate data streams, which are heavily imbalanced in favour of the non-pulsar class. It is the first candidate selection algorithm designed to mitigate the imbalanced learning problem (He & Garcia 2009; Lyon et al 2013, 2014), known to reduce classification accuracy when one class of examples (i.e. non-pulsar) dominates the other.…”

Section: Gaussian-Hellinger Very Fast Decision Tree

mentioning

confidence: 99%

“…The Gaussian-Hellinger Very Fast Decision Tree (GH-VFDT) is an incremental stream classifier, developed specifically for the candidate selection problem (Lyon et al 2014). It is a tree-based algorithm based on the Very Fast Decision tree (VFDT) developed by Hulten et al (2001).…”

Section: Gaussian-Hellinger Very Fast Decision Tree

mentioning

confidence: 99%
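
The statements above name the GH-VFDT but do not show how a Hellinger-based split score might be computed. The following is a minimal sketch, assuming each class's feature values are summarised by a univariate Gaussian that is updated one example at a time. It uses the standard closed form for the Hellinger distance between two normal distributions. The names `RunningGaussian` and `hellinger_gaussians` are hypothetical and are not taken from the cited paper; this is not the authors' implementation.

```python
import math


class RunningGaussian:
    """Single-pass (Welford) estimate of mean and standard deviation.

    Hypothetical helper: suits streams because no examples are stored.
    """

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Sample standard deviation; undefined for fewer than two examples.
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def hellinger_gaussians(mu1, sigma1, mu2, sigma2):
    """Hellinger distance between N(mu1, sigma1^2) and N(mu2, sigma2^2).

    Returns a value in [0, 1]: 0 for identical distributions, approaching 1
    as the two distributions stop overlapping.
    """
    var_sum = sigma1 ** 2 + sigma2 ** 2
    if var_sum == 0.0:
        raise ValueError("at least one standard deviation must be positive")
    # Bhattacharyya coefficient for two univariate Gaussians.
    bc = math.sqrt(2.0 * sigma1 * sigma2 / var_sum) * math.exp(
        -((mu1 - mu2) ** 2) / (4.0 * var_sum)
    )
    return math.sqrt(max(0.0, 1.0 - bc))


# Illustrative values only: one feature, summarised separately per class.
majority, minority = RunningGaussian(), RunningGaussian()
for value in (0.1, 0.2, 0.15, 0.3, 0.25):
    majority.update(value)
for value in (2.0, 2.4, 1.9):
    minority.update(value)

score = hellinger_gaussians(majority.mean, majority.std,
                            minority.mean, minority.std)
```

A score of this kind depends only on the shape of each class's distribution, not on how many examples each class contributed, which is one reason Hellinger distance is often considered for imbalanced data.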

“…When applied to a data stream containing 10,000 non-pulsar candidates for every legitimate pulsar (HTRU data obtained by Thornton (2013)), it raised the recall rate from 30 to 86 per cent (Lyon et al 2014). This was achieved using candidate data described using the features designed by Bates et al (2012) and Thornton (2013).…”

mentioning

confidence: 99%
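
To make the recall figures quoted above concrete, the sketch below computes recall and accuracy from confusion-matrix counts. The counts are invented for illustration (100 pulsars among 1,000,000 non-pulsars, matching the quoted 1:10,000 ratio) and are not the HTRU results. The function names are hypothetical.

```python
def recall(true_pos, false_neg):
    """Fraction of real positives (pulsars) that were recovered."""
    return true_pos / (true_pos + false_neg)


def accuracy(true_pos, true_neg, false_pos, false_neg):
    """Fraction of all examples labelled correctly."""
    total = true_pos + true_neg + false_pos + false_neg
    return (true_pos + true_neg) / total


# Illustrative counts only: 100 pulsars, 1,000,000 non-pulsars,
# and no false positives in either scenario.
n_pulsars, n_non_pulsars = 100, 1_000_000

for found in (30, 86):
    missed = n_pulsars - found
    r = recall(found, missed)
    a = accuracy(found, n_non_pulsars, 0, missed)
    print(f"found={found}  recall={r:.2f}  accuracy={a:.6f}")
```

Under these assumed counts, accuracy stays above 99.99% in both scenarios even though recall nearly triples, which is why recall rather than accuracy is the informative measure on such imbalanced streams.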

Fifty years of pulsar candidate selection: from simple filters to a new principled real-time classification approach

Lyon, Stappers, Cooper, et al. 2016

Mon. Not. R. Astron. Soc.

Improving survey specifications are causing an exponential rise in pulsar candidate numbers and data volumes. We study the candidate filters used to mitigate these problems during the past fifty years. We find that some existing methods, such as applying constraints on the total number of candidates collected per observation, may have detrimental effects on the success of pulsar searches. Those methods immune to such effects are found to be ill-equipped to deal with the problems associated with increasing data volumes and candidate numbers, motivating the development of new approaches. We therefore present a new method designed for on-line operation. It selects promising candidates using a purpose-built tree-based machine learning classifier, the Gaussian Hellinger Very Fast Decision Tree (GH-VFDT), and a new set of features for describing candidates. The features have been chosen so as to i) maximise the separation between candidates arising from noise and those of probable astrophysical origin, and ii) be as survey-independent as possible. Using these features, our new approach can process millions of candidates in seconds (∼1 million every 15 seconds), with high levels of pulsar recall (90%+). This technique is therefore applicable to the large volumes of data expected to be produced by the Square Kilometre Array (SKA). Use of this approach has assisted in the discovery of 20 new pulsars in data obtained during the LOFAR Tied-Array All-Sky Survey (LOTAAS).

Section: Stream Classification

mentioning

confidence: 99%