2017
DOI: 10.1007/978-3-319-71249-9_2

Efficient Top Rank Optimization with Gradient Boosting for Supervised Anomaly Detection

Abstract: In this paper we address the anomaly detection problem in a supervised setting where positive examples might be very sparse. We tackle this task with a learning-to-rank strategy by optimizing a differentiable smoothed surrogate of the so-called Average Precision (AP). Despite its non-convexity, we show how to use it efficiently in a stochastic gradient boosting framework. We show that optimizing AP is much better suited to ranking the top alerts than state-of-the-art measures. We demonstrate on anomaly detection…
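As a concrete illustration of the idea in the abstract, the sketch below smooths AP by replacing the hard ranking indicator with a sigmoid of score differences, which makes the measure differentiable in the scores. This is a minimal, generic construction: the sharpness parameter `alpha` and the exact form of the surrogate are assumptions for illustration, not the paper's precise formulation.

```python
import numpy as np

def smoothed_average_precision(scores, labels, alpha=10.0):
    """Differentiable surrogate of Average Precision (illustrative sketch).

    The hard indicator 1[s_j > s_i] that defines the rank of example i is
    replaced by a sigmoid of the score difference, so the measure becomes
    smooth in the scores and amenable to gradient-based optimization.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = np.where(labels == 1)[0]

    def sig(x):
        return 1.0 / (1.0 + np.exp(-alpha * x))

    ap = 0.0
    for i in pos:
        diff = scores - scores[i]
        # soft rank of i: 1 + soft count of examples scored above i
        # (sig(0) = 0.5 removes the self term)
        soft_rank = 1.0 + sig(diff).sum() - 0.5
        # soft count of positives ranked at or above i (including i itself)
        soft_pos_above = 1.0 + sig(diff[pos]).sum() - 0.5
        ap += soft_pos_above / soft_rank  # soft precision at the rank of i
    return ap / max(len(pos), 1)
```

As `alpha` grows, the sigmoid approaches the hard indicator and the surrogate approaches the true (non-smooth) AP.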


Cited by 39 publications (29 citation statements)
References 20 publications
“…A more elaborate solution aims at designing differentiable versions of the previous non-smooth measures and optimizing them, e.g. as done by gradient boosting in Fréry et al. (2017) with a smooth surrogate of the Mean-AP.…”
Section: Introduction (mentioning)
confidence: 99%
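To make concrete how a differentiable ranking surrogate is plugged into gradient boosting, here is a hedged sketch using XGBoost's custom-objective hook: each boosting round, the next tree is fitted against the gradient (and Hessian) of the surrogate with respect to the current scores. For brevity it uses a pairwise logistic ranking loss as a simpler stand-in for the smoothed Mean-AP surrogate of Fréry et al. (2017); the dataset and parameters are synthetic placeholders.

```python
import numpy as np
import xgboost as xgb

def pairwise_rank_objective(preds, dtrain):
    """Custom objective: pairwise logistic ranking loss (stand-in surrogate).

    Returns the gradient and Hessian of
        L = sum_{i in P, j in N} log(1 + exp(-(s_i - s_j)))
    with respect to the current scores `preds`.
    """
    y = dtrain.get_label()
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    grad = np.zeros_like(preds)
    hess = np.zeros_like(preds)
    for i in pos:
        margin = preds[i] - preds[neg]        # s_i - s_j for every negative j
        sig = 1.0 / (1.0 + np.exp(margin))    # sigma(-(s_i - s_j))
        grad[i] += -sig.sum()
        grad[neg] += sig
        curv = sig * (1.0 - sig)
        hess[i] += curv.sum()
        hess[neg] += curv
    return grad, hess

# Synthetic, imbalanced toy data (roughly 5% positives).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > np.quantile(X[:, 0], 0.95)).astype(float)

dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({"max_depth": 3, "eta": 0.1}, dtrain,
                    num_boost_round=50, obj=pairwise_rank_objective)
```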
“…In practice, one would not search over this many different parameters simultaneously using grid search, but pick only the ones deemed most important. In this study we have used Randomized Parameter Optimization, the randomized search CV method provided by the scikit-learn [21] library. Hyperparameter tuning is an intensive optimization problem that can take several hours.…”
Section: Results (mentioning)
confidence: 99%
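The statement above refers to scikit-learn's randomized search; a minimal sketch of that workflow is shown below. The estimator, the parameter distributions, and the budget `n_iter=25` are illustrative assumptions, not the settings of the cited study.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Illustrative search space over a gradient boosting model.
param_distributions = {
    "n_estimators": randint(100, 1000),
    "learning_rate": uniform(0.01, 0.3),
    "max_depth": randint(2, 8),
    "subsample": uniform(0.5, 0.5),
}

# Toy imbalanced classification data.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=25,                     # sample 25 configurations instead of a full grid
    scoring="average_precision",   # ranking-oriented metric for the imbalanced task
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```

Sampling a fixed number of configurations keeps the tuning budget bounded, which is why randomized search is preferred when an exhaustive grid would take many hours.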
“…GB and RF differ in how the trees are assembled, namely the order in which they are grown and the way their results are combined. Gradient boosting has shown [21] great performance on real-life datasets, especially in ranking tasks, due to two major characteristics.…”
Section: Gradient Boosting Regressor (mentioning)
confidence: 99%
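A small side-by-side sketch of the contrast described above, assuming scikit-learn's implementations: the random forest grows its trees independently on bootstrap samples and averages them, while gradient boosting grows trees sequentially, each one fitted to the residuals (negative gradient) of the current ensemble. The dataset and hyperparameters are arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# Random forest: independent trees, predictions averaged.
rf = RandomForestRegressor(n_estimators=200, random_state=0)

# Gradient boosting: sequential trees, each correcting the ensemble's
# residual errors, combined additively with a shrinkage factor.
gb = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, random_state=0)

for name, model in [("random forest", rf), ("gradient boosting", gb)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R^2 = {scores.mean():.3f}")
```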
“…The gradient boosted regression technique suited our needs for a variety of reasons: it is able to capture the non-linear relationships which underlie atmospheric chemistry (Gardner and Dorling, 2000); the decision-tree-based machine learning technique is more interpretable than neural-net-based models (Kingsford and Salzberg, 2008); it has a relatively quick training time, allowing efficient cross-validation for tuning of hyperparameters; and it is highly scalable, meaning we are able to test on small subsets of the data before increasing to much longer training runs (Torlay et al., 2017). For the work described here we use the XGBoost (Chen and Guestrin, 2016; Frery et al., 2017) algorithm.…”
Section: Developing the Bias Predictor (mentioning)
confidence: 99%
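A hedged sketch of the kind of workflow described above, using the XGBoost regressor on synthetic data. The features, target, and hyperparameters are placeholders rather than those of the cited bias-prediction study.

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Hypothetical predictors and "bias" target; the real study's inputs
# (e.g. meteorological and chemical fields) are not reproduced here.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 8))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=5000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = xgb.XGBRegressor(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=5,
    subsample=0.8,
    random_state=0,
)
model.fit(X_train, y_train)
print("held-out R^2:", model.score(X_test, y_test))
```

Training on a small subset first, as the quoted statement suggests, only requires slicing `X_train` and `y_train` before scaling up to the full run.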