Detection of Adversarial Examples in Text Classification: Benchmark and Baseline via Robust Density Estimation

Yoo, KiYoon; Kim, Jangho; Jang, Jiho; Kwak, Nojun

doi:10.18653/v1/2022.findings-acl.289

Cited by 13 publications

(7 citation statements)

References 13 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…DISP performs very well on AGNEWS, which may be due to the synonyms replaced by these attack algorithms do not preserve the semantics of the original sentences well. 2) Consistently with Yoo et al (2022), FGWS works badly in the face of more subtle attacks, such as BAE and TextFooler. 3) Both RDE and MD are feature density-based methods, and in general, RDE works better than MD.…”

Section: Detection Performancementioning

confidence: 95%

“…Following the work of Yoo et al (2022), we divide the detection of adversarial samples into two scenarios. Scenario 1 will detect all adversarial samples, regardless of whether the model output is successfully changed or not.…”

Section: Detection Performancementioning

confidence: 99%

“…One of the most effective detection methods that can handle all textual attack algorithms is densitybased estimation approaches (Yoo et al, 2022;Feinman et al, 2017). These approaches are built on the assumption that the adversarial examples are not lying inside the non-adversarial data manifold.…”

Section: Introductionmentioning

confidence: 99%

“…Noticing that word-level adversarial algorithms often replace high-frequency words with low-frequency words,Mozes et al (2021) introduce FGWS algorithm to detect adversarial samples by word frequency properties and calibrate the adversarial samples to improve the model performance Yoo et al (2022). propose RDE which utilizes multivariate Gaussian distribution to model the feature density of clean samples.…”

mentioning

confidence: 99%

See 3 more Smart Citations

CASN:Class-Aware Score Network for Textual Adversarial Detection

Bao,

Zheng,

Ding

et al. 2023

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

View full text Add to dashboard Cite

Adversarial detection aims to detect adversarial samples that threaten the security of deep neural networks, which is an essential step toward building robust AI systems. Density-based estimation is widely considered as an effective technique by explicitly modeling the distribution of normal data and identifying adversarial ones as outliers. However, these methods suffer from significant performance degradation when the adversarial samples lie close to the non-adversarial data manifold. To address this limitation, we propose a score-based generative method to implicitly model the data distribution. Our approach utilizes the gradient of the log-density data distribution and calculates the distribution gap between adversarial and normal samples through multi-step iterations using Langevin dynamics. In addition, we use supervised contrastive learning to guide the gradient estimation using label information, which avoids collapsing to a single data manifold and better preserves the anisotropy of the different labeled data distributions. Experimental results on three text classification tasks upon four advanced attack algorithms show that our approach is a significant improvement (+15.2 F1 score on average against previous SOTA) over previous detection methods.

show abstract

Section: Detection Performancementioning

confidence: 95%

Section: Detection Performancementioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

mentioning

confidence: 99%

See 2 more Smart Citations

CASN:Class-Aware Score Network for Textual Adversarial Detection

Bao,

Zheng,

Ding

et al. 2023

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

View full text Add to dashboard Cite

show abstract

“…Enhancing reliability can be accomplished through the use of advanced uncertainty estimation (UE) techniques (Lakshminarayanan et al, 2017;Gal and Ghahramani, 2016;Lee et al, 2018;Liu et al, 2020;Podolskiy et al, 2021;Xin et al, 2021;Yoo et al, 2022). Promoting model fairness entails defining fairness metrics and employing special debiasing techniques (Elazar and Goldberg, 2018;Wang et al, 2019;Ravfogel et al, 2020;Han et al, 2021Han et al, , 2022aBaldini et al, 2022).…”

Section: Introductionmentioning

confidence: 99%

Uncertainty Estimation for Debiased Models: Does Fairness Hurt Reliability?

Kuzmin,

Vazhentsev,

Shelmanov

et al. 2023

Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacifi

View full text Add to dashboard Cite

When deploying a machine learning model, one should aim not only to optimize performance metrics such as accuracy but also care about model fairness and reliability. Fairness means that the model is prevented from learning spurious correlations between a target variable and socio-economic attributes, and is generally achieved by applying debiasing techniques. Model reliability stems from the ability to determine whether we can trust model predictions for the given data. This can be achieved using uncertainty estimation (UE) methods. Debiasing and UE techniques potentially interfere with each other, raising the question of whether we can achieve both reliability and fairness at the same time. This work aims to answer this question empirically based on an extensive series of experiments combining state-of-the-art UE and debiasing methods, and examining the impact on model performance, fairness, and reliability. 1 * Research was conducted while working at TII.

show abstract

Adversarial Evasion on LLMs

Guerraoui,

Pinot

2024

Large Language Models in Cybersecurity

View full text Add to dashboard Cite

While Machine Learning (ML) applications have shown impressive achievements in tasks such as computer vision, NLP, and control problems, such achievements were possible, first and foremost, in the best-case-scenario setting. Unfortunately, settings where ML applications fail unexpectedly, abound, and malicious ML application users or data contributors can trigger such failures. This problem became known as adversarial example robustness. While this field is in rapid development, some fundamental results have been uncovered, allowing some insight into how to make ML methods resilient to input and data poisoning. Such ML applications are termed adversarially robust. While the current generation of LLMs is not adversarially robust, results obtained in other branches of ML can provide insight into how to make them adversarially robust. Such insight would complement and augment ongoing empirical efforts in the same direction (red-teaming).

show abstract

Detection of Adversarial Examples in Text Classification: Benchmark and Baseline via Robust Density Estimation

Cited by 13 publications

References 13 publications

CASN:Class-Aware Score Network for Textual Adversarial Detection

CASN:Class-Aware Score Network for Textual Adversarial Detection

Uncertainty Estimation for Debiased Models: Does Fairness Hurt Reliability?

Adversarial Evasion on LLMs

Contact Info

Product

Resources

About