Authors:
(1) Raghuveer Peri, AWS AI Labs, Amazon, with equal contribution ([email protected]);
(2) Sai Muralidhar Jayanthi, AWS AI Labs, Amazon, with equal contribution;
(3) Srikanth Ronanki, AWS AI Labs, Amazon;
(4) Anshu Bhatia, AWS AI Labs, Amazon;
(5) Karel Mundnich, AWS AI Labs, Amazon;
(6) Saket Dingliwal, AWS AI Labs, Amazon;
(7) Nilaksh Das, AWS AI Labs, Amazon;
(8) Zejiang Hou, AWS AI Labs, Amazon;
(9) Goeric Huybrechts, AWS AI Labs, Amazon;
(10) Srikanth Vishnubhotla, AWS AI Labs, Amazon;
(11) Daniel Garcia-Romero, AWS AI Labs, Amazon;
(12) Sundararajan Srinivasan, AWS AI Labs, Amazon;
(13) Kyu J Han, AWS AI Labs, Amazon;
(14) Katrin Kirchhoff, AWS AI Labs, Amazon.
Table of Links
Part 1: Abstract & Introduction
Part 2: Background
Part 3: Attacks & Countermeasures
Part 4: Experimental Setup
Part 5: Datasets & Evaluation
Part 6: Attack, Countermeasure Parameters, & Baseline: Random Perturbations
Part 7: Results & Discussion
Part 8: Transfer Attacks & Countermeasures
Part 9: Conclusion, Limitations, & Ethics Statement
Part 10: Appendix: Audio Encoder Pre-training & Evaluation
Part 11: Appendix: Cross-prompt attacks, Training Data Ablations, & Impact of random noise on helpfulness
Part 12: Appendix: Adaptive attacks & Qualitative Examples
Abstract
Integrated Speech and Large Language Models (SLMs) that can follow speech instructions and generate relevant text responses have gained popularity lately. However, the safety and robustness of these models remain largely unclear. In this work, we investigate the potential vulnerabilities of such instruction-following speech-language models to adversarial attacks and jailbreaking. Specifically, we design algorithms that can generate adversarial examples to jailbreak SLMs in both white-box and black-box attack settings without human involvement. Additionally, we propose countermeasures to thwart such jailbreaking attacks. Our models, trained on dialog data with speech instructions, achieve state-of-the-art performance on the spoken question-answering task, scoring over 80% on both safety and helpfulness metrics. Despite safety guardrails, experiments on jailbreaking demonstrate the vulnerability of SLMs to adversarial perturbations and transfer attacks, with average attack success rates of 90% and 10%, respectively, when evaluated on a dataset of carefully designed harmful questions spanning 12 different toxic categories.[1] However, we demonstrate that our proposed countermeasures significantly reduce the attack success rate.
1. Introduction
As large language models (LLMs) obtain broad and diverse capabilities, it is imperative to understand and mitigate any potential harm caused by them, as well as prevent their misuse by malicious actors (Bender et al., 2021; Bai et al., 2022; OpenAI, 2024). LLM developers have begun to train models explicitly for “safety alignment” to deter them from producing unsafe responses (Askell et al., 2021). However, these LLMs have been found to be susceptible to adversarial attacks, where carefully crafted perturbations of the prompts are used to jailbreak the models’ safety training (Zou et al., 2023). More recently, visual language models (VLMs) have also been shown to be vulnerable to such attacks, where the attacks are performed on the image modality (Carlini et al., 2023; Qi et al., 2023). In this work, we investigate the vulnerability of speech language models’ (SLMs) safety guardrails to adversarial perturbations of the input speech signal, and explore countermeasures against such attacks. In particular, we assess SLMs through the lens of the spoken question-answering (Spoken QA) task and investigate jailbreaking their safety guardrails, while also considering their overall utility (helpfulness) and the relevance of the produced responses to the question. We perform extensive experiments using different adversarial threat scenarios, including white-box and transfer-based attacks. We show that a malicious adversary with full (white-box) access to an SLM’s gradients can jailbreak its safety training using barely perceptible perturbations on the input audio. Though weaker than white-box attacks, we also demonstrate that perturbations generated using one model transfer to other models, and that different model architectures show different levels of vulnerability. We further propose countermeasures against these adversarial threats and show that adding random noise (a simplified version of the randomized smoothing defense (Cohen et al., 2019)) provides reasonable robustness against the attacks.
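To make the threat model concrete, the sketch below outlines a PGD-style white-box perturbation of the input waveform together with the random-noise countermeasure mentioned above. This is a minimal illustration under assumed interfaces, not the paper's exact algorithm: `model`, `target_response_loss`, the step size `alpha`, the perturbation budget `epsilon`, and the noise level `sigma` are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def target_response_loss(model, audio, target_ids):
    # Hypothetical interface: the SLM returns per-token logits for an
    # attacker-chosen target response conditioned on the speech input.
    logits = model(audio, target_ids)            # [num_target_tokens, vocab]
    return F.cross_entropy(logits, target_ids)

def pgd_audio_attack(model, waveform, target_ids,
                     epsilon=0.002, alpha=5e-4, steps=100):
    """Craft a barely perceptible, L-inf-bounded perturbation that pushes
    the model toward the target (unsafe) response."""
    delta = torch.zeros_like(waveform, requires_grad=True)
    for _ in range(steps):
        loss = target_response_loss(model, waveform + delta, target_ids)
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()   # descend the target loss
            delta.clamp_(-epsilon, epsilon)      # keep the perturbation small
        delta.grad.zero_()
    return (waveform + delta).detach()

def noisy_inference(model, waveform, sigma=0.01):
    """Countermeasure sketch: add Gaussian noise to the input before
    decoding, a simplified form of randomized smoothing."""
    return model.generate(waveform + sigma * torch.randn_like(waveform))
```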
We summarize our contributions below:
- To our knowledge, this is the first study examining the potential safety limitations of unified speech and language models with respect to jailbreaking.
- We present a setup to comprehensively benchmark the safety alignment and utility of SLMs, and characterize the vulnerability of such models and the effectiveness of adversarial speech perturbations in jailbreaking their safety guardrails.
- We explore the transferability of adversarial attacks across models, assuming various levels of information available to an attacker, and consequently present simple yet effective countermeasures to improve the adversarial robustness of SLMs.
[1] Content Warning: This paper contains examples of harmful language that might be disturbing to some readers.