Authors:
(1) Raghuveer Peri, AWS AI Labs, Amazon, with equal contribution ([email protected]);
(2) Sai Muralidhar Jayanthi, AWS AI Labs, Amazon, with equal contribution;
(3) Srikanth Ronanki, AWS AI Labs, Amazon;
(4) Anshu Bhatia, AWS AI Labs, Amazon;
(5) Karel Mundnich, AWS AI Labs, Amazon;
(6) Saket Dingliwal, AWS AI Labs, Amazon;
(7) Nilaksh Das, AWS AI Labs, Amazon;
(8) Zejiang Hou, AWS AI Labs, Amazon;
(9) Goeric Huybrechts, AWS AI Labs, Amazon;
(10) Srikanth Vishnubhotla, AWS AI Labs, Amazon;
(11) Daniel Garcia-Romero, AWS AI Labs, Amazon;
(12) Sundararajan Srinivasan, AWS AI Labs, Amazon;
(13) Kyu J Han, AWS AI Labs, Amazon;
(14) Katrin Kirchhoff, AWS AI Labs, Amazon.
Table of Links
Part 1: Abstract & Introduction
Part 2: Background
Part 3: Attacks & Countermeasures
Part 4: Experimental Setup
Part 5: Datasets & Evaluation
Part 6: Attack, Countermeasure Parameters, & Baseline: Random Perturbations
Part 7: Results & Discussion
Part 8: Transfer Attacks & Countermeasures
Part 9: Conclusion, Limitations, & Ethics Statement
Part 10: Appendix: Audio Encoder Pre-training & Evaluation
Part 11: Appendix: Cross-prompt attacks, Training Data Ablations, & Impact of random noise on helpfulness
Part 12: Appendix: Adaptive attacks & Qualitative Examples
Abstract
Integrated Speech and Large Language Models (SLMs) that can follow speech instructions and generate relevant text responses have gained popularity lately. However, the safety and robustness of these models remain largely unclear. In this work, we investigate the potential vulnerabilities of such instruction-following speech-language models to adversarial attacks and jailbreaking. Specifically, we design algorithms that can generate adversarial examples to jailbreak SLMs in both white-box and black-box attack settings without human involvement. Additionally, we propose countermeasures to thwart such jailbreaking attacks. Our models, trained on dialog data with speech instructions, achieve state-of-the-art performance on the spoken question-answering task, scoring over 80% on both safety and helpfulness metrics. Despite safety guardrails, experiments on jailbreaking demonstrate the vulnerability of SLMs to adversarial perturbations and transfer attacks, with average attack success rates of 90% and 10%, respectively, when evaluated on a dataset of carefully designed harmful questions spanning 12 different toxic categories.[1] However, we demonstrate that our proposed countermeasures significantly reduce the attack success rate.
1. Introduction
As large language models (LLMs) obtain broad and diverse capabilities, it is imperative to understand and mitigate any potential harm caused by them, as well as prevent their misuse by malicious actors (Bender et al., 2021; Bai et al., 2022; OpenAI, 2024). LLM developers have begun to train models explicitly for “safety alignment” to deter them from producing unsafe responses (Askell et al., 2021). However, these LLMs have been found to be susceptible to adversarial attacks, where carefully crafted perturbations of the prompts are used to jailbreak the models’ safety training (Zou et al., 2023). More recently, visual language models (VLMs) have also been shown to be vulnerable to such attacks, where the attacks are performed on the image modality (Carlini et al., 2023; Qi et al., 2023). In this work, we investigate the vulnerability of speech language models’ (SLMs) safety guardrails to adversarial perturbations of the input speech signal, and explore countermeasures against such attacks. In particular, we assess SLMs through the lens of the spoken question-answering (Spoken QA) task and investigate jailbreaking their safety guardrails, while also considering their overall utility (helpfulness) and the relevance of the produced responses to the question. We perform extensive experiments using different adversarial threat scenarios, including white-box and transfer-based attacks. We show that a malicious adversary with full (white-box) access to an SLM’s gradients can jailbreak its safety training using barely perceptible perturbations on the input audio. Though weaker than white-box attacks, we also demonstrate that perturbations generated using one model transfer to other models, and that different model architectures show different levels of vulnerability. We further propose countermeasures against these adversarial threats and show that adding random noise (a simplified version of the randomized smoothing defense (Cohen et al., 2019)) provides reasonable robustness against the attacks.
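To make the threat model concrete, the sketch below outlines a PGD-style white-box perturbation of the input waveform together with the random-noise countermeasure mentioned above. This is a minimal illustration under assumed interfaces, not the paper's exact algorithm: `model`, `target_response_loss`, the step size `alpha`, the perturbation budget `epsilon`, and the noise level `sigma` are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def target_response_loss(model, audio, target_ids):
    # Hypothetical interface: the SLM returns per-token logits for an
    # attacker-chosen target response conditioned on the speech input.
    logits = model(audio, target_ids)            # [num_target_tokens, vocab]
    return F.cross_entropy(logits, target_ids)

def pgd_audio_attack(model, waveform, target_ids,
                     epsilon=0.002, alpha=5e-4, steps=100):
    """Craft a barely perceptible, L-inf-bounded perturbation that pushes
    the model toward the target (unsafe) response."""
    delta = torch.zeros_like(waveform, requires_grad=True)
    for _ in range(steps):
        loss = target_response_loss(model, waveform + delta, target_ids)
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()   # descend the target loss
            delta.clamp_(-epsilon, epsilon)      # keep the perturbation small
        delta.grad.zero_()
    return (waveform + delta).detach()

def noisy_inference(model, waveform, sigma=0.01):
    """Countermeasure sketch: add Gaussian noise to the input before
    decoding, a simplified form of randomized smoothing."""
    return model.generate(waveform + sigma * torch.randn_like(waveform))
```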
We summarize our contributions below:
- To our knowledge, this is the first study examining the potential safety limitations of unified speech and language models with respect to jailbreaking.
- We present a setup to comprehensively benchmark the safety alignment and utility of SLMs, and characterize the vulnerability of such models and the effectiveness of adversarial speech perturbations in jailbreaking their safety guardrails.
- We explore the transferability of adversarial attacks across models, assuming various levels of information available to an attacker, and consequently present simple yet effective countermeasures to improve the adversarial robustness of SLMs.
[1] Content Warning: This paper contains examples of harmful language that might be disturbing to some readers.