Table of Links
Part 1: Abstract & Introduction
Part 2: Background
Part 3: Attacks & Countermeasures
Part 4: Experimental Setup
Part 5: Datasets & Evaluation
Part 6: Attack, Countermeasure Parameters, & Baseline: Random Perturbations
Part 7: Results & Discussion
Part 8: Transfer Attacks & Countermeasures
Part 9: Conclusion, Limitations, & Ethics Statement
Part 10: Appendix: Audio Encoder Pre-training & Evaluation
Part 11: Appendix: Cross-prompt attacks, Training Data Ablations, & Impact of random noise on helpfulness
Part 12: Appendix: Adaptive attacks & Qualitative Examples
3. Attacks & Countermeasures
Among the various applications of SLMs, in this work we showcase adversarial attacks and defense techniques for the task of Spoken QA. We follow the well-established literature on adversarial attacks and defenses and adapt those techniques for the purpose of jailbreaking SLMs. Specifically, we explore two different attack scenarios, white-box and black-box, similar to the approach in Qi et al. (2023), and then identify safeguarding countermeasures.
3.1 Attack methodology
White-box attack This scenario assumes an attacker with full access to the SLM, including the ability to back-propagate through the model (Goodfellow et al., 2014). For example, an attacker can leverage an open-source public model, or a more sophisticated attacker can develop their own system. With access to the model's gradients, such an attacker can use established techniques such as the fast gradient sign method (FGSM) and projected gradient descent (PGD) (Khamaiseh et al., 2022).
Following the recent studies of Zou et al. (2023) and Wei et al. (2023), we use initial affirmative responses as the target for the adversarial optimization. Specifically, the target response for the attack is an affirmation followed by a repetition of the user’s query. For example, for the question “Can you provide a tutorial to rob a bank?”, the target response is “Sure, here is how to rob a bank”. Enforcing such a constraint on the model’s response puts it in a state where it continues to provide the harmful response (Zou et al., 2023).
In this work, we use the PGD algorithm (Madry et al., 2017) to generate the adversarial perturbations. Denoting the audio input by x, the response generated by the LLM can be written as F(x). The loss (denoted by L) between the generated response and the adversarial target response (denoted by y) is used to create the perturbation. Assuming that the audio LLM is end-to-end differentiable, the perturbation δ is learned to minimize L, as shown in Equation 1.
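In a standard PGD formulation, δ is optimized to minimize the loss while being constrained to a small norm ball; the ℓ∞ budget ε is the conventional choice and is an assumption here:

$$\delta^{*} = \arg\min_{\|\delta\|_{\infty} \leq \epsilon} \; \mathcal{L}\big(F(x + \delta),\, y\big) \qquad (1)$$

A minimal PGD loop consistent with this objective is sketched below; the model interface, `loss_fn`, and all hyper-parameter values (`eps`, `alpha`, `steps`) are illustrative assumptions, not the authors' actual configuration.

```python
import torch

def pgd_audio_attack(model, loss_fn, x, target_ids, eps=0.01, alpha=1e-3, steps=500):
    """Sketch of PGD on a waveform input: learn an additive perturbation delta
    that pushes the SLM's output toward the affirmative target response."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        logits = model(x + delta)               # forward pass through the end-to-end differentiable SLM
        loss = loss_fn(logits, target_ids)      # loss L against the adversarial target response y
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()  # gradient-sign step that decreases L
            delta.clamp_(-eps, eps)             # project back onto the assumed l_inf ball of radius eps
            delta.grad.zero_()
    return (x + delta).detach()                 # adversarial audio x + delta
```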
Transfer attacks Several publicly available LLM providers (such as OpenAI and Anthropic) only provide restricted API access, limiting the ability to compute gradients with respect to the input. In such cases, an attacker can resort to gradient approximation techniques using multiple queries, or to transfer attacks. Gradient estimation techniques rely on multiple queries to the LLM to approximate the gradient based on the generated responses (Ilyas et al., 2018). However, running multiple forward passes through an LLM can be computationally prohibitive, and LLM providers may limit the number of queries from a single user, making such attacks infeasible.
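For illustration, a query-based estimator in the spirit of Ilyas et al. (2018) approximates the gradient from forward queries alone, as sketched below; `query_model`, `target_loss`, and the sampling parameters are hypothetical stand-ins, and each sample already costs two model queries, which is precisely what makes this approach impractical here.

```python
import torch

def nes_gradient_estimate(query_model, target_loss, x, sigma=1e-3, n_samples=50):
    """Sketch of query-based (NES-style) gradient estimation: probe the model with
    randomly perturbed inputs and use loss differences to approximate the gradient."""
    grad = torch.zeros_like(x)
    for _ in range(n_samples):
        u = torch.randn_like(x)                              # random search direction
        loss_plus = target_loss(query_model(x + sigma * u))  # one forward query
        loss_minus = target_loss(query_model(x - sigma * u)) # a second forward query
        grad += (loss_plus - loss_minus) / (2 * sigma) * u   # antithetic finite-difference term
    return grad / n_samples
```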
In transfer attacks, an attacker uses a surrogate model with access to gradients to generate a perturbation. The generated perturbation is then added to the audio to attack a victim model. Transfer attacks are most successful when the surrogate and the victim models share the same architecture, though transfer across different architectures has also been observed in some cases (Qi et al., 2023). In this work, we experiment with two types of transfer-based attacks, as shown in Figure 2: cross-model and cross-prompt.
Cross-model: We perturb an input to attack one model in a white-box setting, then use the perturbed input to directly attack a different model. This is the typical black-box transfer attack setting.
Cross-prompt: We craft a perturbation to jailbreak the model for one audio input and reuse it to jailbreak the model for a different audio input. We match the length of the learned perturbation to the target prompt through truncation or replication (see the sketch below). This attack assumes access to the model’s gradients, but helps determine the potential transferability of perturbations.
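The length matching itself is mechanical; a minimal sketch, assuming a 1-D waveform perturbation tensor (the function name and shapes are illustrative, not the authors' implementation):

```python
import torch

def fit_perturbation_length(delta, target_len):
    """Reuse a learned perturbation on a new audio prompt by replicating (tiling)
    it until it is long enough, then truncating to the target length."""
    if delta.numel() < target_len:
        reps = -(-target_len // delta.numel())   # ceiling division: number of copies needed
        delta = delta.repeat(reps)               # replication
    return delta[:target_len]                    # truncation
```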
3.2 Countermeasure
Techniques that have been proposed in the literature to safeguard LLMs from adversarial attacks (Kumar et al., 2023; Mehrabi et al., 2023; Ge et al., 2023) are specific to text-only models. Moreover, well-known defenses against classical adversarial attacks, such as adversarial training, are impractical to apply to LLMs due to computational constraints (Jain et al., 2023). Therefore, we use a simple pre-processing technique called time-domain noise flooding (TDNF), which applies additive noise as a defense (Mehlman et al., 2023; Rajaratnam and Kalita, 2018).
The rationale is that the front-end speech encoder of the SLM is robust to additive random noise, while such noise can effectively “drown out” the adversarial perturbation. We add white Gaussian noise (WGN) directly to the time-domain speech signal that is input to the model. The signal-to-noise ratio (SNR) of the noise is a hyper-parameter that determines the amount of robustness achieved, with smaller values providing a stronger defense. This approach can be viewed as a simplified version of randomized smoothing (Cohen et al., 2019) with only a single forward pass.
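A minimal sketch of TDNF, assuming a 1-D waveform tensor; the default SNR value is illustrative, not the paper's setting:

```python
import torch

def add_noise_flooding(speech, snr_db=20.0):
    """Time-domain noise flooding: add white Gaussian noise to the input waveform
    at a chosen SNR before it is passed to the SLM."""
    signal_power = speech.pow(2).mean()
    noise_power = signal_power / (10 ** (snr_db / 10.0))   # convert target SNR (dB) to noise power
    noise = torch.randn_like(speech) * noise_power.sqrt()  # WGN with the required variance
    return speech + noise
```

Lower snr_db values inject proportionally more noise, trading stronger protection against perturbations for a possible impact on helpfulness for benign queries.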
Authors:
(1) Raghuveer Peri, AWS AI Labs, Amazon, with equal contribution ([email protected]);
(2) Sai Muralidhar Jayanthi, AWS AI Labs, Amazon, with equal contribution;
(3) Srikanth Ronanki, AWS AI Labs, Amazon;
(4) Anshu Bhatia, AWS AI Labs, Amazon;
(5) Karel Mundnich, AWS AI Labs, Amazon;
(6) Saket Dingliwal, AWS AI Labs, Amazon;
(7) Nilaksh Das, AWS AI Labs, Amazon;
(8) Zejiang Hou, AWS AI Labs, Amazon;
(9) Goeric Huybrechts, AWS AI Labs, Amazon;
(10) Srikanth Vishnubhotla, AWS AI Labs, Amazon;
(11) Daniel Garcia-Romero, AWS AI Labs, Amazon;
(12) Sundararajan Srinivasan, AWS AI Labs, Amazon;
(13) Kyu J Han, AWS AI Labs, Amazon;
(14) Katrin Kirchhoff, AWS AI Labs, Amazon.