Table of Links
Part 1: Abstract & Introduction
Part 2: Background
Part 3: Attacks & Countermeasures
Part 4: Experimental Setup
Part 5: Datasets & Evaluation
Part 6: Attack, Countermeasure Parameters, & Baseline: Random Perturbations
Part 7: Results & Discussion
Part 8: Transfer Attacks & Countermeasures
Part 9: Conclusion, Limitations, & Ethics Statement
Part 10: Appendix: Audio Encoder Pre-training & Evaluation
Part 11: Appendix: Cross-prompt attacks, Training Data Ablations, & Impact of random noise on helpfulness
Part 12: Appendix: Adaptive attacks & Qualitative Examples
4.5 Attack and countermeasure parameters
We use a step size of α = 0.00001 (Eq. 1), as we empirically found that this setting leads to stable attack convergence. We experiment only with unconstrained attacks (i.e., without the Π_{x,ε} projection operation in Equation 1), since we observed that even without constraints the attacks were successful at high SNRs, rendering any explicit constraint ineffective. We run the attack for a maximum of T = 100 iterations using a cross-entropy loss objective. We employ early stopping at the first occurrence of an unsafe and relevant response, and additionally use a human preference model[11] to filter out gibberish responses produced by the model during attacks. For the countermeasures, we experiment with several settings of TDNF by using four different SNR values: 24, 30, 48, and 60 dB.
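For concreteness, a minimal sketch of such an unconstrained attack loop is shown below. The sign-based update and the helper names (`model`, `unsafe_target_ids`, `is_unsafe_and_relevant`, `reward_model_score`, `REWARD_THRESHOLD`) are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn.functional as F

ALPHA = 1e-5            # step size alpha from Eq. 1
MAX_ITERS = 100         # maximum number of attack iterations T
REWARD_THRESHOLD = 0.0  # assumed cutoff for the preference-model gibberish filter

def unconstrained_attack(model, audio, prompt_ids, unsafe_target_ids):
    """Gradient-based audio perturbation, without the projection step of Eq. 1."""
    delta = torch.zeros_like(audio, requires_grad=True)  # adversarial perturbation

    for _ in range(MAX_ITERS):
        # Cross-entropy loss between the model's logits and the unsafe target tokens
        logits = model(audio + delta, prompt_ids)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                               unsafe_target_ids.view(-1))
        loss.backward()

        with torch.no_grad():
            # Unconstrained update: no projection onto an epsilon-ball around the input
            delta -= ALPHA * delta.grad.sign()
            delta.grad.zero_()

        # Early stopping at the first unsafe *and* relevant response; a human
        # preference (reward) model filters out gibberish completions.
        # `is_unsafe_and_relevant` and `reward_model_score` are placeholder helpers.
        response = model.generate(audio + delta, prompt_ids)
        if is_unsafe_and_relevant(response) and reward_model_score(response) > REWARD_THRESHOLD:
            break

    return (audio + delta).detach()
```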
4.6 Baseline: Random perturbations
We apply random perturbations at varying SNRs to understand whether non-adversarial perturbations break the safety alignment of the LLMs. This serves as a simple baseline to characterize the robustness of the safety alignment of the models we consider. In particular, we apply white Gaussian noise (WGN) at two different SNRs to each of the audio files. We repeat this process three times and consider an audio file jailbroken if any one of the three responses is unsafe and relevant.
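As a rough illustration of this baseline (not the exact tooling used), white Gaussian noise can be added to a waveform at a prescribed SNR as follows; the function name `add_wgn` and the NumPy-based implementation are assumptions.

```python
import numpy as np

def add_wgn(audio: np.ndarray, snr_db: float, rng: np.random.Generator | None = None) -> np.ndarray:
    """Return `audio` corrupted with white Gaussian noise at the given SNR (in dB)."""
    rng = np.random.default_rng() if rng is None else rng
    signal_power = np.mean(audio ** 2)
    # SNR(dB) = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR/10)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=audio.shape)
    return audio + noise
```

Each utterance would be perturbed three times at a given SNR, and counted as jailbroken if any of the three resulting responses is unsafe and relevant.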
Authors:
(1) Raghuveer Peri, AWS AI Labs, Amazon, with equal contribution ([email protected]);
(2) Sai Muralidhar Jayanthi, AWS AI Labs, Amazon, with equal contribution;
(3) Srikanth Ronanki, AWS AI Labs, Amazon;
(4) Anshu Bhatia, AWS AI Labs, Amazon;
(5) Karel Mundnich, AWS AI Labs, Amazon;
(6) Saket Dingliwal, AWS AI Labs, Amazon;
(7) Nilaksh Das, AWS AI Labs, Amazon;
(8) Zejiang Hou, AWS AI Labs, Amazon;
(9) Goeric Huybrechts, AWS AI Labs, Amazon;
(10) Srikanth Vishnubhotla, AWS AI Labs, Amazon;
(11) Daniel Garcia-Romero, AWS AI Labs, Amazon;
(12) Sundararajan Srinivasan, AWS AI Labs, Amazon;
(13) Kyu J Han, AWS AI Labs, Amazon;
(14) Katrin Kirchhoff, AWS AI Labs, Amazon.
[11] https://huggingface.co/OpenAssistant/reward-model-electra-large-discriminator