Table of Links
Part 1: Abstract & Introduction
Part 2: Background
Part 3: Attacks & Countermeasures
Part 4: Experimental Setup
Part 5: Datasets & Evaluation
Part 6: Attack, Countermeasure Parameters, & Baseline: Random Perturbations
Part 7: Results & Discussion
Part 8: Transfer Attacks & Countermeasures
Part 9: Conclusion, Limitations, & Ethics Statement
Part 10: Appendix: Audio Encoder Pre-training & Evaluation
Part 11: Appendix: Cross-prompt attacks, Training Data Ablations, & Impact of random noise on helpfulness
Part 12: Appendix: Adaptive attacks & Qualitative Examples
4.5 Attack and countermeasure parameters
We use a step size of α = 0.00001 (Eq. 1), as we empirically found that this setting leads to stable attack convergence. We experiment only with unconstrained attacks (i.e., without the Π_{x,ε} projection operation in Equation 1), since we observed that even without constraints the attacks were successful at high SNRs, rendering any explicit constraint ineffective. We run the attack for a maximum of T = 100 iterations using a cross-entropy loss objective. We employ early stopping at the first occurrence of an unsafe and relevant response, and additionally use a human preference model[11] to filter out gibberish responses produced by the model during attacks. For the countermeasures, we experiment with several settings of TDNF by using four different SNR values: 24, 30, 48, and 60 dB.
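For concreteness, a minimal sketch of such an unconstrained attack loop is shown below. The sign-based update and the helper names (`model`, `unsafe_target_ids`, `is_unsafe_and_relevant`, `reward_model_score`, `REWARD_THRESHOLD`) are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn.functional as F

ALPHA = 1e-5            # step size alpha from Eq. 1
MAX_ITERS = 100         # maximum number of attack iterations T
REWARD_THRESHOLD = 0.0  # assumed cutoff for the preference-model gibberish filter

def unconstrained_attack(model, audio, prompt_ids, unsafe_target_ids):
    """Gradient-based audio perturbation, without the projection step of Eq. 1."""
    delta = torch.zeros_like(audio, requires_grad=True)  # adversarial perturbation

    for _ in range(MAX_ITERS):
        # Cross-entropy loss between the model's logits and the unsafe target tokens
        logits = model(audio + delta, prompt_ids)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                               unsafe_target_ids.view(-1))
        loss.backward()

        with torch.no_grad():
            # Unconstrained update: no projection onto an epsilon-ball around the input
            delta -= ALPHA * delta.grad.sign()
            delta.grad.zero_()

        # Early stopping at the first unsafe *and* relevant response; a human
        # preference (reward) model filters out gibberish completions.
        # `is_unsafe_and_relevant` and `reward_model_score` are placeholder helpers.
        response = model.generate(audio + delta, prompt_ids)
        if is_unsafe_and_relevant(response) and reward_model_score(response) > REWARD_THRESHOLD:
            break

    return (audio + delta).detach()
```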
4.6 Baseline: Random perturbations
We apply random perturbations at varying SNRs to understand whether non-adversarial perturbations break the safety alignment of the LLMs. This serves as a simple baseline to characterize the robustness of the safety alignment of the models we consider. In particular, we apply white Gaussian noise (WGN) at two different SNRs to each of the audio files. We repeat this process three times and consider an audio file jailbroken if any one of the three responses is unsafe and relevant.
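As a rough illustration of this baseline (not the exact tooling used), white Gaussian noise can be added to a waveform at a prescribed SNR as follows; the function name `add_wgn` and the NumPy-based implementation are assumptions.

```python
import numpy as np

def add_wgn(audio: np.ndarray, snr_db: float, rng: np.random.Generator | None = None) -> np.ndarray:
    """Return `audio` corrupted with white Gaussian noise at the given SNR (in dB)."""
    rng = np.random.default_rng() if rng is None else rng
    signal_power = np.mean(audio ** 2)
    # SNR(dB) = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR/10)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=audio.shape)
    return audio + noise
```

Each utterance would be perturbed three times at a given SNR, and counted as jailbroken if any of the three resulting responses is unsafe and relevant.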
Authors:
(1) Raghuveer Peri, AWS AI Labs, Amazon, with equal contribution ([email protected]);
(2) Sai Muralidhar Jayanthi, AWS AI Labs, Amazon, with equal contribution;
(3) Srikanth Ronanki, AWS AI Labs, Amazon;
(4) Anshu Bhatia, AWS AI Labs, Amazon;
(5) Karel Mundnich, AWS AI Labs, Amazon;
(6) Saket Dingliwal, AWS AI Labs, Amazon;
(7) Nilaksh Das, AWS AI Labs, Amazon;
(8) Zejiang Hou, AWS AI Labs, Amazon;
(9) Goeric Huybrechts, AWS AI Labs, Amazon;
(10) Srikanth Vishnubhotla, AWS AI Labs, Amazon;
(11) Daniel Garcia-Romero, AWS AI Labs, Amazon;
(12) Sundararajan Srinivasan, AWS AI Labs, Amazon;
(13) Kyu J Han, AWS AI Labs, Amazon;
(14) Katrin Kirchhoff, AWS AI Labs, Amazon.
[11] https://huggingface.co/OpenAssistant/reward-model-electra-large-discriminator