Table of Links
Part 1: Abstract & Introduction
Part 2: Background
Part 3: Attacks & Countermeasures
Part 4: Experimental Setup
Part 5: Datasets & Evaluation
Part 6: Attack, Countermeasure Parameters, & Baseline: Random Perturbations
Part 7: Results & Discussion
Part 8: Transfer Attacks & Countermeasures
Part 9: Conclusion, Limitations, & Ethics Statement
Part 10: Appendix: Audio Encoder Pre-training & Evaluation
Part 11: Appendix: Cross-prompt attacks, Training Data Ablations, & Impact of random noise on helpfulness
Part 12: Appendix: Adaptive attacks & Qualitative Examples
5. Results & Discussion
In this section, we first analyze the safety alignment of several SLMs, then present the results of sample-specific and transfer-based attacks, and finally demonstrate the effectiveness of the TDNF defense.
5.1 Safety-aligned SLMs
We compare the efficacies of different SLMs trained using the SpeechVerse architecture against the public SLM SpeechGPT (Zhang et al., 2023) in Table 2. In addition, we compare the out-of-the-box performance of text-only pre-trained LLMs, as well as fine-tuned Flan-T5-XL (3B) and Mistral-7B LLMs that were safety aligned with the textual form of the Spoken QA data.
Our results demonstrate that our SLM models outperform public models and closely match the best text-only LLMs on safety and relevance. As hypothesized, SLM models pre-adapted with ASR match or outperform their counterparts on all metrics, demonstrating better recognition of the speech modality. We observe that the helpfulness of the SLM models is bounded by the abilities of the pre-trained LLM, even though they are tuned with general instruction data during cross-modal adaptation. Furthermore, our training mechanism retains almost all of the helpfulness of the pre-trained LLMs while additionally infusing spoken instruction understanding and safety alignment into the SLMs.[12] Compared to SpeechGPT (Zhang et al., 2023), our best model shows more than a 40% improvement in safety and 20% in helpfulness, demonstrating better recognition quality and speech-instruction-following capability. Although other public models such as LLaSM (Shu et al., 2023) and Pengi (Deshmukh et al., 2023) can also perceive speech instructions, we found them to be insufficiently safety aligned and therefore left them out of our benchmarking.
5.2 Sample-specific white-box attacks
In Table 3, we present results of random noise perturbations at two SNR values, along with sample-specific adversarial attacks on four in-house trained SLM models. We report results only on the samples that were originally found to be safe for each model (as reported in Table 2) out of the 360 audios considered. Random perturbations demonstrate limited effectiveness in jailbreaking most models, with an attack success rate below 8% for all models. In contrast, adversarial perturbations achieve a high success rate (∼90%) in all cases at ∼60dB SNR. This shows that carefully crafted perturbations, even at small magnitudes, can cause the models to produce unsafe responses[13]. Therefore, more sophisticated speech-specific attacks designed to produce imperceptible perturbations (Schönherr et al., 2018) are not necessary.
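For concreteness, the sketch below contrasts the two perturbation types: white noise scaled to a fixed SNR (the random baseline) and a projected-gradient-style sample-specific attack constrained to the same SNR budget. This is a minimal illustrative sketch, not the exact implementation used in our experiments; `model`, `loss_fn`, `target`, the step size, and the iteration count are placeholders that assume a differentiable end-to-end SLM forward pass.

```python
import torch

def project_to_snr(clean: torch.Tensor, delta: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Rescale `delta` so the signal-to-noise ratio w.r.t. `clean` is at least `snr_db`."""
    clean_power = clean.pow(2).mean()
    max_noise_power = clean_power / (10.0 ** (snr_db / 10.0))
    noise_power = delta.pow(2).mean().clamp_min(1e-12)
    scale = torch.sqrt(max_noise_power / noise_power).clamp(max=1.0)
    return delta * scale

def random_noise_baseline(audio: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Baseline: add white Gaussian noise scaled to the target SNR."""
    noise = torch.randn_like(audio)
    clean_power = audio.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp_min(1e-12)
    noise = noise * torch.sqrt(clean_power / (10.0 ** (snr_db / 10.0)) / noise_power)
    return audio + noise

def pgd_jailbreak(model, loss_fn, audio: torch.Tensor, target, snr_db: float = 60.0,
                  n_iters: int = 50, step_size: float = 1e-3) -> torch.Tensor:
    """Sample-specific white-box attack (sketch): optimize an additive perturbation that
    minimizes `loss_fn` (e.g., cross-entropy against a target unsafe response) while
    staying within an SNR budget."""
    delta = torch.zeros_like(audio, requires_grad=True)
    for _ in range(n_iters):
        loss = loss_fn(model(audio + delta), target)
        loss.backward()
        with torch.no_grad():
            delta -= step_size * delta.grad.sign()             # signed-gradient descent step
            delta.copy_(project_to_snr(audio, delta, snr_db))  # keep perturbation within the SNR budget
        delta.grad.zero_()
    return (audio + delta).detach()
```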
In Figure 4, we plot the cumulative proportion of successful attacks as a function of the number of attack iterations. Different models exhibit varying levels of susceptibility to adversarial jailbreaking attacks. For example, 80% of the successful attacks require fewer than 20 iterations for the Mistral-based models, whereas attacks on the Flan-T5-based models require up to 40 iterations.
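Curves of this kind can be derived from the iteration at which each attack first succeeds. A minimal sketch, assuming `None` marks attacks that never succeed within the iteration budget:

```python
import numpy as np

def cumulative_success_curve(iters_to_success, max_iters=50):
    """Proportion of *successful* attacks that needed at most k iterations, for k = 1..max_iters.
    `iters_to_success` holds the first successful iteration per sample, or None if the
    attack never succeeded within the budget."""
    succeeded = np.array([k for k in iters_to_success if k is not None])
    n = max(len(succeeded), 1)  # avoid division by zero when nothing succeeds
    return np.array([(succeeded <= k).sum() / n for k in range(1, max_iters + 1)])

# Example: curve = cumulative_success_curve([3, 17, None, 42, 8]); curve[19] is the
# proportion of successful attacks that needed at most 20 iterations.
```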
Authors:
(1) Raghuveer Peri, AWS AI Labs, Amazon and with Equal Contributions ([email protected]);
(2) Sai Muralidhar Jayanthi, AWS AI Labs, Amazon and with Equal Contributions;
(3) Srikanth Ronanki, AWS AI Labs, Amazon;
(4) Anshu Bhatia, AWS AI Labs, Amazon;
(5) Karel Mundnich, AWS AI Labs, Amazon;
(6) Saket Dingliwal, AWS AI Labs, Amazon;
(7) Nilaksh Das, AWS AI Labs, Amazon;
(8) Zejiang Hou, AWS AI Labs, Amazon;
(9) Goeric Huybrechts, AWS AI Labs, Amazon;
(10) Srikanth Vishnubhotla, AWS AI Labs, Amazon;
(11) Daniel Garcia-Romero, AWS AI Labs, Amazon;
(12) Sundararajan Srinivasan, AWS AI Labs, Amazon;
(13) Kyu J Han, AWS AI Labs, Amazon;
(14) Katrin Kirchhoff, AWS AI Labs, Amazon.
[12] We study the effect of excluding general instruction tuning data for SLM training in Appendix A.4.