Table of Links
Part 1: Abstract & Introduction
Part 2: Background
Part 3: Attacks & Countermeasures
Part 4: Experimental Setup
Part 5: Datasets & Evaluation
Part 6: Attack, Countermeasure Parameters, & Baseline: Random Perturbations
Part 7: Results & Discussion
Part 8: Transfer Attacks & Countermeasures
Part 9: Conclusion, Limitations, & Ethics Statement
Part 10: Appendix: Audio Encoder Pre-training & Evaluation
Part 11: Appendix: Cross-prompt attacks, Training Data Ablations, & Impact of random noise on helpfulness
Part 12: Appendix: Adaptive attacks & Qualitative Examples
5. Results & Discussion
In this section, we first analyze the safety alignment of several SLMs, then present the results of sample-specific and transfer-based attacks, and finally demonstrate the effectiveness of the TDNF defense.
5.1 Safety-aligned SLMs
We compare the efficacies of different SLMs trained using the SpeechVerse architecture against the public SLM SpeechGPT (Zhang et al., 2023) in Table 2. In addition, we compare the out-of-the-box performance of text-only pre-trained LLMs, as well as fine-tuned Flan-T5-XL (3B) and Mistral-7B LLMs that were safety aligned with the textual form of the Spoken QA data.
Our results demonstrate that our SLM models outperform public models and closely match the best text-only LLMs on safety and relevance. As hypothesized, SLM models pre-adapted with ASR match or outperform their counterparts on all metrics, demonstrating better recognition of the speech modality. We observe that the helpfulness of the SLM models is bounded by the abilities of the pre-trained LLM, even though they are tuned with general instruction data during cross-modal adaptation. Furthermore, our training mechanism retains almost all of the helpfulness of the pre-trained LLMs while additionally infusing spoken instruction understanding and safety alignment into the SLMs.[12] Compared to SpeechGPT (Zhang et al., 2023), our best model shows more than a 40% improvement in safety and 20% in helpfulness, demonstrating better recognition quality and speech-instruction-following capability. Although other public models such as LLaSM (Shu et al., 2023) and Pengi (Deshmukh et al., 2023) can also perceive speech instructions, we found them to be insufficiently safety aligned and therefore left them out of our benchmarking.
5.2 Sample-specific white-box attacks
In Table 3, we present results of random noise perturbations at two SNR values, along with sample-specific adversarial attacks on four in-house trained SLM models. We report results only on the samples that were originally found to be safe for each model (as reported in Table 2) out of the 360 audios considered. Random perturbations demonstrate limited effectiveness in jailbreaking most models, with an attack success rate below 8% for all models. In contrast, adversarial perturbations achieve a high success rate (∼90%) in all cases at ∼60dB SNR. This shows that carefully crafted perturbations, even at small magnitudes, can cause the models to produce unsafe responses[13]. Therefore, more sophisticated speech-specific attacks designed to produce imperceptible perturbations (Schönherr et al., 2018) are not necessary.
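For concreteness, the sketch below contrasts the two perturbation types: white noise scaled to a fixed SNR (the random baseline) and a projected-gradient-style sample-specific attack constrained to the same SNR budget. This is a minimal illustrative sketch, not the exact implementation used in our experiments; `model`, `loss_fn`, `target`, the step size, and the iteration count are placeholders that assume a differentiable end-to-end SLM forward pass.

```python
import torch

def project_to_snr(clean: torch.Tensor, delta: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Rescale `delta` so the signal-to-noise ratio w.r.t. `clean` is at least `snr_db`."""
    clean_power = clean.pow(2).mean()
    max_noise_power = clean_power / (10.0 ** (snr_db / 10.0))
    noise_power = delta.pow(2).mean().clamp_min(1e-12)
    scale = torch.sqrt(max_noise_power / noise_power).clamp(max=1.0)
    return delta * scale

def random_noise_baseline(audio: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Baseline: add white Gaussian noise scaled to the target SNR."""
    noise = torch.randn_like(audio)
    clean_power = audio.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp_min(1e-12)
    noise = noise * torch.sqrt(clean_power / (10.0 ** (snr_db / 10.0)) / noise_power)
    return audio + noise

def pgd_jailbreak(model, loss_fn, audio: torch.Tensor, target, snr_db: float = 60.0,
                  n_iters: int = 50, step_size: float = 1e-3) -> torch.Tensor:
    """Sample-specific white-box attack (sketch): optimize an additive perturbation that
    minimizes `loss_fn` (e.g., cross-entropy against a target unsafe response) while
    staying within an SNR budget."""
    delta = torch.zeros_like(audio, requires_grad=True)
    for _ in range(n_iters):
        loss = loss_fn(model(audio + delta), target)
        loss.backward()
        with torch.no_grad():
            delta -= step_size * delta.grad.sign()             # signed-gradient descent step
            delta.copy_(project_to_snr(audio, delta, snr_db))  # keep perturbation within the SNR budget
        delta.grad.zero_()
    return (audio + delta).detach()
```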
In Figure 4, we plot the cumulative proportion of successful attacks as a function of the number of attack iterations. Different models exhibit varying levels of susceptibility to adversarial jailbreaking attacks. For example, 80% of the successful attacks require fewer than 20 iterations for the Mistral-based models, whereas attacks on the Flan-T5-based models require up to 40 iterations.
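Curves of this kind can be derived from the iteration at which each attack first succeeds. A minimal sketch, assuming `None` marks attacks that never succeed within the iteration budget:

```python
import numpy as np

def cumulative_success_curve(iters_to_success, max_iters=50):
    """Proportion of *successful* attacks that needed at most k iterations, for k = 1..max_iters.
    `iters_to_success` holds the first successful iteration per sample, or None if the
    attack never succeeded within the budget."""
    succeeded = np.array([k for k in iters_to_success if k is not None])
    n = max(len(succeeded), 1)  # avoid division by zero when nothing succeeds
    return np.array([(succeeded <= k).sum() / n for k in range(1, max_iters + 1)])

# Example: curve = cumulative_success_curve([3, 17, None, 42, 8]); curve[19] is the
# proportion of successful attacks that needed at most 20 iterations.
```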
Authors:
(1) Raghuveer Peri, AWS AI Labs, Amazon and with Equal Contributions ([email protected]);
(2) Sai Muralidhar Jayanthi, AWS AI Labs, Amazon and with Equal Contributions;
(3) Srikanth Ronanki, AWS AI Labs, Amazon;
(4) Anshu Bhatia, AWS AI Labs, Amazon;
(5) Karel Mundnich, AWS AI Labs, Amazon;
(6) Saket Dingliwal, AWS AI Labs, Amazon;
(7) Nilaksh Das, AWS AI Labs, Amazon;
(8) Zejiang Hou, AWS AI Labs, Amazon;
(9) Goeric Huybrechts, AWS AI Labs, Amazon;
(10) Srikanth Vishnubhotla, AWS AI Labs, Amazon;
(11) Daniel Garcia-Romero, AWS AI Labs, Amazon;
(12) Sundararajan Srinivasan, AWS AI Labs, Amazon;
(13) Kyu J Han, AWS AI Labs, Amazon;
(14) Katrin Kirchhoff, AWS AI Labs, Amazon.
[12] We study the effect of excluding general instruction tuning data for SLM training in Appendix A.4.