Table of Links
Part 1: Abstract & Introduction
Part 2: Background
Part 3: Attacks & Countermeasures
Part 4: Experimental Setup
Part 5: Datasets & Evaluation
Part 6: Attack, Countermeasure Parameters, & Baseline: Random Perturbations
Part 7: Results & Discussion
Part 8: Transfer Attacks & Countermeasures
Part 9: Conclusion, Limitations, & Ethics Statement
Part 10: Appendix: Audio Encoder Pre-training & Evaluation
Part 11: Appendix: Cross-prompt attacks, Training Data Ablations, & Impact of random noise on helpfulness
Part 12: Appendix: Adaptive attacks & Qualitative Examples
A.6 Adaptive attacks
In this section, we report results for adaptive attacks, in which the attacker has full knowledge of any defense mechanism employed in the system. We use α=0.0001 (Eq. 1), as we found that the attacker needs a larger step size to produce successful attacks in the presence of a defense. From Table 10, we see that attacks become less successful when a defense is in place. Moreover, the adaptive attacker must add much more perceptible perturbations (lower average SPR) to succeed. This shows that even a simple pre-processing defense can provide some degree of robustness against adversarial attacks.
Figure 5 further shows that the presence of a defense makes attacks less effective under limited attack budgets. At a budget of, say, T=50 iterations, only 60% of attacks succeed against the system with the TDNF defense, compared to ∼80% against the undefended system. Note, however, that these attacks were performed with a limited overall budget of T=100 iterations; a malicious actor with a larger attack budget could potentially achieve a higher jailbreak rate.
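For concreteness, the sketch below illustrates one way such an adaptive attack can be implemented, in the style of projected gradient descent that back-propagates through the pre-processing defense. The function signatures, the additive-noise stand-in for the defense, and the L∞ clipping budget are illustrative assumptions, not the exact implementation used in the paper.

```python
# Hypothetical sketch of an adaptive, gradient-based attack (assumptions noted above).
# The adaptive attacker applies the defense inside the optimization loop and
# back-propagates through it, so the perturbation is tuned against the defended system.
import torch

def adaptive_attack(model, loss_fn, audio, target_ids, alpha=1e-4,
                    epsilon=2e-3, num_iters=100, noise_std=1e-3):
    """Iteratively perturb `audio` so the defended model emits `target_ids`."""
    delta = torch.zeros_like(audio, requires_grad=True)
    for _ in range(num_iters):
        # Pre-processing defense modeled here as additive random noise
        # (an assumption); the attacker optimizes through it.
        defended = (audio + delta) + noise_std * torch.randn_like(audio)
        loss = loss_fn(model(defended), target_ids)
        loss.backward()
        with torch.no_grad():
            # Signed-gradient step toward the target response, then clip the
            # perturbation to an (assumed) L-infinity budget.
            delta -= alpha * delta.grad.sign()
            delta.clamp_(-epsilon, epsilon)
        delta.grad.zero_()
    return (audio + delta).detach()
```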
A.7 Qualitative Examples
Table 11 compares the responses of an in-house and a public SLM on harmful examples. We showcase scenarios where the models produce safe but irrelevant content, as well as safe content that reflects a correct understanding of the input audio. Overall, the in-house SLM demonstrates better speech comprehension.
Table 12 compares the models on helpfulness questions spanning different aspects of usefulness. We notice that the in-house SLM SMistral-FT sometimes errs on the side of caution, indicating a healthy tension between harmlessness and helpfulness; we leave further exploration of such properties to future work. We also observe the importance of strong audio understanding in an SLM, as weak understanding can hurt usefulness, for example by mistaking entity names in the input audio.
Table 13 showcases examples of jailbroken responses and the corresponding SPRs. Without attack, the model produces safe responses that adhere to its safety training, yet even minimal perturbations can cause it to produce unsafe responses. In some cases (the last two examples), the model begins with a safety-aligned response but subsequently generates harmful content. This further demonstrates the need for thorough studies of model safety; a cursory analysis may be insufficient.
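As a point of reference, the snippet below shows one way the SPR values reported alongside these examples could be computed, assuming an SNR-style signal-to-perturbation ratio in dB; the paper's exact formula may differ, and the function name is hypothetical.

```python
# Hypothetical SPR (signal-to-perturbation ratio) computation in dB.
# Higher SPR => quieter, less perceptible perturbation.
import numpy as np

def spr_db(clean: np.ndarray, perturbed: np.ndarray) -> float:
    perturbation = perturbed - clean
    signal_power = np.sum(clean.astype(np.float64) ** 2)
    noise_power = np.sum(perturbation.astype(np.float64) ** 2) + 1e-12
    return 10.0 * np.log10(signal_power / noise_power)
```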
Authors:
(1) Raghuveer Peri, AWS AI Labs, Amazon (equal contribution) ([email protected]);
(2) Sai Muralidhar Jayanthi, AWS AI Labs, Amazon (equal contribution);
(3) Srikanth Ronanki, AWS AI Labs, Amazon;
(4) Anshu Bhatia, AWS AI Labs, Amazon;
(5) Karel Mundnich, AWS AI Labs, Amazon;
(6) Saket Dingliwal, AWS AI Labs, Amazon;
(7) Nilaksh Das, AWS AI Labs, Amazon;
(8) Zejiang Hou, AWS AI Labs, Amazon;
(9) Goeric Huybrechts, AWS AI Labs, Amazon;
(10) Srikanth Vishnubhotla, AWS AI Labs, Amazon;
(11) Daniel Garcia-Romero, AWS AI Labs, Amazon;
(12) Sundararajan Srinivasan, AWS AI Labs, Amazon;
(13) Kyu J Han, AWS AI Labs, Amazon;
(14) Katrin Kirchhoff, AWS AI Labs, Amazon.