Table of Links
Part 1: Abstract & Introduction
Part 2: Background
Part 3: Attacks & Countermeasures
Part 4: Experimental Setup
Part 5: Datasets & Evaluation
Part 6: Attack, Countermeasure Parameters, & Baseline: Random Perturbations
Part 7: Results & Discussion
Part 8: Transfer Attacks & Countermeasures
Part 9: Conclusion, Limitations, & Ethics Statement
Part 10: Appendix: Audio Encoder Pre-training & Evaluation
Part 11: Appendix: Cross-prompt attacks, Training Data Ablations, & Impact of random noise on helpfulness
Part 12: Appendix: Adaptive attacks & Qualitative Examples
A.6 Adaptive attacks
In this section, we report results for adaptive attacks, in which the attacker has full knowledge of any defense mechanism employed in the system. We use α=0.0001 (Eq. 1), as we found that the attacker needs a larger step size to produce successful attacks in the presence of a defense. From Table 10, we see that attacks become less successful when a defense is in place. Moreover, the adaptive attacker must add much more perceptible perturbations (lower average SPR) to succeed. This shows that even a simple pre-processing defense can provide some degree of robustness against adversarial attacks.
Figure 5 further shows that the presence of a defense makes attacks less effective under limited attack budgets. At a budget of, say, T=50 iterations, only 60% of attacks succeed against the system with the TDNF defense, compared to ∼80% against the undefended system. Note, however, that these attacks were performed with a limited overall budget of T=100 iterations; a malicious actor with a larger attack budget could potentially achieve a higher jailbreak rate.
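For concreteness, the sketch below illustrates one way such an adaptive attack can be implemented, in the style of projected gradient descent that back-propagates through the pre-processing defense. The function signatures, the additive-noise stand-in for the defense, and the L∞ clipping budget are illustrative assumptions, not the exact implementation used in the paper.

```python
# Hypothetical sketch of an adaptive, gradient-based attack (assumptions noted above).
# The adaptive attacker applies the defense inside the optimization loop and
# back-propagates through it, so the perturbation is tuned against the defended system.
import torch

def adaptive_attack(model, loss_fn, audio, target_ids, alpha=1e-4,
                    epsilon=2e-3, num_iters=100, noise_std=1e-3):
    """Iteratively perturb `audio` so the defended model emits `target_ids`."""
    delta = torch.zeros_like(audio, requires_grad=True)
    for _ in range(num_iters):
        # Pre-processing defense modeled here as additive random noise
        # (an assumption); the attacker optimizes through it.
        defended = (audio + delta) + noise_std * torch.randn_like(audio)
        loss = loss_fn(model(defended), target_ids)
        loss.backward()
        with torch.no_grad():
            # Signed-gradient step toward the target response, then clip the
            # perturbation to an (assumed) L-infinity budget.
            delta -= alpha * delta.grad.sign()
            delta.clamp_(-epsilon, epsilon)
        delta.grad.zero_()
    return (audio + delta).detach()
```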
A.7 Qualitative Examples
Table 11 compares the responses of an in-house and a public SLM on harmful examples. We showcase scenarios where the models produce safe but irrelevant content, as well as safe content that reflects a correct understanding of the input audio. Overall, the in-house SLM demonstrates better speech comprehension.
Table 12 compares the models on helpfulness questions spanning different aspects of usefulness. We notice that the in-house SLM SMistral-FT sometimes errs on the side of caution, indicating a healthy tension between harmlessness and helpfulness; we leave further exploration of such properties to future work. We also observe the importance of strong audio understanding in an SLM, as weak understanding can hurt usefulness, for example by mistaking entity names in the input audio.
Table 13 showcases examples of jailbroken responses and the corresponding SPRs. Without attack, the model produces safe responses that adhere to its safety training, yet even minimal perturbations can cause it to produce unsafe responses. In some cases (the last two examples), the model begins with a safety-aligned response but subsequently generates harmful content. This further demonstrates the need for thorough studies of model safety; a cursory analysis may be insufficient.
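As a point of reference, the snippet below shows one way the SPR values reported alongside these examples could be computed, assuming an SNR-style signal-to-perturbation ratio in dB; the paper's exact formula may differ, and the function name is hypothetical.

```python
# Hypothetical SPR (signal-to-perturbation ratio) computation in dB.
# Higher SPR => quieter, less perceptible perturbation.
import numpy as np

def spr_db(clean: np.ndarray, perturbed: np.ndarray) -> float:
    perturbation = perturbed - clean
    signal_power = np.sum(clean.astype(np.float64) ** 2)
    noise_power = np.sum(perturbation.astype(np.float64) ** 2) + 1e-12
    return 10.0 * np.log10(signal_power / noise_power)
```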
Authors:
(1) Raghuveer Peri, AWS AI Labs, Amazon (equal contribution) ([email protected]);
(2) Sai Muralidhar Jayanthi, AWS AI Labs, Amazon (equal contribution);
(3) Srikanth Ronanki, AWS AI Labs, Amazon;
(4) Anshu Bhatia, AWS AI Labs, Amazon;
(5) Karel Mundnich, AWS AI Labs, Amazon;
(6) Saket Dingliwal, AWS AI Labs, Amazon;
(7) Nilaksh Das, AWS AI Labs, Amazon;
(8) Zejiang Hou, AWS AI Labs, Amazon;
(9) Goeric Huybrechts, AWS AI Labs, Amazon;
(10) Srikanth Vishnubhotla, AWS AI Labs, Amazon;
(11) Daniel Garcia-Romero, AWS AI Labs, Amazon;
(12) Sundararajan Srinivasan, AWS AI Labs, Amazon;
(13) Kyu J Han, AWS AI Labs, Amazon;
(14) Katrin Kirchhoff, AWS AI Labs, Amazon.