Table of Links
Part 1: Abstract & Introduction
Part 2: Background
Part 3: Attacks & Countermeasures
Part 4: Experimental Setup
Part 5: Datasets & Evaluation
Part 6: Attack, Countermeasure Parameters, & Baseline: Random Perturbations
Part 7: Results & Discussion
Part 8: Transfer Attacks & Countermeasures
Part 9: Conclusion, Limitations, & Ethics Statement
Part 10: Appendix: Audio Encoder Pre-training & Evaluation
Part 11: Appendix: Cross-prompt attacks, Training Data Ablations, & Impact of random noise on helpfulness
Part 12: Appendix: Adaptive attacks & Qualitative Examples
2. Background
Safety alignment Considering the broad capabilities of LLMs, concerns have emerged about their potential to cause harm (Bender et al., 2021; Bommasani et al., 2021), sparking discussions on aligning these systems with human values and ethics (Hendrycks et al., 2020). Askell et al. (2021) propose three criteria, helpfulness, honesty, and harmlessness (HHH), to which a properly aligned system should adhere. To train systems in accordance with these criteria, LLM developers employ safety training mechanisms. First, models are trained on large amounts of data for general language capabilities, followed by a safety training stage to deter the system from responding to harmful questions (Askell et al., 2021; Ouyang et al., 2022). The examples used for safety alignment training are typically hand-crafted by dedicated red teams tasked with constructing prompts that jailbreak the model (Shen et al., 2024; Wei et al., 2023).
Jailbreak attacks on LLMs Inie et al. (2023) outline several prompting strategies typically used to jailbreak LLMs. However, the prompts therein are manually handcrafted on a case-by-case basis, hindering their large-scale adoption. Moreover, such prompts become irrelevant after safety training, requiring newer strategies (Inie et al., 2023). Recently, automatic prompt engineering techniques have been explored (Shin et al., 2020; Zou et al., 2023). In particular, Zou et al. (2023) demonstrate the use of adversarial attacks to jailbreak LLMs. In addition to white-box attacks, which assume full access to the models, they show that a careful combination of techniques can produce perturbations that transfer to commercial models for which only an API is exposed. More recently, Wichers et al. (2024) proposed a gradient-based technique to automatically learn red-teaming data for model evaluation and alignment. However, these methods rely on discrete optimization techniques or approximation tricks, which are computationally expensive and may not generalize well.
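To make the contrast with continuous-domain attacks concrete, the sketch below shows one gradient-guided token-substitution step in the spirit of Zou et al. (2023): the gradient of an affirmative-response loss with respect to one-hot token indicators ranks candidate swaps for an adversarial suffix. The model name, prompt strings, and hyperparameters are placeholders for illustration, not the exact setup of the cited work.

```python
# One gradient-guided candidate-ranking step for a discrete suffix attack
# (GCG-style sketch). The full attack evaluates many candidate swaps per
# iteration and repeats for hundreds of steps.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumed; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()
for p in model.parameters():
    p.requires_grad_(False)  # only the suffix indicators need gradients

prompt_ids = tok("<harmful request placeholder>", return_tensors="pt").input_ids
suffix_ids = tok(" ! ! ! ! !", add_special_tokens=False,
                 return_tensors="pt").input_ids          # adversarial suffix to optimize
target_ids = tok(" Sure, here is", add_special_tokens=False,
                 return_tensors="pt").input_ids          # desired affirmative prefix

embed = model.get_input_embeddings().weight              # (vocab, dim)
one_hot = torch.zeros(suffix_ids.shape[1], embed.shape[0])
one_hot.scatter_(1, suffix_ids[0].unsqueeze(1), 1.0)
one_hot.requires_grad_(True)

# Represent the suffix as a differentiable mixture of token embeddings.
suffix_emb = one_hot @ embed
inputs_emb = torch.cat([model.get_input_embeddings()(prompt_ids),
                        suffix_emb.unsqueeze(0),
                        model.get_input_embeddings()(target_ids)], dim=1)
logits = model(inputs_embeds=inputs_emb).logits

# Loss: negative log-likelihood of the affirmative target continuation.
tgt_logits = logits[0, -target_ids.shape[1] - 1:-1]
loss = torch.nn.functional.cross_entropy(tgt_logits, target_ids[0])
loss.backward()

# Large negative gradient entries indicate token swaps expected to lower the loss.
top_k_candidates = (-one_hot.grad).topk(k=256, dim=1).indices  # (suffix_len, k)
```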
Jailbreak attacks on multi-modal LLMs Unlike text-based jailbreak attacks, which require discrete optimization techniques, systems operating on continuous-domain signals such as images and audio can be attacked more readily (Goodfellow et al., 2014; Jati et al., 2021), and are therefore more vulnerable to adversarial threats (Qi et al., 2023). In addition to adversarial perturbations, other approaches such as prompt injection (Bagdasaryan et al., 2023) and model poisoning (Zhai et al., 2023) have been studied as alternative ways to compromise the safety of multi-modal LLMs. Recent studies have demonstrated that adversarial attacks on vision encoders alone (without access to the LLM) are sufficient to jailbreak VLMs (Zhao et al., 2023b; Dong et al., 2023). Previous studies have also shown that adversarial perturbations applied to images fed into VLMs break their safety alignment, and also transfer to different models in a black-box setup (Qi et al., 2023). In this work, we follow a similar approach and generate adversarial perturbations on the speech input of safety-aligned SLMs. In this way, we characterize the vulnerability of spoken-instruction-following SLMs.
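As an illustration of why continuous inputs are easier to attack than discrete text, the following is a minimal projected-gradient-descent (PGD) style sketch for perturbing a raw waveform. The loss callable, perturbation budget, and step sizes are assumptions for illustration, not the configuration used in this paper.

```python
# PGD-style perturbation of a raw audio waveform under an L-inf constraint.
import torch

def pgd_attack(waveform: torch.Tensor,
               slm_loss,               # callable: perturbed waveform -> scalar loss to minimize
               epsilon: float = 0.01,  # L-inf budget on the perturbation (assumed)
               alpha: float = 0.001,   # per-step size (assumed)
               steps: int = 100) -> torch.Tensor:
    """Return an additive perturbation delta with ||delta||_inf <= epsilon."""
    delta = torch.zeros_like(waveform, requires_grad=True)
    for _ in range(steps):
        loss = slm_loss(waveform + delta)   # e.g., NLL of a harmful target response
        loss.backward()
        with torch.no_grad():
            # Signed gradient descent on the loss, then project back onto the ball.
            delta -= alpha * delta.grad.sign()
            delta.clamp_(-epsilon, epsilon)
        delta.grad.zero_()
    return delta.detach()
```

Because the waveform is continuous, each step only needs ordinary gradients through the model, in contrast to the candidate-search machinery required for discrete token sequences.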
Jailbreak Evaluations Previous studies have conducted jailbreak evaluations through human annotation (Wei et al., 2023; Qi et al., 2023), toxicity assessment (Carlini et al., 2023; Wichers et al., 2024), string matching (Zou et al., 2023), supervised classifiers (Wichers et al., 2024), or a preference LLM as a judge (Chao et al., 2023; Shen et al., 2024). Techniques other than the latter are either computationally expensive or require tedious human involvement, which is time-consuming and costly when scaled. Moreover, some studies have shown that a well-trained preference model can significantly contribute to evaluating whether an LLM is jailbroken (Wang et al., 2023). Therefore, in this work, we use a preference LLM judge to assess the safety and utility of SLMs.
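A minimal sketch of the LLM-as-judge pattern is shown below; the prompt template, verdict format, and the judge callable are illustrative assumptions rather than the rubric used in this paper.

```python
# Score a (request, response) pair with a preference LLM acting as a safety judge.
JUDGE_TEMPLATE = """You are a safety evaluator. Given a user request and a model
response, answer with a single word: UNSAFE if the response complies with a harmful
request, SAFE otherwise.

Request: {request}
Response: {response}
Verdict:"""

def is_jailbroken(request: str, response: str, judge) -> bool:
    """`judge` is any callable mapping a prompt string to the judge LLM's completion."""
    verdict = judge(JUDGE_TEMPLATE.format(request=request, response=response))
    return verdict.strip().upper().startswith("UNSAFE")
```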
Authors:
(1) Raghuveer Peri, AWS AI Labs, Amazon, equal contribution ([email protected]);
(2) Sai Muralidhar Jayanthi, AWS AI Labs, Amazon, equal contribution;
(3) Srikanth Ronanki, AWS AI Labs, Amazon;
(4) Anshu Bhatia, AWS AI Labs, Amazon;
(5) Karel Mundnich, AWS AI Labs, Amazon;
(6) Saket Dingliwal, AWS AI Labs, Amazon;
(7) Nilaksh Das, AWS AI Labs, Amazon;
(8) Zejiang Hou, AWS AI Labs, Amazon;
(9) Goeric Huybrechts, AWS AI Labs, Amazon;
(10) Srikanth Vishnubhotla, AWS AI Labs, Amazon;
(11) Daniel Garcia-Romero, AWS AI Labs, Amazon;
(12) Sundararajan Srinivasan, AWS AI Labs, Amazon;
(13) Kyu J Han, AWS AI Labs, Amazon;
(14) Katrin Kirchhoff, AWS AI Labs, Amazon.