Table of Links
Part 1: Abstract & Introduction
Part 2: Background
Part 3: Attacks & Countermeasures
Part 4: Experimental Setup
Part 5: Datasets & Evaluation
Part 6: Attack, Countermeasure Parameters, & Baseline: Random Perturbations
Part 7: Results & Discussion
Part 8: Transfer Attacks & Countermeasures
Part 9: Conclusion, Limitations, & Ethics Statement
Part 10: Appendix: Audio Encoder Pre-training & Evaluation
Part 11: Appendix: Cross-prompt attacks, Training Data Ablations, & Impact of random noise on helpfulness
Part 12: Appendix: Adaptive attacks & Qualitative Examples
4.3 Datasets
Training Data We use 2.5K hours of an in-house ASR speech-text parallel corpus for the modality pre-adaptation stage, covering a mix of accents, speakers, sampling rates, and background noise conditions. We limit ourselves to 2.5K hours of ASR training data due to compute constraints, but publicly available ASR corpora, which exist in much larger quantities, can serve as a drop-in replacement. Since there is no publicly available dataset for the Spoken QA task with speech-instruction and textual-response pairs, similar to Zhang et al. (2023) we construct a training set of 160K speech-text pairs, amounting to 150 hours of audio content, using publicly available text-to-text instruction tuning datasets and an in-house text-to-speech (TTS) system. In particular, we combine TTS data of general instructions from Alpaca (Taori et al., 2023) and safety-aligned instructions from Moss (Sun et al., 2023) to train our SLMs. Given the simplicity of our dataset construction, publicly available TTS services such as Amazon Polly[5] can also be used to create the training pairs.
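As a rough illustration of how such speech-text training pairs could be produced with a public TTS service, the sketch below uses Amazon Polly via boto3. The voice, sample rate, file layout, and input JSON format are illustrative assumptions, not the configuration of our in-house TTS system.

```python
# Minimal sketch: synthesize speech-text pairs from Alpaca-style text
# instructions with Amazon Polly. Voice, sample rate, and paths are
# assumptions for illustration only.
import json
import boto3

polly = boto3.client("polly")

def synthesize_pair(instruction: str, out_path: str, sample_rate: str = "16000"):
    """Convert one textual instruction into a raw PCM waveform file."""
    response = polly.synthesize_speech(
        Text=instruction,
        OutputFormat="pcm",      # 16-bit signed little-endian PCM
        SampleRate=sample_rate,  # Polly PCM supports "8000" and "16000"
        VoiceId="Joanna",        # any en-US voice would do
    )
    with open(out_path, "wb") as f:
        f.write(response["AudioStream"].read())
    return {"audio": out_path, "text": instruction}

# Example: build pairs from a JSON list of {"instruction": ...} records
# (a hypothetical dump of the text instruction dataset).
with open("alpaca_instructions.json") as f:
    instructions = json.load(f)

pairs = [
    synthesize_pair(item["instruction"], f"audio/{i:06d}.pcm")
    for i, item in enumerate(instructions)
]
```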
Evaluation Data To study the adversarial robustness of SLMs to harmful questions, we obtain spoken data for a carefully curated list of harmful textual questions. Specifically, we draw 390 harmful questions presented by Shen et al. (2024)[6], spanning 13 categories such as physical harm, fraud, and illegal activity.[7] To keep only the questions that are unambiguously harmful, we retain those that the top two LLMs on a public leaderboard[8] declined to answer, resulting in a set of 180 questions. Finally, we collect human-read speech (from 15 unique en-US speakers) at both 8kHz and 16kHz sampling rates using these textual questions as transcripts, for a total of 360 audio recordings.
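A minimal sketch of this filtering step is shown below. The column names and the precomputed refusal flags are hypothetical placeholders; the refusal decisions of the two leaderboard LLMs are assumed to have been collected separately.

```python
# Sketch: derive the unambiguously harmful question set from the CSV of
# Shen et al. (2024). Column names ("category", "refused_llm1",
# "refused_llm2") are assumptions, not the file's actual schema.
import pandas as pd

questions = pd.read_csv("questions.csv")  # 390 questions across 13 categories

# Per footnote [7], the pornography category is not used.
questions = questions[questions["category"] != "pornography"]

# Keep only questions that both top leaderboard LLMs declined to answer;
# the refusal flags are assumed to be precomputed offline.
harmful = questions[questions["refused_llm1"] & questions["refused_llm2"]]
print(len(harmful))  # expected: 180
```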
To study the trade-off between helpfulness and harmlessness in SLMs, and to evaluate their usefulness on harmless questions, we construct another set of spoken questions that does not overlap with the 360 harmful recordings above. Specifically, we select 100 questions at random from a publicly available textual instruction tuning dataset[9] covering categories such as code generation, logical reasoning, math, and text re-writing. We obtain spoken versions of these questions using the in-house TTS system mentioned earlier.
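The sampling step could look like the sketch below, using the Hugging Face dataset from footnote [9]; the split and field name are assumptions about its schema, and the spoken forms would then be produced with the same TTS pipeline sketched earlier.

```python
# Sketch: sample 100 harmless instructions for the helpfulness set.
# The "train" split and "instruction" field name are assumptions.
from datasets import load_dataset

ds = load_dataset("ignmilton/ign_clean_instruct_dataset_500k", split="train")
harmless = ds.shuffle(seed=0).select(range(100))

# Each sampled question is then converted to speech (e.g. via the
# synthesize_pair sketch above) to obtain the spoken helpfulness set.
texts = [row["instruction"] for row in harmless]
```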
4.4 Evaluation
Our preliminary analysis of SLM responses revealed that they are prone to misrecognizing speech inputs, which can lead to irrelevant outputs that may or may not be safe. On the other hand, even LLMs fail to generate helpful responses to some harmless questions due to limitations such as gaps in their pre-trained knowledge, and this behaviour transfers to SLMs. Table 1 shows such examples with the corresponding labels. Thus, to comprehensively assess SLMs, we benchmark them on three metrics: safety, relevance, and helpfulness. We use the set of 360 spoken harmful questions to measure safety and relevance, and the set of 100 spoken harmless questions to measure helpfulness.
To handle the evaluation of the large number of responses from different SLMs, we employ Claude 2.1 for automatic labeling, owing to its superior performance on public safety leaderboards.[10] We use Claude 2.1 off-the-shelf without any further tuning or customization. This approach lets us compare different SLMs effectively and identify jailbroken instances on par with crowd-sourcing (Alizadeh et al., 2023; Shen et al., 2024; Tan et al., 2024).
In our evaluation approach, we first manually annotate 100 question-answer pairs with safety and relevance labels. Next, we experiment with different prompts for the Claude 2.1 model and with in-context learning (Min et al., 2022) examples to predict these labels. For each combination of prompt and in-context examples, we measure the F1 score between the predicted labels and the ground truth, and pick the combination with the highest score. We refer the reader to Appendix A.2 for more details on our evaluation methodology and metric definitions.
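A minimal sketch of this prompt-selection loop is given below. The `annotate_fn` argument is a hypothetical placeholder for a call to Claude 2.1 with a given prompt template and in-context examples; the "unsafe" positive label is likewise an assumption about the label scheme.

```python
# Sketch: choose the labeling prompt whose predictions best match the
# 100 manually annotated question-answer pairs.
from sklearn.metrics import f1_score

def select_best_prompt(prompt_candidates, qa_pairs, gold_labels, annotate_fn):
    """Return the prompt with the highest F1 against the gold annotations.

    annotate_fn(prompt, qa_pair) -> predicted label (e.g. "unsafe"/"safe");
    it stands in for a Claude 2.1 call with the given prompt template.
    """
    best_prompt, best_f1 = None, -1.0
    for prompt in prompt_candidates:
        preds = [annotate_fn(prompt, qa) for qa in qa_pairs]
        score = f1_score(gold_labels, preds, pos_label="unsafe")
        if score > best_f1:
            best_prompt, best_f1 = prompt, score
    return best_prompt, best_f1
```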
To quantify the attack success rate and the perceptibility of the attacks, we define two metrics: jailbreak success rate and signal-to-perturbation ratio. We consider a response jailbroken only if it is both unsafe and relevant, as determined by the Claude annotation tool. The jailbreak success rate captures the effectiveness of the attacker, with higher values denoting a more effective attack (i.e., higher model vulnerability).
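The sketch below illustrates how these two metrics could be computed. The signal-to-perturbation ratio is written here in the usual dB energy-ratio form analogous to SNR; this is an assumption about the exact formula, which is specified alongside the attack parameters.

```python
# Sketch: attack metrics. A response counts as jailbroken only when it is
# both unsafe and relevant according to the Claude-based annotation.
import numpy as np

def jailbreak_success_rate(labels):
    """labels: iterable of (is_unsafe, is_relevant) boolean pairs."""
    hits = [unsafe and relevant for unsafe, relevant in labels]
    return sum(hits) / len(hits)

def signal_to_perturbation_ratio_db(clean_audio: np.ndarray,
                                    perturbation: np.ndarray) -> float:
    """dB ratio of clean-signal energy to perturbation energy; higher values
    indicate a less perceptible adversarial perturbation."""
    return 10.0 * np.log10(np.sum(clean_audio ** 2) / np.sum(perturbation ** 2))
```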
Authors:
(1) Raghuveer Peri, AWS AI Labs, Amazon and with Equal Contributions ([email protected]);
(2) Sai Muralidhar Jayanthi, AWS AI Labs, Amazon and with Equal Contributions;
(3) Srikanth Ronanki, AWS AI Labs, Amazon;
(4) Anshu Bhatia, AWS AI Labs, Amazon;
(5) Karel Mundnich, AWS AI Labs, Amazon;
(6) Saket Dingliwal, AWS AI Labs, Amazon;
(7) Nilaksh Das, AWS AI Labs, Amazon;
(8) Zejiang Hou, AWS AI Labs, Amazon;
(9) Goeric Huybrechts, AWS AI Labs, Amazon;
(10) Srikanth Vishnubhotla, AWS AI Labs, Amazon;
(11) Daniel Garcia-Romero, AWS AI Labs, Amazon;
(12) Sundararajan Srinivasan, AWS AI Labs, Amazon;
(13) Kyu J Han, AWS AI Labs, Amazon;
(14) Katrin Kirchhoff, AWS AI Labs, Amazon.
[5] https://aws.amazon.com/polly/
[6] https://github.com/verazuo/jailbreak_llms/blob/main/data/questions.csv
[7] We utilize all but the pornography category.
[8] https://huggingface.co/spaces/AI-Secure/llm-trustworthy-leaderboard
[9] https://huggingface.co/datasets/ignmilton/ign_clean_instruct_dataset_500k