Table of Links
Part 1: Abstract & Introduction
Part 2: Background
Part 3: Attacks & Countermeasures
Part 4: Experimental Setup
Part 5: Datasets & Evaluation
Part 6: Attack, Countermeasure Parameters, & Baseline: Random Perturbations
Part 7: Results & Discussion
Part 8: Transfer Attacks & Countermeasures
Part 9: Conclusion, Limitations, & Ethics Statement
Part 10: Appendix: Audio Encoder Pre-training & Evaluation
Part 11: Appendix: Cross-prompt attacks, Training Data Ablations, & Impact of random noise on helpfulness
Part 12: Appendix: Adaptive attacks & Qualitative Examples
APPENDIX
A.1 Audio Encoder Pre-training
Our audio encoder is a 24-layer Conformer model with a feature dimension of 768 and 8 attention heads, for a total of 300M parameters. We adopt the BEST-RQ (Chiu et al., 2022) method, which pre-trains the model to predict masked speech signals using labels generated by a random-projection quantizer. The quantizer projects the speech inputs with a randomly initialized matrix and performs a nearest-neighbor lookup in a randomly initialized codebook; neither the projection matrix nor the codebook is updated during pre-training. We build an internal pre-training dataset containing 300K hours of English audio. Pre-training uses a mask span of 10 with a total effective masking ratio of about 40%. The learning rate follows the transformer schedule with a peak value of 0.0005 and a warm-up of 50K steps, and we use the AdamW optimizer with a weight decay of 0.01. Since the encoder reduces the temporal dimension by a factor of 4, the random-projection quantization stacks every 4 frames before projection. We use 16 individual codebooks, each with a vocabulary size of 8192 and a dimension of 16. The model is pre-trained for 500K steps in total.
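As a concrete illustration of the random-projection quantizer described above, the following is a minimal sketch (assuming PyTorch; the per-frame feature dimension, tensor shapes, and all names are our own illustrative assumptions, not the authors' implementation). It stacks every 4 frames, projects them with a frozen random matrix, and assigns each of the 16 codebooks a target id via a nearest-neighbor lookup.

```python
import torch

torch.manual_seed(0)

FEAT_DIM = 80          # per-frame speech feature dimension (assumed, e.g. log-mel)
STACK = 4              # stack 4 frames to match the encoder's 4x temporal reduction
NUM_CODEBOOKS = 16     # number of individual codebooks (from A.1)
CODEBOOK_SIZE = 8192   # vocabulary size of each codebook (from A.1)
CODE_DIM = 16          # dimension of each codebook entry (from A.1)

# Randomly initialized projection matrix and codebooks; neither is updated during pre-training.
projection = torch.randn(FEAT_DIM * STACK, NUM_CODEBOOKS * CODE_DIM)
codebooks = torch.randn(NUM_CODEBOOKS, CODEBOOK_SIZE, CODE_DIM)

def bestrq_labels(features: torch.Tensor) -> torch.Tensor:
    """features: (T, FEAT_DIM) frames -> (T // STACK, NUM_CODEBOOKS) target ids."""
    T = features.shape[0] // STACK * STACK
    stacked = features[:T].reshape(-1, FEAT_DIM * STACK)          # (N, FEAT_DIM * STACK)
    projected = stacked @ projection                              # (N, NUM_CODEBOOKS * CODE_DIM)
    projected = projected.reshape(-1, NUM_CODEBOOKS, 1, CODE_DIM)
    # Nearest-neighbor lookup per codebook (plain Euclidean distance here;
    # the published method may additionally l2-normalize before the lookup).
    dists = ((projected - codebooks.unsqueeze(0)) ** 2).sum(-1)   # (N, NUM_CODEBOOKS, CODEBOOK_SIZE)
    return dists.argmin(-1)                                       # (N, NUM_CODEBOOKS)

# Example: 200 input frames -> 50 label vectors, one per 4-frame block.
labels = bestrq_labels(torch.randn(200, FEAT_DIM))
print(labels.shape)  # torch.Size([50, 16])
```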
A.2 Evaluation
We used Claude 2.1 as a tool to automatically annotate the responses produced by a model for safety and relevance. To determine the accuracy of the generated labels, the first 4 authors manually annotated 100 such question-response pairs (obtained from our internal models) for safety and relevance, and we used this set as a “ground-truth” labelled set.
We use the following prompt template with Claude 2.1 to obtain safety annotations for SLMs.
Similarly, we use the following prompt template to obtain relevance annotations.
We experimented with several prompts separately for the safety and relevance annotation tasks using in-context examples, and chose the prompts that gave reasonable annotation performance (F1 score above 80%) compared to the aforementioned ground-truth labels. We follow a similar strategy to obtain the helpfulness annotations.
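To make the selection criterion concrete, a minimal sketch of that check might look as follows (assuming scikit-learn and binary labels; the variable names and the loading of the 100 manually labelled pairs are illustrative assumptions, not the authors' code):

```python
from sklearn.metrics import f1_score

def prompt_is_acceptable(ground_truth, judge_labels, threshold=0.80):
    """Keep a candidate judge prompt only if its F1 score against the
    manually annotated ground-truth labels exceeds the 80% bar."""
    return f1_score(ground_truth, judge_labels) >= threshold

# Illustrative usage: binary labels (1 = safe) for the same question-response pairs.
human = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
judge = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]
print(prompt_is_acceptable(human, judge))  # True (F1 ~ 0.93 here)
```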
Given these prompt templates to automatically obtain the safety, relevance and helpfulness labels, we define the evaluation metrics as follows:
Safety rate: The proportion of questions for which the generated response is labelled as safe. Higher values indicate better safety alignment of the models.
Relevance rate: The proportion of questions for which the generated response is labelled as relevant to the question. Higher values indicate better alignment between the question and response.
Helpfulness rate: The proportion of questions for which the model produces useful responses. Higher values indicate better utility of the models.
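For completeness, each of these rates reduces to a simple proportion over the judge labels; a minimal sketch (assuming binary labels, with made-up data purely for illustration) is:

```python
def rate(labels):
    """Proportion of positive (1) judge labels, expressed as a percentage."""
    return 100.0 * sum(labels) / len(labels)

# Illustrative usage with made-up labels for five responses (1 = judged safe).
safe_labels = [1, 1, 0, 1, 1]
print(f"Safety rate: {rate(safe_labels):.1f}%")  # -> Safety rate: 80.0%
```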
Authors:
(1) Raghuveer Peri, AWS AI Labs, Amazon, with Equal Contribution ([email protected]);
(2) Sai Muralidhar Jayanthi, AWS AI Labs, Amazon, with Equal Contribution;
(3) Srikanth Ronanki, AWS AI Labs, Amazon;
(4) Anshu Bhatia, AWS AI Labs, Amazon;
(5) Karel Mundnich, AWS AI Labs, Amazon;
(6) Saket Dingliwal, AWS AI Labs, Amazon;
(7) Nilaksh Das, AWS AI Labs, Amazon;
(8) Zejiang Hou, AWS AI Labs, Amazon;
(9) Goeric Huybrechts, AWS AI Labs, Amazon;
(10) Srikanth Vishnubhotla, AWS AI Labs, Amazon;
(11) Daniel Garcia-Romero, AWS AI Labs, Amazon;
(12) Sundararajan Srinivasan, AWS AI Labs, Amazon;
(13) Kyu J Han, AWS AI Labs, Amazon;
(14) Katrin Kirchhoff, AWS AI Labs, Amazon.