SpeechVerse: Unifying the Speech Encoder and LLM for Spoken QA

Table of Links
Part 1: Abstract and Introduction
Part 2: Background
Part 3: Attacks and Countermeasures
Part 4: Experimental Setup
Part 5: Datasets and Evaluation
Part 6: Attack and Countermeasure Parameters, Baseline: Random Perturbations
Part 7: Results and Discussion
Part 8: Transfer Attacks and Countermeasures
Part 9: Conclusion, Limitations, and Ethics Statement
Part 10: Appendix: Audio Encoder Training and Evaluation
Part 11: Appendix: Multiple Attacks, Training Data, and the Effect of Random Noise on Helpfulness
Part 12: Appendix: Adaptive Attacks and Qualitative Examples
4. Experimental Setup
4.1 Models
We illustrate the unified SLM architecture, called SpeechVerse, in Figure 3. It consists of two main components: an audio encoder and a large language model (LLM).
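To make the two-component layout concrete, the forward pass can be sketched as below. All module names, dimensions, and the toy embeddings are illustrative placeholders of ours, not the paper's actual implementation:

```python
# Minimal sketch of the two-component SLM layout: audio encoder -> downsampling
# projector -> LLM. Every name and dimension here is an illustrative placeholder.

def audio_encoder(frames):
    """Stub encoder: one 4-dim toy embedding per input audio frame."""
    return [[float(f)] * 4 for f in frames]

def downsample(embs, stride=2):
    """Stub projector: keep every `stride`-th frame to shorten the sequence."""
    return embs[::stride]

def slm_forward(frames, prompt_token_embs):
    """Prepend projected audio embeddings to the text-prompt embeddings,
    then hand the joint sequence to the (stubbed-out) LLM."""
    audio_embs = downsample(audio_encoder(frames))
    joint = audio_embs + prompt_token_embs
    return len(joint)  # stand-in for the LLM call

seq_len = slm_forward(frames=list(range(10)), prompt_token_embs=[[0.0] * 4] * 3)
print(seq_len)  # 10 frames -> 5 after downsampling, plus 3 prompt tokens = 8
```

The point of the sketch is only the data flow: audio is encoded, shortened, and concatenated in front of the text prompt before reaching the LLM.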
Language models We use two publicly available LLMs in our study: (1) Flan-T5-XL (Chung et al., 2023), with 3 billion parameters, and (2) Mistral-7B-Instruct, with 7 billion parameters. While both models can follow instructions, the latter matches or exceeds the performance of 13-billion-parameter models such as Llama-2 (Touvron et al., 2023). Notably, neither LLM is explicitly trained to be safe, so we train their SLM counterparts as-is and refer to them as S-FlanT5 and S-Mistral in this work. We also explicitly fine-tune Mistral on curated safety data and refer to its SLM counterpart as S-Mistral-FT.
We would like to note that some popular LLMs, such as ChatGPT[3] and Claude 2.1[4], do not support audio inputs off the shelf. Further fine-tuning of such text-only LLMs on paired audio-text data is necessary to enable the model to understand audio inputs, and this fine-tuning requires access to model gradients. We therefore resort to openly available open-source models in this work. Nevertheless, we demonstrate jailbreak attacks in both black-box as well as white-box settings.
4.2 Training
To enable SLMs to better understand audio inputs, a two-stage training paradigm is commonly adopted: pre-training adaptation followed by cross-modal instruction fine-tuning (Zhang et al., 2023; Shu et al., 2023). In this work, we study SLMs trained with the two-stage paradigm as well as with a single-stage paradigm that performs instruction fine-tuning directly for spoken QA. We use automatic speech recognition (ASR) as the intermediate pre-adaptation task. To our knowledge, ours is the first study to compare the efficacy of the two paradigms.
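The practical difference between the two paradigms is which parameter groups are updated in which phase. A minimal sketch, where the parameter-group names are our own illustrative labels rather than the paper's:

```python
# Which parameter groups are trainable under each paradigm.
# Group names are illustrative placeholders, not the paper's terminology.
PARAM_GROUPS = ["conv_downsampler", "encoder_layer_weights", "lora_adapters"]

def trainable_groups(paradigm, stage=None):
    """Return the parameter groups updated in a given training configuration."""
    if paradigm == "two_stage":
        if stage == 1:  # pre-adaptation on ASR data
            return ["conv_downsampler", "encoder_layer_weights"]
        if stage == 2:  # cross-modal instruction fine-tuning for spoken QA
            return ["conv_downsampler", "encoder_layer_weights", "lora_adapters"]
        raise ValueError("two_stage paradigm requires stage 1 or 2")
    if paradigm == "single_stage":  # everything trained at once, from scratch
        return list(PARAM_GROUPS)
    raise ValueError(f"unknown paradigm: {paradigm}")

print(trainable_groups("two_stage", stage=1))
print(trainable_groups("single_stage"))
```

Under this framing, the single-stage paradigm is simply stage 2 run from random initialization without the ASR warm-up.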
We reduce the computational cost associated with the long sequence length of the audio modality by applying downsampling convolution layers to the audio encoder outputs (see Figure 3). In the two-stage paradigm, the first stage trains the convolution layers and the audio encoder's layer weights on ASR data, and the second stage additionally fine-tunes randomly initialized LoRA adapters (Hu et al.). In the single-stage paradigm, all of the aforementioned parameters are trained from random initialization directly for spoken QA. S-FlanT5 and S-Mistral use Flan-T5-XL and Mistral-7B as their backbone LLMs, respectively.
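How much the convolutions shorten the audio sequence follows from the standard 1-D convolution output-length formula. A sketch using the kernel size 3 and strides 2 and 1 given in the implementation details, and assuming no padding (the padding choice is our assumption):

```python
def conv1d_out_len(n, kernel, stride, padding=0):
    """Standard 1-D convolution output length (floor division)."""
    return (n + 2 * padding - kernel) // stride + 1

def downsampled_len(n):
    """Two stacked convs: kernel 3 / stride 2, then kernel 3 / stride 1."""
    n = conv1d_out_len(n, kernel=3, stride=2)
    return conv1d_out_len(n, kernel=3, stride=1)

print(downsampled_len(1000))  # 1000 frames -> 499 -> 497
```

The stride-2 layer roughly halves the sequence the LLM must attend over, which is where the compute saving comes from; the stride-1 layer adds receptive field without further shortening.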
Although the primary focus of this work is the safety robustness of SLMs, fine-tuning SLMs on safety instruction data alone can lead to catastrophic forgetting of previously learned capabilities, in particular degrading the SLM's helpfulness on non-harmful instructions (Zhao et al., 2023a). We address this problem by adopting experience replay (Wu et al.).
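In this setting, experience replay amounts to mixing replayed examples from the original training distribution into each safety fine-tuning batch. A minimal sketch; the batch size and 20% replay fraction are arbitrary illustrations, not the paper's settings:

```python
import random

def replay_batch(safety_data, replay_data, batch_size=10, replay_frac=0.2, seed=0):
    """Compose a training batch: mostly safety examples, plus a fixed fraction
    of replayed general-instruction examples to mitigate forgetting."""
    rng = random.Random(seed)
    n_replay = int(batch_size * replay_frac)
    batch = rng.sample(safety_data, batch_size - n_replay)
    batch += rng.sample(replay_data, n_replay)
    rng.shuffle(batch)
    return batch

safety = [f"safety_{i}" for i in range(100)]
general = [f"general_{i}" for i in range(100)]
batch = replay_batch(safety, general)
print(len(batch), sum(x.startswith("general") for x in batch))  # 10 2
```

The replayed general-instruction examples keep gradients flowing through the model's original (non-safety) behavior, which is what counteracts catastrophic forgetting.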
Implementation details All models in this work are trained using PyTorch (Paszke et al., 2019) and the HuggingFace toolkit (Wolf et al., 2020). We stack two 1-D convolution layers, each with a kernel size of 3, and with strides of 2 and 1, respectively. During training, we use a batch size of 512 and the AdamW optimizer (Loshchilov and Hutter, 2019) with a learning rate of 5e-3. We use LoRA adapters with rank 16 and alpha 10.
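For reference, a LoRA adapter with rank r and scaling factor alpha modifies a frozen weight matrix W as W + (alpha/r)·B·A, so rank 16 with alpha 10 scales the low-rank update by 0.625. A toy stdlib sketch (the 2x2 shapes and rank-1 factors are illustrative only):

```python
def matmul(A, B):
    """Naive matrix multiply for nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def lora_update(W, A, B, rank=16, alpha=10):
    """Effective weight: W + (alpha / rank) * (B @ A)."""
    scale = alpha / rank  # 10 / 16 = 0.625 for the settings above
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(wr, dr)]
            for wr, dr in zip(W, delta)]

# Toy 2x2 example with rank-1 factors (the actual adapters use rank 16).
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]   # shape (out_dim, r)
A = [[1.0, 1.0]]     # shape (r, in_dim)
print(lora_update(W, A, B, rank=1, alpha=1))  # [[2.0, 1.0], [2.0, 3.0]]
```

Only A and B receive gradients during fine-tuning; W stays frozen, which is why LoRA is cheap to train and why the adapters can be randomly initialized per stage.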
Authors:
(1) Raghuveer Peri, AWS AI Labs, Amazon, with equal contributions ([email protected]);
(2) Sai Muralidhar Jayanthi, AWS AI Labs, Amazon, with equal contributions;
(3) Srikanth Ronanki, AWS AI Labs, Amazon;
(4) Anshu Bhatia, AWS AI Labs, Amazon;
(5) Karel Mundnich, AWS AI Labs, Amazon;
(6) Saket Dingliwal, AWS AI Labs, Amazon;
(7) Nilaksh Das, AWS AI Labs, Amazon;
(8) Zejiang Hou, AWS AI Labs, Amazon;
(9) Goeric Huybrechts, AWS AI Labs, Amazon;
(10) Srikanth Vishnubhotla, AWS AI Labs, Amazon;
(11) Daniel Garcia-Romero, AWS AI Labs, Amazon;
(12) Sundararajan Srinivasan, AWS AI Labs, Amazon;
(13) Kyu J Han, AWS AI Labs, Amazon;
(14) Katrin Kirchhoff, AWS AI Labs, Amazon.
[2] We refer the reader to Appendix A.1 for more details on the audio encoder pre-training.
[3] https://openai.com/index/chatgpt
[4] https://www.anthropic.com/news/claude-2-1