Exploring Advanced Preference Optimization Techniques for Better LLMs

Authors:
(1) Corby Rosset, Microsoft Research and Correspondence to [email protected];
(2) Ching-An Cheng, Microsoft Research;
(3) Arindam Mitra, Microsoft Research;
(4) Michael Santacroce, Microsoft Research;
(5) Ahmed Awadallah, Microsoft Research and Correspondence to [email protected];
(6) Tengyang Xie, Microsoft Research and Correspondence to [email protected].
Table of Links
Abstract and 1 Introduction
2 Preliminaries
2.1 RLHF Based on Reward Models
2.2 RLHF with General Preferences
3 Direct Nash Optimization and 3.1 Derivation of Algorithm 1
3.2 Theoretical Analysis
4 Practical Algorithm – Iterative Contrastive Self-Improvement
5 Experiments and 5.1 Experimental Setup
5.2 Results and Analysis
6 Related Work
7 Conclusion and References
Appendix
A Extension to Regularized Preferences
B Detailed Proofs
C Additional Experimental Details
We organize the related work according to whether the techniques use SFT-based or contrastive losses, and whether they update the policy offline or online.
Online RLHF algorithms: RLHF pioneered the alignment of language models to human preferences (Christiano et al., 2017), typically by fitting a reward model to human preference data and then optimizing the policy against it with reinforcement learning.
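For reference, this family of reward-model-based methods typically optimizes a KL-regularized reward-maximization objective of the following standard form (written here for orientation, not quoted from this paper), where r is the learned reward model, β the regularization coefficient, and πref the reference policy:

```latex
\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)} \big[ r(x, y) \big]
\;-\; \beta \, \mathbb{D}_{\mathrm{KL}} \big[ \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big]
```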
SFT with reward models: Since the introduction of RLHF, many emerging techniques have applied reward models in other ways, such as filtering training data or ranking responses. Reward-rAnked Fine-Tuning (RAFT) (Dong et al., 2023) samples several responses per prompt, ranks them with a reward model, and fine-tunes the policy on the top-ranked ones. This is similar to the iterative behavior-cloning technique DAgger (Ross et al., 2011).
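A minimal sketch of this reward-ranked fine-tuning recipe, assuming hypothetical helpers policy.generate, reward_model.score, and sft_update rather than any specific implementation, could look like this:

```python
# Sketch of reward-ranked fine-tuning in the spirit of RAFT (Dong et al., 2023).
# `policy.generate`, `reward_model.score`, and `sft_update` are hypothetical stand-ins.

def reward_ranked_finetune(policy, reward_model, prompts, k=8, top_m=1):
    finetune_set = []
    for x in prompts:
        candidates = [policy.generate(x) for _ in range(k)]        # sample k responses per prompt
        ranked = sorted(candidates,
                        key=lambda y: reward_model.score(x, y),
                        reverse=True)                              # rank by reward-model score
        finetune_set.extend((x, y) for y in ranked[:top_m])        # keep the highest-ranked responses
    return sft_update(policy, finetune_set)                        # ordinary SFT on the filtered data
```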
Offline contrastive preference learning: Many contrastive losses for learning from preferences exist; the first to be introduced in the offline setting was Direct Preference Optimization (DPO) (Rafailov et al., 2023). Azar et al. (2023) showed that point-wise reward estimates are no substitute for pair-wise preferences, and that a policy can easily overfit to deterministic preferences without proper regularization. They derive a more general objective for RLHF, IPO, to optimize preference probabilities directly without relying on a reward model.
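As a concrete illustration of such a contrastive loss, a minimal PyTorch sketch of the DPO objective, taking summed sequence log-probabilities as inputs (variable names are illustrative, not from any particular codebase), might look as follows:

```python
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO (Rafailov et al., 2023): increase the policy's log-likelihood ratio
    (relative to the reference model) on the preferred response y_w and
    decrease it on the dispreferred response y_l."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()
```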
Statistical Rejection Sampling Optimization (RSO) generates multiple samples from an initial model, ranks them to create training pairs, and optimizes them under a unified framework that encompasses DPO and SLiC (Liu et al., 2024b). Inspired by the learning-to-rank literature, Listwise Preference Optimization (LiPO) extends pair-wise preferences to list-wise preferences (Liu et al., 2024a). Preference Ranking Optimization (PRO) also learns list-wise preferences (Song et al., 2024). The KTO algorithm takes a different approach from DPO: it does not assume that good-vs-bad output pairs exist for the same input, but instead learns from a set of good outputs and a set of bad outputs for arbitrary inputs by optimizing an unpaired loss (Ethayarajh et al., 2024).
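The pair-construction step shared by several of these methods can be sketched schematically as follows; this only illustrates turning ranked on-policy samples into chosen/rejected pairs for a contrastive loss, not RSO's actual statistical rejection-sampling procedure, and policy.generate and score are hypothetical helpers:

```python
# Schematic: turn ranked samples into chosen/rejected pairs for a contrastive loss.
# `policy.generate` and `score` are hypothetical stand-ins.

def build_preference_pairs(policy, score, prompts, k=8):
    pairs = []
    for x in prompts:
        ys = sorted((policy.generate(x) for _ in range(k)),
                    key=lambda y: score(x, y), reverse=True)
        pairs.append({"prompt": x, "chosen": ys[0], "rejected": ys[-1]})  # best vs. worst
    return pairs
```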
Reward-based iterative fine-tuning: Reinforced Self-Training (ReST) is one of the first methods to explore an iterative self-improvement strategy. It alternates between a "grow" step, which samples a dataset from the current policy, and an "improve" step, which uses a reward model to filter ever-higher-quality samples that are then used to improve the policy with offline RL (Gulcehre et al., 2023). Follow-up work explores using AI feedback instead of reward-model scoring (Singh et al., 2023).
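A rough sketch of such a grow/improve loop, again assuming hypothetical helpers policy.generate, reward_model.score, and offline_update, is:

```python
# Rough sketch of a ReST-style grow/improve loop (Gulcehre et al., 2023).
# `policy.generate`, `reward_model.score`, and `offline_update` are hypothetical helpers.

def rest_style_training(policy, reward_model, prompts, iterations=3,
                        samples_per_prompt=4, threshold=0.0):
    for _ in range(iterations):
        # "Grow": sample a dataset from the current policy
        grown = [(x, policy.generate(x)) for x in prompts
                 for _ in range(samples_per_prompt)]
        # "Improve": keep only samples the reward model scores highly,
        # then update the policy on the filtered dataset
        kept = [(x, y) for x, y in grown if reward_model.score(x, y) > threshold]
        policy = offline_update(policy, kept)
    return policy
```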
On-policy contrastive learning: Yuan et al. (2024) study the benefits of iteratively training on preferences derived from outputs sampled from the most recent policy; in their work, however, the policy itself serves as the annotator, which initially can provide only weak preference signals. Adversarial Preference Optimization (APO), in which the policy and a reward model are updated alternately in a min-max game, takes an adversarial view of this setup (Cheng et al., 2023).
Adolphs et al. (2022) propose the Cringe loss, a token-level contrastive loss for learning from negative examples; the Pairwise Cringe loss extends it to pair-wise preference data (Xu et al., 2023).
On-policy general preference optimization: Wang et al. (2023) consider finding the von Neumann winner under general preferences via multi-agent RL from a theoretical perspective. Nash-MD optimizes a policy towards the Nash equilibrium of a general preference model using policy gradients, showing that by sampling from a mixture of policies one can converge to the Nash equilibrium in the last iterate (Munos et al., 2023). Self-Play Preference Optimization (SPO) is another two-player mini-max game that converges to the Nash equilibrium with no-regret guarantees (Swamy et al., 2024). However, these techniques are not as data-efficient as contrastive losses and are admittedly difficult to implement faithfully without cumbersome policy updates (Munos et al., 2023). A more recent improvement, IPO-MD, alleviates these difficulties by using purely on-policy IPO updates and is evaluated empirically (Calandriello et al., 2024). Guo et al. (2024) also propose to eliminate reward models in Online AI Feedback (OAIF), using another LLM to annotate which of two outputs sampled online from the current policy is preferred. However, all of the above studies consider only training pairs constructed between "student vs. student" samples from self-play, or between the student and the initial πref; that is, there is no notion of a more powerful "teacher" against which training pairs can be constructed. In Table 2, we show that omitting the "student vs. teacher" preferences can hinder performance.
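For reference, the objective this family of methods targets can be written as a symmetric two-player game over a general preference function P, with the Nash equilibrium (von Neumann winner) defined roughly as follows (a standard formulation, stated here in the notation used above):

```latex
\pi^{\star} \;\in\; \arg\max_{\pi}\, \min_{\pi'} \;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x),\; y' \sim \pi'(\cdot \mid x)}
\big[ \mathcal{P}(y \succ y' \mid x) \big]
```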