Exploring Cutting-Edge Approaches to Iterative LLM Fine Tuning | HackerNoon


Authors:

(1) Corby Rosset, Microsoft Research and Correspondence to [email protected];

(2) Ching-An Cheng, Microsoft Research;

(3) Arindam Mitra, Microsoft Research;

(4) Michael Santacroce, Microsoft Research;

(5) Ahmed Awadallah, Microsoft Research and Correspondence to [email protected];

(6) Tengyang Xie, Microsoft Research and Correspondence to [email protected].

Table of Links

Abstract and 1 Introduction

2 Preliminaries

2.1 RLHF Based on Reward Models

2.2 RLHF with General Preferences

3 Direct Nash Optimization and 3.1 Derivation of Algorithm 1

3.2 Theoretical Analysis

4 Practical Algorithm – Iterative Contrastive Self-Improvement

5 Experiments and 5.1 Experimental Setup

5.2 Results and Analysis

6 Related Work

7 Conclusion and References

Appendix

A Extension to Regularized Preferences

B Detailed Proofs

C Additional Experimental Details

6 Related Work

We organize related work along two axes: whether the techniques use SFT or contrastive losses, and whether they operate in offline or online update settings.

Online RLHF algorithms: RLHF pioneered the alignment of language models with human preferences (Christiano et al., 2017; Stiennon et al., 2020), but it is unstable to train and memory-intensive, since the parameterized policy model, the reward model, and the advantage model must all be held on device during training.
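For reference, the objective these methods typically optimize with a policy-gradient algorithm such as PPO is the KL-regularized expected reward (a standard formulation, written here in generic notation rather than this paper's exact setup):

$$
\max_{\pi_\theta} \; \mathbb{E}_{x \sim \rho,\; y \sim \pi_\theta(\cdot \mid x)}\big[r(x, y)\big] \;-\; \beta\, \mathbb{E}_{x \sim \rho}\Big[\mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)\Big]
$$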

Reward-model Augmented SFT: Since the introduction of RLHF, several techniques have emerged that apply reward models in various ways, such as filtering training data or ranking responses. Reward rAnked Finetuning (RAFT) (Dong et al., 2023) and RRHF (Yuan et al., 2023b) offer the conceptually simplest solution for offline preference learning: sample multiple outputs from a policy, rank them with a reward model, and then finetune on the best sampled output using SFT, as sketched below. This resembles the iterative behavior-cloning technique DAgger (Ross et al., 2011).
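A minimal Python sketch of that best-of-n recipe; `policy`, `reward_model`, and their `generate`/`score`/`sft_update` methods are hypothetical placeholders, not APIs from RAFT or RRHF.

```python
def best_of_n_sft_step(policy, reward_model, prompts, n=8):
    # One RAFT/RRHF-style round: sample, rank with a reward model,
    # then finetune on the top-ranked response with ordinary SFT.
    sft_examples = []
    for x in prompts:
        candidates = [policy.generate(x) for _ in range(n)]      # sample n outputs
        scores = [reward_model.score(x, y) for y in candidates]  # rank them
        best = candidates[scores.index(max(scores))]             # keep the best one
        sft_examples.append((x, best))
    policy.sft_update(sft_examples)  # a standard supervised finetuning step
    return policy
```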

Offline Contrastive Preference Learning: There exist several loss functions for contrastive preference learning, first introduced in the offline setting, namely Direct Preference Optimization (Rafailov et al., 2023, DPO) and Calibrated Sequence Likelihood Estimation a.k.a. SLiC (Zhao et al., 2023). Azar et al. (2023) make it clear that point-wise reward estimates are no substitute for pair-wise preferences, and that a policy can easily overfit to deterministic preferences without proper regularization. They derive a more general objective for RLHF, IPO, to directly optimize offline preference probabilities.
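For concreteness, a minimal PyTorch sketch of the DPO loss together with the IPO variant, assuming sequence-level log-probabilities have already been computed for the preferred (w) and dispreferred (l) responses; this is a simplification of the papers' full recipes.

```python
import torch.nn.functional as F

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # DPO: logistic (Bradley-Terry) loss on the margin of
    # policy-vs-reference log-ratios for chosen vs. rejected responses.
    margin = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

def ipo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, tau=0.1):
    # IPO: regress the same log-ratio margin toward 1/(2*tau), which
    # regularizes against overfitting to deterministic preferences.
    h = (pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l)
    return ((h - 1.0 / (2.0 * tau)) ** 2).mean()
```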

Statistical Rejection Sampling Optimization (RSO) generates multiple samples from an initial model, ranks them to create training pairs, and optimizes them under a unified framework encompassing DPO and SLiC (Liu et al., 2024b). Inspired by the learning-to-rank literature, Listwise Preference Optimization (LiPO) extends pair-wise preference learning to list-wise (Liu et al., 2024a). Preference Ranking Optimization (PRO) also learns towards list-wise preferences (Song et al., 2024). The KTO algorithm takes a different approach from DPO: rather than assuming that a pair of good-vs-bad outputs exists for the same input, it assumes only a pool of good outputs and a pool of bad outputs across inputs, and optimizes an “unpaired” loss (Ethayarajh et al., 2024).
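A simplified sketch of the pair-construction step that RSO and related offline methods build on: sample several responses, rank them with a reward model, and form (chosen, rejected) pairs for a DPO/SLiC-style loss. The top-vs-bottom pairing below is a simplification of RSO's statistical rejection sampling, and the `generate`/`score` helpers are hypothetical.

```python
def build_preference_pairs(policy, reward_model, prompts, n=8):
    # Offline pair construction: rank n sampled responses per prompt and
    # pair the highest-scoring one against the lowest-scoring one.
    pairs = []
    for x in prompts:
        ys = [policy.generate(x) for _ in range(n)]
        ranked = sorted(ys, key=lambda y: reward_model.score(x, y), reverse=True)
        pairs.append((x, ranked[0], ranked[-1]))  # (prompt, chosen, rejected)
    return pairs  # feed into a contrastive loss such as dpo_loss above
```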

Iterative Reward-based Finetuning: Reinforced Self-Training (ReST) is one of the first methods to explore iterative self-improving training strategies, framed as a two-stage procedure: a “Grow” step that samples from the current policy, and an “Improve” step that uses a reward model to filter for ever-higher-quality samples, which are then used to improve the policy with offline RL (Gulcehre et al., 2023). A follow-up work explores the use of AI feedback rather than reward ranking (Singh et al., 2023).
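A schematic of that Grow/Improve loop; the rising reward threshold values and the helper methods are illustrative assumptions rather than the exact ReST recipe.

```python
def rest_style_training(policy, reward_model, prompts,
                        thresholds=(0.5, 0.7, 0.9), samples_per_prompt=8):
    # ReST-style self-improvement: "Grow" samples a dataset from the current
    # policy; "Improve" keeps only samples that clear an ever-higher reward bar.
    for threshold in thresholds:
        grown = [(x, policy.generate(x))
                 for x in prompts
                 for _ in range(samples_per_prompt)]          # Grow step
        kept = [(x, y) for x, y in grown
                if reward_model.score(x, y) >= threshold]     # Improve step (filter)
        policy.offline_update(kept)  # offline RL / SFT on the filtered data
    return policy
```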

On-policy Contrastive Learning: Self-Rewarding Language Models (Yuan et al., 2024) is in practice very similar to DNO. They study the benefits of batched iterative training on preferences derived from a recent policy’s sampled outputs, but in their work the policy itself serves as the annotator, which at the start can provide only weak preference signals. Self-Play Fine-Tuning (Chen et al., 2024), a.k.a. SPIN, and Adversarial Preference Optimization (Cheng et al., 2023), a.k.a. APO, are both iterative LLM training techniques compatible with contrastive losses, but they make a very limiting assumption: that the teacher is better than the student, without regard to any annotator feedback.
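A minimal sketch of the policy-as-annotator idea in Self-Rewarding Language Models; `generate` and `judge_score` are hypothetical helpers, and real implementations use an LLM-as-a-judge prompt rather than a scalar scoring method.

```python
def self_rewarding_pairs(policy, prompts, k=4):
    # The current policy both samples candidates and judges them,
    # yielding preference pairs for the next contrastive training round.
    pairs = []
    for x in prompts:
        ys = [policy.generate(x) for _ in range(k)]
        scores = [policy.judge_score(x, y) for y in ys]  # policy annotates itself
        chosen = ys[scores.index(max(scores))]
        rejected = ys[scores.index(min(scores))]
        pairs.append((x, chosen, rejected))
    return pairs  # train with a DPO-style loss, then repeat in batched iterations
```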

The Cringe Loss (Adolphs et al., 2022) is a token-level loss function that contrasts the correct next token against a hard-negative token from the vocabulary that has a high logit weight but is still incorrect. The Pairwise Cringe Loss (Xu et al., 2023b) applies the cringe loss to an iterative, self-improving style of training.
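A token-level sketch based only on the description above; the papers' actual formulations sample the contrast token and treat positive and negative sequences differently.

```python
import torch.nn.functional as F

def cringe_style_loss(logits, targets):
    # logits: (seq_len, vocab_size); targets: (seq_len,) gold next-token ids.
    # Contrast the gold token's logit against the highest-scoring wrong token.
    gold = logits.gather(1, targets.unsqueeze(1)).squeeze(1)         # gold-token logits
    masked = logits.scatter(1, targets.unsqueeze(1), float("-inf"))  # hide the gold token
    hard_negative = masked.max(dim=1).values                         # best incorrect token
    return -F.logsigmoid(gold - hard_negative).mean()                # two-way contrast
```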

On-Policy General Preference Optimization: Wang et al. (2023) consider finding the von Neumann winner of general preferences via multi-agent RL from a theoretical perspective. Nash-MD optimizes a policy towards the Nash equilibrium of a generalized preference model using policy gradients, showing that by sampling from a mixture of policies, one can converge to the Nash equilibrium in the last iterate (Munos et al., 2023). Self-play Preference Optimization (SPO) is another online two-player mini-max game that converges to a Nash equilibrium with no-regret guarantees (Swamy et al., 2024). However, these techniques are not as data-efficient as contrastive losses and are difficult to implement faithfully without cumbersome two-timescale updates (Munos et al., 2023). A concurrent improvement, IPO-MD, mitigates these difficulties by using purely on-policy IPO updates and is empirically evaluated on an article summarization task (Calandriello et al., 2024). Guo et al. (2024) also propose to eliminate rewards in online AI feedback (OAIF) by using another LLM to annotate which of two online-sampled outputs from the current policy is preferred. However, all of the above studies only consider training pairs constructed between self-play “student vs student” samples, and between the student and the initial policy π_ref. That is, there is no concept of a more powerful “teacher” to compare against in their training pairs. We showed in Table 2 that omitting these “student vs teacher” preferences may hinder performance.
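For completeness, the two-player objective that these general-preference methods target; this is a generic sketch of the von Neumann winner, and the exact regularization and update rules differ across Nash-MD, SPO, and related methods:

$$
\pi^\star \in \arg\max_{\pi} \min_{\pi'} \; \mathbb{E}_{x \sim \rho,\; y \sim \pi(\cdot \mid x),\; y' \sim \pi'(\cdot \mid x)} \big[\, \mathcal{P}(y \succ y' \mid x) \,\big]
$$

Because the game is symmetric, such a policy is preferred to any alternative policy at least half of the time.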
