Authors:
(1) Rafael Rafailov, Stanford University (equal contribution; more junior authors listed earlier);
(2) Archit Sharma, Stanford University (equal contribution; more junior authors listed earlier);
(3) Eric Mitchell, Stanford University (equal contribution; more junior authors listed earlier);
(4) Stefano Ermon, CZ Biohub;
(5) Christopher D. Manning, Stanford University;
(6) Chelsea Finn, Stanford University.
Table of Links
Abstract and 1. Introduction
2 Related Work
3 Preliminaries
4 Direct Preference Optimization
5 Theoretical Analysis of DPO
6 Experiments
7 Discussion, Acknowledgements, and References
Author Contributions
A Mathematical Derivations
A.1 Deriving the Optimum of the KL-Constrained Reward Maximization Objective
A.2 Deriving the DPO Objective Under the Bradley-Terry Model
A.3 Deriving the DPO Objective Under the Plackett-Luce Model
A.4 Deriving the Gradient of the DPO Objective and A.5 Proof of Lemma 1 and 2
A.6 Proof of Theorem 1
B DPO Implementation Details and Hyperparameters
C Further Details on the Experimental Set-Up and C.1 IMDb Sentiment Experiment and Baseline Details
C.2 GPT-4 prompts for computing summarization and dialogue win rates
C.3 Unlikelihood baseline
D Additional Empirical Results
D.1 Performance of Best of N baseline for Various N and D.2 Sample Responses and GPT-4 Judgments
D.3 Human study details
Additional Empirical Results
D.1 Performance of Best of N baseline for Various N
We find that the Best of N baseline is a strong, though computationally expensive, baseline in our experiments, since it requires sampling N completions per prompt. We include an evaluation of the Best of N baseline for various N on the Anthropic-HH dialogue and TL;DR summarization tasks; the results are shown in Figure 4.
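For concreteness, the following is a minimal sketch of the Best of N procedure, assuming hypothetical `sample_fn` and `reward_fn` callables that stand in for the SFT policy sampler and the learned reward model; it is an illustration of the general technique, not the exact implementation used in the paper.

```python
from typing import Callable, List, Tuple

def best_of_n(
    prompt: str,
    sample_fn: Callable[[str], str],         # hypothetical: draws one completion from the SFT policy
    reward_fn: Callable[[str, str], float],  # hypothetical: learned reward model score r(prompt, completion)
    n: int = 128,
) -> Tuple[str, float]:
    """Best of N: sample N completions and keep the one the reward model scores highest."""
    candidates: List[str] = [sample_fn(prompt) for _ in range(n)]
    scores: List[float] = [reward_fn(prompt, y) for y in candidates]
    best_idx = max(range(n), key=lambda i: scores[i])
    return candidates[best_idx], scores[best_idx]
```

The cost grows linearly in N at inference time, which is why larger N quickly becomes expensive relative to policies that need only a single sample.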
D.2 Sample Responses and GPT-4 Judgments
In this section, we present examples of comparisons between DPO and the baseline (PPO sampled at temperature 0 for summarization, and the ground-truth chosen response for dialogue). See Tables 4-6 for summarization examples, and Tables 7-10 for dialogue examples.
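As a rough illustration of how pairwise win rates based on GPT-4 judgments can be tallied (the exact GPT-4 prompts appear in Appendix C.2), here is a minimal sketch assuming a hypothetical `judge_fn` that returns the index of the response it prefers; the sketch randomizes presentation order to mitigate position bias and is not the paper's exact evaluation code.

```python
import random
from typing import Callable, List, Tuple

def pairwise_win_rate(
    pairs: List[Tuple[str, str, str]],         # (prompt, dpo_response, baseline_response)
    judge_fn: Callable[[str, str, str], int],  # hypothetical: returns 0 or 1, the index of the preferred shown response
) -> float:
    """Fraction of prompts on which the judge prefers the DPO response, with presentation order randomized."""
    wins = 0
    for prompt, dpo_resp, base_resp in pairs:
        if random.random() < 0.5:
            first, second, dpo_position = dpo_resp, base_resp, 0
        else:
            first, second, dpo_position = base_resp, dpo_resp, 1
        if judge_fn(prompt, first, second) == dpo_position:
            wins += 1
    return wins / len(pairs)
```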