Table of Links
- Abstract and Introduction
- Related Work
- Problem Definition
- Method
- Experiments
- Conclusion and References
- A. Appendix
- A.1 Full Prompts and A.2 ICPL Details
- A.3 Baseline Details
- A.4 Environment Details
- A.5 Proxy Human Preference
- A.6 Human-in-the-Loop Preference
A.3 BASELINE DETAILS
To sample trajectories for reward learning, we employ the disagreement sampling scheme of Lee et al. (2021b) to select more informative queries during training. This scheme first generates a larger batch of trajectory pairs uniformly at random and then selects a smaller batch with high variance, i.e., high disagreement, across an ensemble of preference predictors. The selected pairs are used to update the reward model.
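A minimal sketch of what this selection step could look like, assuming the preference predictors are an ensemble of per-step reward models (e.g., PyTorch modules); the function and argument names below are illustrative, not the implementation used in the experiments:

```python
import torch

def disagreement_sampling(ensemble, seg_a, seg_b, k):
    """Keep the k candidate pairs whose preference predictions vary most
    across the reward-model ensemble.

    Assumes each model maps a batch of segments (N, T, feat) to per-step
    rewards of shape (N, T); seg_a and seg_b hold the two segments of each
    of N uniformly sampled candidate pairs, with N much larger than k.
    """
    with torch.no_grad():
        probs = []
        for model in ensemble:
            # Bradley-Terry preference probability from summed segment rewards.
            r_a = model(seg_a).sum(dim=1)   # (N,)
            r_b = model(seg_b).sum(dim=1)   # (N,)
            probs.append(torch.sigmoid(r_a - r_b))
        probs = torch.stack(probs)          # (ensemble_size, N)

        # High variance across ensemble members means high disagreement,
        # so these pairs are the most informative ones to label.
        disagreement = probs.var(dim=0)     # (N,)
        top = torch.topk(disagreement, k).indices
    return seg_a[top], seg_b[top]
```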
For a fair comparison, we recorded the number of times PrefPPO queried the oracle human simulator to compare two trajectories and obtain labels during the reward learning process, using this as a measure of the human effort involved. In the proxy human experiment, we set the maximum number of human queries Q to 49, 150, 1.5k, and 15k. Once this limit is reached, the reward model ceases to update, and only the policy model is updated via PPO. Algo. 3 illustrates the pseudocode for reward learning.
Algo. 4 illustrates the pseudocode for PrefPPO.
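Neither algorithm is reproduced in this excerpt, so the sketch below only illustrates the query-budget bookkeeping described above: reward-model updates stop once Q oracle queries have been spent, after which only the PPO policy continues to train. The class and attribute names (`RewardLearner`, `oracle`, `max_queries`) are hypothetical and not the interfaces of Algo. 3 or Algo. 4.

```python
import torch.nn.functional as F


class RewardLearner:
    """Ensemble reward learning with a hard cap Q on oracle preference queries."""

    def __init__(self, ensemble, optimizers, oracle, max_queries):
        self.ensemble = ensemble        # list of per-step reward models
        self.optimizers = optimizers    # one optimizer per ensemble member
        self.oracle = oracle            # simulated human: (seg_a, seg_b) -> {0, 1} labels
        self.max_queries = max_queries  # Q, e.g. 49, 150, 1.5k, or 15k in the proxy study
        self.queries_used = 0

    def update(self, seg_a, seg_b):
        # Once the query budget is exhausted, the reward model is frozen and
        # only the policy keeps training via PPO.
        remaining = self.max_queries - self.queries_used
        if remaining <= 0:
            return False

        seg_a, seg_b = seg_a[:remaining], seg_b[:remaining]
        labels = self.oracle(seg_a, seg_b)   # 1 if seg_a is preferred, else 0
        self.queries_used += len(labels)

        # Standard Bradley-Terry cross-entropy loss for each ensemble member.
        for model, opt in zip(self.ensemble, self.optimizers):
            logits = model(seg_a).sum(dim=1) - model(seg_b).sum(dim=1)
            loss = F.binary_cross_entropy_with_logits(logits, labels.float())
            opt.zero_grad()
            loss.backward()
            opt.step()
        return True
```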
Authors:
(1) Chao Yu, Tsinghua University;
(2) Hong Lu, Tsinghua University;
(3) Jiaxuan Gao, Tsinghua University;
(4) Qixin Tan, Tsinghua University;
(5) Xinting Yang, Tsinghua University;
(6) Yu Wang, Tsinghua University (equal advising);
(7) Yi Wu, Tsinghua University and the Shanghai Qi Zhi Institute (equal advising);
(8) Eugene Vinitsky, New York University (equal advising) ([email protected]).