Markov Chains, Rewards & Rules | HackerNoon

Table of Links

Abstract and 1. Introduction and Related Work

Methodology

2.1 LLM-Sim Task

2.2 Data

2.3 Evaluation

Experiments

Results

Conclusion

Limitations and Ethical Concerns, Acknowledgements, and References

A. Model details

B. Game transition examples

C. Game rules generation

D. Prompts

E. GPT-3.5 results

F. Histograms

2 Methodology

We examine the abilities of LLMs to serve as world simulators in text-based virtual environments, in which an agent receives observations and proposes actions in natural language in order to complete certain objectives. Each text environment can be formally represented as a goal-conditioned partially observable Markov decision process (POMDP) (Kaelbling et al., 1998) with the 7-tuple (S, A, T, O, R, C, D), where S denotes the state space, A denotes the action space, T : S × A → S denotes the transition function, O denotes the observation function, R : S × A → ℝ denotes the reward function, C denotes a natural language “context message” that describes the goal and action semantics, and D : S × A → {0, 1} denotes the binary completion indicator function.
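As a rough illustration of the tuple above, the sketch below bundles T, O, R, C, and D into one Python object, with S and A implied by the type aliases. Everything here is an assumption made for illustration: the dictionary-based state, the signature of the observation function (the text does not give one), and the names `TextEnvironment`, `step`, and `toy_env` are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple

State = Dict[str, Any]  # assumption: a state is a dict of object properties
Action = str            # actions are natural-language strings


@dataclass(frozen=True)
class TextEnvironment:
    """Hypothetical container for the T, O, R, C, D components of the tuple."""
    transition: Callable[[State, Action], State]  # T : S x A -> S
    observe: Callable[[State], str]               # O (signature assumed)
    reward: Callable[[State, Action], float]      # R : S x A -> real number
    context: str                                  # C, the context message
    done: Callable[[State, Action], int]          # D : S x A -> {0, 1}


def step(env: TextEnvironment, state: State, action: Action) -> Tuple[State, str, float, int]:
    """Apply one action and return (next state, observation, reward, done flag)."""
    next_state = env.transition(state, action)
    return next_state, env.observe(next_state), env.reward(state, action), env.done(state, action)


def _toy_transition(state: State, action: Action) -> State:
    """Toy dynamics: the only meaningful action turns the sink on."""
    new_state = dict(state)
    if action == "turn on sink":
        new_state["sink_isOn"] = True
    return new_state


# A toy one-object environment, invented purely for this example.
toy_env = TextEnvironment(
    transition=_toy_transition,
    observe=lambda s: "The sink is on." if s["sink_isOn"] else "The sink is off.",
    reward=lambda s, a: 1.0 if (a == "turn on sink" and not s["sink_isOn"]) else 0.0,
    context="Goal: turn on the sink.",
    done=lambda s, a: int(a == "turn on sink"),
)

next_state, observation, reward, finished = step(toy_env, {"sink_isOn": False}, "turn on sink")
```

Keeping the context message as a plain string mirrors its description as natural language; the other components are ordinary functions so that a learned simulator could be swapped in for any of them.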

2.1 LLM-Sim Task

In practice, the whole state transition simulator F should consider two types of state transitions: action-driven transitions and environment-driven transitions. For the example in Figure 1, the action-driven transition is that the sink is turned on (isOn=true) after taking the action turn on sink, and the environment-driven transition is that water fills up the cup in the sink when the sink is on. To better understand the ability of LLMs to model each of these transitions, we further decompose the simulator function F into three steps: the action-driven transition simulator F_act, the environment-driven transition simulator F_env, and the game progress simulator F_R.
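As a rough illustration of this decomposition, the sketch below splits the sink example into an action-driven function, an environment-driven function, and a game-progress function, then composes them into a full step. The function names (`f_act`, `f_env`, `f_r`, `f`), the nested-dictionary state layout, the order of composition, and the scoring rule are all assumptions made for illustration; they are not taken from the paper.

```python
import copy
from typing import Any, Dict, Tuple

State = Dict[str, Dict[str, Any]]  # assumption: object name -> property dict


def f_act(state: State, action: str) -> State:
    """Action-driven transition: apply only the direct effect of the action."""
    new_state = copy.deepcopy(state)
    if action == "turn on sink":
        new_state["sink"]["isOn"] = True
    elif action == "turn off sink":
        new_state["sink"]["isOn"] = False
    return new_state


def f_env(state: State) -> State:
    """Environment-driven transition: dynamics that run regardless of the action."""
    new_state = copy.deepcopy(state)
    if new_state["sink"]["isOn"] and new_state["cup"]["in"] == "sink":
        new_state["cup"]["contains"] = "water"
    return new_state


def f_r(state: State) -> Dict[str, Any]:
    """Game progress under a hypothetical rule: the game is won once the cup holds water."""
    won = state["cup"]["contains"] == "water"
    return {"score": 1 if won else 0, "gameOver": won, "gameWon": won}


def f(state: State, action: str) -> Tuple[State, Dict[str, Any]]:
    """Full step: action effect first, then environment dynamics, then progress."""
    after_action = f_act(state, action)
    after_env = f_env(after_action)
    return after_env, f_r(after_env)


state = {"sink": {"isOn": False}, "cup": {"in": "sink", "contains": None}}
next_state, progress = f(state, "turn on sink")
```

In this toy run, `f_act` alone only flips `isOn`; the cup fills only after `f_env` is applied, which is the distinction between the two transition types.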

2.2 Data

Additional Context: Each game also includes a context message, c, that provides additional information to the model. The context consists of four parts: action rules describing the effect of each action on the game state, object rules describing the meaning of each object property and whether it is affected by the game’s underlying dynamics, scoring rules describing how an agent earns reward and the conditions under which the game is won or lost, and one or two example transitions (see Appendix B for details) from the held-out game mentioned above. For each game we generate three versions of the context: one where the rules are written by a human expert (one of the game authors), one where they are produced by an LLM with access to the game code, and one where no rules are provided. See Appendix C for additional details.

Table 3: GPT-4 game progress prediction results
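The four-part context and its three versions described above can be sketched as a small data structure. This is a minimal sketch under stated assumptions: the class `GameContext`, the helper `build_context_variants`, the section titles used when rendering, and the choice to keep the example transitions in the no-rules version are all hypothetical, and the rule strings are placeholders rather than text from the paper's prompts.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class GameContext:
    """Hypothetical holder for the four parts of a context message."""
    action_rules: Optional[str]
    object_rules: Optional[str]
    scoring_rules: Optional[str]
    example_transitions: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Join the parts that are present into one natural-language message."""
        sections = [
            ("Action rules", self.action_rules),
            ("Object rules", self.object_rules),
            ("Scoring rules", self.scoring_rules),
        ]
        parts = [f"{title}:\n{body}" for title, body in sections if body]
        if self.example_transitions:
            parts.append("Examples:\n" + "\n\n".join(self.example_transitions))
        return "\n\n".join(parts)


def build_context_variants(
    human_rules: Dict[str, str],
    llm_rules: Dict[str, str],
    examples: List[str],
) -> Dict[str, GameContext]:
    """Return the three context versions: human rules, LLM rules, and no rules."""
    return {
        "human": GameContext(**human_rules, example_transitions=list(examples)),
        "llm": GameContext(**llm_rules, example_transitions=list(examples)),
        # Assumption: the no-rules version still carries the example transitions.
        "no_rules": GameContext(None, None, None, example_transitions=list(examples)),
    }


placeholder_rules = {
    "action_rules": "turn on sink: sets the sink's isOn property to true.",
    "object_rules": "isOn: whether a device is currently running.",
    "scoring_rules": "The game is won when the cup contains water.",
}
variants = build_context_variants(placeholder_rules, placeholder_rules, ["<example transition>"])
```

Calling `variants["no_rules"].render()` yields only the examples section, which makes it easy to compare the three conditions with an otherwise identical prompt.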

2.3 Evaluation

Performance on LLM-Sim is determined by the model’s prediction accuracy w.r.t. the ground truth labels over a dataset of test samples. Depending on the experimental condition, the LLM must model object properties (when simulating F_act, F_env, or F) and/or game progress (when simulating F_R or F), defined as:

Object Properties: a list of all objects in the game, along with each object’s properties (e.g., temperature, size) and relationships to other objects (e.g., being within or on top of another object).

Game Progress: the status of the agent w.r.t. the overall goal, consisting of the current accumulated reward, whether the game has terminated, and whether the overall goal has been achieved.
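Prediction accuracy over a set of test samples can be sketched as follows. This is a minimal sketch assuming that a prediction counts as correct only when it matches the ground truth exactly; the text above does not specify the matching criterion, so treat that as an assumption. The function name `accuracy` and the field names in the sample records are hypothetical.

```python
from typing import Any, Sequence


def accuracy(predictions: Sequence[Any], gold: Sequence[Any]) -> float:
    """Fraction of samples whose prediction equals the ground-truth label exactly."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(1 for pred, truth in zip(predictions, gold) if pred == truth)
    return correct / len(gold)


# Game progress records: accumulated reward, termination, and goal achievement.
gold_progress = [
    {"score": 0, "gameOver": False, "gameWon": False},
    {"score": 1, "gameOver": True, "gameWon": True},
]
predicted_progress = [
    {"score": 0, "gameOver": False, "gameWon": False},
    {"score": 1, "gameOver": True, "gameWon": False},  # wrong on gameWon
]

progress_accuracy = accuracy(predicted_progress, gold_progress)
```

The same function applies to object properties if each sample is a dictionary mapping objects to their properties and relationships, since Python compares nested dictionaries by value.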

:::info
Authors:

(1) Ruoyao Wang, University of Arizona;

(2) Graham Todd, New York University;

(3) Ziang Xiao, Johns Hopkins University;

(4) Xingdi Yuan, Microsoft Research Montréal;

(5) Marc-Alexandre Côté, Microsoft Research Montréal;

(6) Peter Clark, Allen Institute for AI;

(7) Peter Jansen, University of Arizona and Allen Institute for AI.

:::

:::info
This paper is available on arXiv under the CC BY 4.0 license.

:::
