Reading: GPT-4 vs GPT-3.5 Performance in Game Simulations | HackerNoon

GPT-4 vs GPT-3.5 Performance in Game Simulations | HackerNoon

Last updated: 2025/09/25 at 1:16 AM

News Room Published 25 September 2025

Table of Links

Abstract and 1. Introduction and Related Work

Methodology

2.1 LLM-Sim Task

2.2 Data

2.3 Evaluation
Experiments
Results
Conclusion
Limitations and Ethical Concerns, Acknowledgements, and References

A. Model details

B. Game transition examples

C. Game rules generation

D. Prompts

E. GPT-3.5 results

F. Histograms

D Prompts

The prompts introduced in this section includes game rules that can either be human written rules or LLM generated rules. For experiments without game rules, we simply remove the rules from the corresponding prompts.

D.1 Prompt Example: Fact

D.1 Prompt Example: Fact

D.1.2 State Difference Prediction

D.2 Prompt Example: Fenv

D.2.1 Full State Prediction

D.2.2 State Difference Prediction

D.3 Prompt Example: FR (Game Progress)

D.4 Prompt Example: F

D.4.1 Full State Prediction

D.4.2 State Difference Prediction

D.5 Other Examples

Below is an example of the rule of an action:

Below is an example of the rule of an object:

Below is an example of the score rule:

Below is an example of a game state:

Table 6: GPT-3.5 game progress prediction results

Below is an example of a JSON that describes the difference of two game states:

E GPT-3.5 results

Table 5 and Table 6 shows the performance of a GPT-3.5 simulator predicting objects properties and game progress respectively. There is a huge gap between the GPT-4 performance and GPT-3.5 performance, providing yet another example of how fast LLM develops in the two years. It is also worth notices that the performance difference is larger when no rules is provided, indicating that GPT-3.5 is especially weak at applying common sense knowledge to this few-shot world simulation task.

F Histograms

1. In Figure 3, we show detailed experimental results on the full state prediction task performed by GPT-4.

Table 7: Description of object properties mentioned in Figure 2

2. In Figure 4, we show detailed experimental results on the state difference prediction task performed by GPT-4.

3. In Figure 5, we show detailed experimental results on the full state prediction task performed by GPT-3.5.

4. In Figure 6, we show detailed experimental results on the state difference prediction task performed by GPT-3.5.

(a) Human-generated rules.

(b) LLM-generated rules.

(c) No rules.

Figure 3: GPT-4 – Full State prediction from a) Human-generated rules, b) LLM-generated rules, and c) No rules.

(a) Human-generated rules.

(b) LLM-generated rules.

(c) No rules.

Figure 4: GPT-4 – Difference prediction from a) Human-generated rules, b) LLM-generated rules, and c) No rules.

(a) Human-generated rules.

(b) LLM-generated rules.

(c) No rules.

Figure 5: GPT-3.5 – Full State prediction from a) Human-generated rules, b) LLM-generated rules, and c) No rules.

(a) Human-generated rules.

(b) LLM-generated rules.

(c) No rules.

Figure 6: GPT-3.5 – Difference prediction from a) Human-generated rules, b) LLM-generated rules, and c) No rules.

:::info
Authors:

(1) Ruoyao Wang, University of Arizona ([email protected]);

(2) Graham Todd, New York University ([email protected]);

(3) Ziang Xiao, Johns Hopkins University ([email protected]);

(4) Xingdi Yuan, Microsoft Research Montréal ([email protected]);

(5) Marc-Alexandre Côté, Microsoft Research Montréal ([email protected]);

(6) Peter Clark, Allen Institute for AI ([email protected]).;

(7) Peter Jansen, University of Arizona and Allen Institute for AI ([email protected]).

:::

:::info
This paper is available on arxiv under CC BY 4.0 license.

:::

Share This Article

Previous Article

Qualcomm’s New Mobile Chip Could Give Android Phones A Boost Over The iPhone 17 Pro – BGR

Qualcomm’s New Mobile Chip Could Give Android Phones A Boost Over The iPhone 17 Pro – BGR

PlayStation’s new wireless speakers are for your desktop

PlayStation’s new wireless speakers are for your desktop

Leave a comment

Leave a Reply Cancel reply