How CODEX Model Size Influences COCOGEN's Output Quality


Last updated: 2025/04/24 at 12:56 PM

News Room Published 24 April 2025

Table of Links

Abstract and 1 Introduction

2 COCOGEN: Representing Commonsense structures with code and 2.1 Converting (T,G) into Python code

2.2 Few-shot prompting for generating G

3 Evaluation and 3.1 Experimental setup

3.2 Script generation: PROSCRIPT

3.3 Entity state tracking: PROPARA

3.4 Argument graph generation: EXPLAGRAPHS

4 Analysis

5 Related work

6 Conclusion, Acknowledgments, Limitations, and References

A Few-shot models size estimates

B Dynamic prompt Creation

C Human Evaluation

D Dataset statistics

E Sample outputs

F Prompts

G Designing Python class for a structured task

H Impact of Model size

I Variation in prompts

G Designing Python class for a structured task

Figure 7 shows three different designs for EXPLAGRAPHS. For PROSCRIPT, the formats include representing a script as a Networkx[8] class (Figure 8), as a DOT-like class (Figure 9), and as a tree (Figure 10).
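To make the difference between such encodings concrete, here is a minimal sketch of one small script written two ways: as a DOT-like literal and as a tree. The class names, the toy goal and steps, and the helper functions are hypothetical illustrations, not the prompts used in the paper (those are the subject of Figures 8 to 10).

```python
from dataclasses import dataclass, field

# Hypothetical script: steps and "happens before" edges for a toy goal.
STEPS = ["find recipe", "gather ingredients", "preheat oven", "bake cake"]
EDGES = [(0, 1), (0, 2), (1, 3), (2, 3)]


class ScriptAsDot:
    """DOT-like encoding: the graph is spelled out literally as text."""

    goal = "bake a cake"

    def to_dot(self) -> str:
        lines = ["digraph G {"]
        lines += [f'  step{i} [label="{s}"];' for i, s in enumerate(STEPS)]
        lines += [f"  step{a} -> step{b};" for a, b in EDGES]
        lines.append("}")
        return "\n".join(lines)


@dataclass
class Node:
    """Tree encoding: each step owns the steps that follow it."""

    label: str
    children: list = field(default_factory=list)


def build_tree(root: int = 0) -> Node:
    # A step with several predecessors is duplicated under each of them,
    # which is one way a tree can approximate a DAG.
    kids = [build_tree(b) for a, b in EDGES if a == root]
    return Node(STEPS[root], kids)


def leaves(node: Node) -> list:
    """Collect the labels of all leaf steps, left to right."""
    if not node.children:
        return [node.label]
    return [leaf for child in node.children for leaf in leaves(child)]


print(ScriptAsDot().to_dot())
print(leaves(build_tree()))
```

The DOT-like form keeps every edge explicit, while the tree form has to repeat the shared final step under both of its predecessors. Which trade-off a code model handles better is an empirical question of the kind Tables 14 and 15 address.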

H Impact of Model size

The CODEX model released by OpenAI is available in two versions[9]: code-davinci-001 and code-davinci-002. While the exact sizes of the models are unknown because of their proprietary nature, the OpenAI API describes code-davinci-002 as the "most capable Codex model." Tables 16 and ?? compare COCOGEN + code-davinci-001 with COCOGEN + code-davinci-002. Note that both code-davinci-001 and code-davinci-002 can fit 4000 tokens, so the number of in-context examples was identical for the two settings. The results show that for identical prompts, COCOGEN + code-davinci-002 vastly outperforms COCOGEN + code-davinci-001, showing the importance of having a better underlying code generation model.
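The point about the shared context window can be illustrated with a small sketch: if examples are packed into a fixed token budget, two models with the same budget receive the same number of examples. The function names, the greedy packing strategy, and the whitespace-based token count below are all assumptions made for illustration; the paper does not state that examples were packed this way, and a real setup would use the model's own tokenizer.

```python
def count_tokens(text: str) -> int:
    # Assumption: whitespace splitting stands in for the model's real tokenizer.
    return len(text.split())


def pack_examples(examples, query, budget=4000):
    """Greedily add in-context examples until the token budget would be exceeded."""
    used = count_tokens(query)
    chosen = []
    for ex in examples:
        cost = count_tokens(ex)
        if used + cost > budget:
            break
        chosen.append(ex)
        used += cost
    return chosen


# Toy data: ten examples of 1000 whitespace tokens each, and a 500-token query.
examples = [f"example {i} " + "tok " * 998 for i in range(10)]
query = "tok " * 500

# Both models share the same 4000-token limit, so the same call serves both.
shared_prompt_examples = pack_examples(examples, query, budget=4000)
print(len(shared_prompt_examples))
```

Because the budget is the only model-dependent input here, any difference in output quality between the two settings cannot be attributed to one model seeing more examples than the other.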

Figure 5: Example graphs for each of the tasks used for COCOGEN: PROSCRIPT (top-left), EXPLAGRAPHS (top-right), and PROPARA (bottom).

Table 13: Performance of CODEX on the three different formats present in Figure 7 for EXPLAGRAPHS.

Table 14: Performance of CODEX-001 and CODEX-002 on the different formats presented in Figures 10 and 9 for PROSCRIPT edge prediction. We find that the literal format (Figure 9), which combines structure with a literal representation of the graph, performs best for CODEX-002.

Model size vs. sensitivity to the prompt. Table 14 shows the performance of CODEX-001 (smaller) and CODEX-002 (larger; see also Appendix A) on identical prompts. Our experiments suggest that as model size increases, the model may become progressively less sensitive to the prompt design.

I Variation in prompts

We run each experiment with 4 different random seeds, where the random seeds decide the order of examples in the prompt. We find minimal variance across 3 runs that use different fixed prompts. Further, as shown in Tables 18, 19, 20, and 21, all improvements of COCOGEN over DAVINCI are statistically significant (p-value < 0.001).
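A minimal sketch of this protocol follows: a seed fixes the order of the few-shot examples, and per-seed scores are summarized by mean and standard deviation. The function names and example strings are hypothetical, the scores are placeholders rather than results from the paper, and the use of the sample standard deviation is an assumption, since the source does not say which estimator was used.

```python
import random
import statistics


def build_prompt(examples, seed):
    """Order the few-shot examples with a seed-controlled shuffle."""
    order = list(examples)  # copy so the caller's list is not mutated
    random.Random(seed).shuffle(order)
    return "\n\n".join(order)


def summarize(scores):
    """Return the mean and sample standard deviation across runs."""
    return statistics.mean(scores), statistics.stdev(scores)


examples = ["# example A", "# example B", "# example C", "# example D"]
prompts = {seed: build_prompt(examples, seed) for seed in (0, 1, 2)}

# Placeholder scores, one per seed. Real values would come from evaluating
# the model's output for each prompt.
placeholder_scores = [0.50, 0.52, 0.51]
mean, std = summarize(placeholder_scores)
print(f"{mean:.3f} +/- {std:.3f}")
```

Using a dedicated `random.Random(seed)` instance rather than the global generator keeps each ordering reproducible regardless of other random calls in the program.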

Figure 6: A PROSCRIPT plan (top) and the corresponding Python code (bottom).

Table 18: PROSCRIPT script generation: mean and standard deviation across three different random seeds.

Table 21: PROPARA: mean and standard deviation across three different random seeds.

Table 19: PROSCRIPT edge prediction: mean and standard deviation across three different random seeds.

Table 15: CODEX results on PROSCRIPT generation for various Python source formats.

Figure 7: Templates tried for explagraph.

Table 16: CODEX-001 vs 002 on PROSCRIPT script generation

Figure 8: Proscript as a Networkx class.

Figure 9: Representing PROSCRIPT graph literally.

Table 20: EXPLAGRAPHS: mean and standard deviation across three different random seeds.

Figure 10: Proscript with a tree-encoding.

[9] as of June 2022

Authors:

(1) Aman Madaan, Language Technologies Institute, Carnegie Mellon University, USA;

(2) Shuyan Zhou, Language Technologies Institute, Carnegie Mellon University, USA;

(3) Uri Alon, Language Technologies Institute, Carnegie Mellon University, USA;

(4) Yiming Yang, Language Technologies Institute, Carnegie Mellon University, USA;

(5) Graham Neubig, Language Technologies Institute, Carnegie Mellon University, USA.
