
Study Shows Few-Shot Code Generation Outperforms Fine-Tuned Models | HackerNoon

News Room
Last updated: 2025/04/23 at 3:29 AM

Table of Links

Abstract and 1 Introduction

2 COCOGEN: Representing Commonsense structures with code and 2.1 Converting (T,G) into Python code

2.2 Few-shot prompting for generating G

3 Evaluation and 3.1 Experimental setup

3.2 Script generation: PROSCRIPT

3.3 Entity state tracking: PROPARA

3.4 Argument graph generation: EXPLAGRAPHS

4 Analysis

5 Related work

6 Conclusion, Acknowledgments, Limitations, and References

A Few-shot models size estimates

B Dynamic prompt Creation

C Human Evaluation

D Dataset statistics

E Sample outputs

F Prompts

G Designing Python class for a structured task

H Impact of Model size

I Variation in prompts

3 Evaluation

We experiment with three diverse structured commonsense generation tasks: (i) script generation (PROSCRIPT, Section 3.2), (ii) entity state tracking (PROPARA, Section 3.3), and (iii) explanation graph generation (EXPLAGRAPHS, Section 3.4). Dataset details are included in Appendix D. Despite sharing the general goal of structured commonsense generation, the three tasks differ considerably in both the form of the generated output and the kind of reasoning required.

3.1 Experimental setup

Model As our main Code-LLM for COCOGEN, we experiment with the latest version of CODEX, code-davinci-002, from OpenAI[1] in few-shot prompting mode.

Baselines We experimented with the following types of baselines:

  1. Text few-shot: Our hypothesis is that code-generation models can be repurposed to generate structured output better. Thus, natural baselines for our approach are NL-LLMs – language models trained on natural language corpora. We experiment with the latest versions of CURIE (text-curie-001) and DAVINCI (text-davinci-002), the two GPT-3-based models from OpenAI (Brown et al., 2020). For both models, the prompt consists of (T, G) examples, where G is simply flattened into a string (as in Figure 1c). DAVINCI is estimated to be much larger than CURIE, as our experiments also suggest (Appendix A). DAVINCI, popularly known as GPT-3, is the strongest text generation model available through the OpenAI API.[2]

  2. Fine-tuning: We fine-tune a T5-large model for EXPLAGRAPHS, and use the results from Sakaguchi et al. (2021) on T5-xxl for the PROSCRIPT tasks. In contrast to the few-shot setup, where the model only has access to a few examples, fine-tuned models observe the entire training set of the downstream task.
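As a rough illustration of the flattened-string format used by the text few-shot baselines above, a (T, G) pair might be serialized as follows. The exact serialization here is an assumption for illustration; the paper's own flattening is shown in Figure 1c.

```python
def flatten_graph(goal, nodes, edges):
    """Flatten a (T, G) pair into a plain-text string for an NL-LLM
    prompt. This serialization is illustrative only; the paper's exact
    flattening (Figure 1c) may differ."""
    steps = "; ".join(nodes)
    deps = "; ".join(f"{nodes[a]} -> {nodes[b]}" for a, b in edges)
    return f"Goal: {goal}. Steps: {steps}. Dependencies: {deps}."

text = flatten_graph(
    "bake a cake",
    ["preheat oven", "mix batter", "bake"],
    [(0, 2), (1, 2)],
)
print(text)
```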

Choice of prompt We created the prompt p by randomly sampling k examples from the training set. As all models have a bounded input size (e.g., 4096 tokens for CODEX code-davinci-002 and 4000 for GPT-3 text-davinci-002), the exact value of k is task dependent: more examples fit in a prompt for tasks where (T, G) is short. In our experiments, k varies between 5 and 30, and for a fair comparison the GPT-3 baseline is always given the same prompts as CODEX. To control for the variance caused by the specific examples selected into p, we repeat each experiment with at least 3 different prompts and report the average. We report the mean and standard deviations in Appendix I.
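The prompt-construction procedure described above can be sketched roughly as follows. The whitespace-based token estimate and the function's interface are assumptions for illustration, not the paper's implementation.

```python
import random

def make_prompt(train_examples, k, max_tokens=4096, seed=0):
    """Randomly sample k (T, G) examples from the training set and
    concatenate them into a prompt, stopping early if the model's
    context budget would be exceeded. The whitespace-split token
    count is a crude stand-in for a real tokenizer."""
    rng = random.Random(seed)
    sampled = rng.sample(train_examples, k)
    parts, used = [], 0
    for ex in sampled:
        n = len(ex.split())  # rough token estimate
        if used + n > max_tokens:
            break
        parts.append(ex)
        used += n
    return "\n\n".join(parts)
```

Repeating this with several seeds and averaging the resulting scores mirrors the variance-control protocol described above.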

COCOGEN: We use COCOGEN to refer to setups where CODEX is used with a Python prompt. In Section 4, we also experiment with dynamically creating a prompt for each input example, using NL-LLMs with code prompts, and using Code-LLMs with textual prompts.

3.2 Script generation: PROSCRIPT

Given a high-level goal (e.g., bake a cake), the aim of script generation is to produce a graph where each node is an action and edges capture dependencies between the actions (Figure 1a). We use the PROSCRIPT (Sakaguchi et al., 2021) dataset, in which the scripts are directed acyclic graphs, collected from a diverse range of sources including ROCStories (Mostafazadeh et al., 2016), DeScript (Wanzare et al., 2016), and VirtualHome (Puig et al., 2018).

Figure 1 shows an input-output example from PROSCRIPT and our conversion of the graph into Python code: we convert each node v ∈ V into an instance of a Node class, and we create the edges by adding a children attribute to each node. Additional examples are shown in Figure 6.
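The conversion described above can be sketched as follows. The class and attribute names match the description in the text, but the exact rendering (variable names, comment markers) is an assumption, not the paper's verbatim schema.

```python
# Sketch of converting a PROSCRIPT-style (T, G) pair into Python code:
# one Node instance per action, edges expressed via the children
# attribute. Variable naming is illustrative.

class Node:
    def __init__(self, name):
        self.name = name
        self.children = []

def graph_to_code(goal, nodes, edges):
    """Render a goal T and graph G = (nodes, edges) as Python source."""
    lines = [f"# goal: {goal}", "# nodes"]
    for i, name in enumerate(nodes):
        lines.append(f'node{i} = Node("{name}")')
    lines.append("# edges")
    for src, dst in edges:
        lines.append(f"node{src}.children.append(node{dst})")
    return "\n".join(lines)

code = graph_to_code(
    "bake a cake",
    ["preheat oven", "mix batter", "bake"],
    [(0, 2), (1, 2)],
)
print(code)
```

For edge prediction, the same rendering can simply be truncated after the `# edges` comment, leaving the model to complete the remaining lines.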

To represent a sample for edge prediction, we list the nodes in a random order (specified after the comment # nodes in Figure 1b). The model then completes the class by generating the code below the comment # edges.

Table 1: Semantic and structural metrics for the script generation task on PROSCRIPT. T5 is fine-tuned on the entire dataset, while the few-shot models (CURIE, DAVINCI, CODEX) use 15 examples in the prompt.

Table 2: Precision, recall, and F1 for the PROSCRIPT edge-prediction task. COCOGEN with 15 samples outperforms strong few-shot models, and T5 trained on 100 samples.

Results Table 1 shows the results for script generation. The main result is that COCOGEN (based on CODEX), with just 15 prompt examples, outperforms T5, which was fine-tuned on all 3,500 samples. Further, COCOGEN outperforms the few-shot NL-LLM CURIE across all semantic and structural metrics. COCOGEN also outperforms DAVINCI across all semantic metrics, while DAVINCI performs slightly better on two structural metrics.

Table 2 shows the results for edge prediction: COCOGEN significantly outperforms the NL-LLMs CURIE and DAVINCI. COCOGEN with only 15 examples also outperforms T5 fine-tuned on 100 examples. The strong performance on the edge-prediction task highlights COCOGEN's superior ability to capture structure, while factoring out the models' ability to generate the NL content.


[1] As of June 2022

[2] https://beta.openai.com/docs/models/gpt-3


Authors:

(1) Aman Madaan, Language Technologies Institute, Carnegie Mellon University, USA ([email protected]);

(2) Shuyan Zhou, Language Technologies Institute, Carnegie Mellon University, USA ([email protected]);

(3) Uri Alon, Language Technologies Institute, Carnegie Mellon University, USA ([email protected]);

(4) Yiming Yang, Language Technologies Institute, Carnegie Mellon University, USA ([email protected]);

(5) Graham Neubig, Language Technologies Institute, Carnegie Mellon University, USA ([email protected]).
