Table of Links
Abstract and 1 Introduction
2 COCOGEN: Representing Commonsense structures with code and 2.1 Converting (T,G) into Python code
2.2 Few-shot prompting for generating G
3 Evaluation and 3.1 Experimental setup
3.2 Script generation: PROSCRIPT
3.3 Entity state tracking: PROPARA
3.4 Argument graph generation: EXPLAGRAPHS
4 Analysis
5 Related work
6 Conclusion, Acknowledgments, Limitations, and References
A Few-shot model size estimates
B Dynamic prompt creation
C Human Evaluation
D Dataset statistics
E Sample outputs
F Prompts
G Designing Python class for a structured task
H Impact of Model size
I Variation in prompts
A Few-shot model size estimates
As OpenAI has not released any details about the size of their few-shot models, we estimate their relative strengths and weaknesses on code and text generation by calculating the average loss per token. To calculate the average loss of each of these models on code, we use the implementation provided by Xu et al. (2022).[5] Perplexity on text was evaluated on 30 random Wikipedia pages from Wikiplots[6] following a similar procedure. The structure and text generation capabilities of the models are apparent from the results in Table 7: DAVINCI outperforms CODEX on text generation but is worse on code generation, and vice versa. CURIE underperforms both DAVINCI and CODEX significantly. Importantly, these results show that CODEX and DAVINCI are of comparable capacities, making their comparison fair.
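Since the OpenAI models are accessible only through an API, the snippet below is a minimal sketch of computing average loss per token with an open causal language model from HuggingFace Transformers; the checkpoint name is a placeholder stand-in, not the models compared in Table 7.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder open model; the OpenAI models compared in Table 7 are API-only.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def avg_loss_per_token(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # `out.loss` is the mean token-level cross-entropy (in nats);
    # exp(loss) gives perplexity.
    return out.loss.item()
```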
B Dynamic prompt creation
As an alternative to constructing prompts from a fixed set of examples, there is now growing interest in customizing the in-context examples for each test example Ttest. Popular techniques typically train a retriever, which is used to fetch the examples in the training set that are closest to Ttest (Liu et al., 2021; Rubin et al., 2021; Poesia et al., 2021).
Specifically, Poesia et al. (2021) train a retriever with a target-similarity tuning (TST) objective over a corpus D of (x, y) examples. TST learns an embedding function f such that for any pair of examples (xi, yi) and (xj, yj), if yi ∼ yj then f(xi) ∼ f(xj). For a new input x, f(x) is used to retrieve the closest examples from D.
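As an illustration, retrieval with such an encoder can be implemented as sketched below using SentenceTransformers; the checkpoint name and the value of k are placeholders, not the configuration used in the experiments.

```python
from sentence_transformers import SentenceTransformer, util

# Assumption: `train_inputs` is a list of training inputs x; an off-the-shelf
# mpnet checkpoint stands in for the learned embedding function f.
encoder = SentenceTransformer("all-mpnet-base-v2")

def closest_examples(t_test, train_inputs, k=15):
    corpus_emb = encoder.encode(train_inputs, convert_to_tensor=True)
    query_emb = encoder.encode(t_test, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, corpus_emb, top_k=k)[0]
    # Return the indices of the k training examples closest to t_test.
    return [hit["corpus_id"] for hit in hits]
```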
We follow Poesia et al. (2021) and train a knowledge-similarity tuner (KST). We use mpnet-base[7] with SentenceTransformers (Reimers and Gurevych, 2019) to fine-tune a retrieval function f by minimizing the following loss:
where fθ is parameterized using a transformer.
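One way to instantiate this objective, following the TST setup of Poesia et al. (2021), is to regress the cosine similarity of the two input embeddings onto a similarity score S(Gi, Gj) computed over the corresponding output graphs; the exact graph-similarity measure S is an assumption here:

$$
\mathcal{L}(\theta) = \sum_{(T_i, G_i),\, (T_j, G_j) \in D} \Big( \cos\big(f_\theta(T_i), f_\theta(T_j)\big) - S(G_i, G_j) \Big)^2
$$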
Results of using KST with PROSCRIPT (Table 8) and EXPLAGRAPHS (Table 9) show that while KST is highly effective for edge prediction, the results are mixed for EXPLAGRAPHS and PROSCRIPT. For PROSCRIPT, KST yields marginal gains. For EXPLAGRAPHS, however, a number of training examples have overlapping themes (Table 10), and thus creating the prompt dynamically reduces the effective information in the prompt.
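For concreteness, the following is a minimal sketch of how such a tuner could be fine-tuned with the SentenceTransformers library; the pair construction, the precomputed graph similarities, and the base checkpoint name are illustrative assumptions rather than the exact training setup.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Assumption: `pairs` holds tuples (T_i, T_j, sim), where sim = S(G_i, G_j)
# is a precomputed similarity between the corresponding output graphs.
def train_kst(pairs, base_model="all-mpnet-base-v2", epochs=1):
    model = SentenceTransformer(base_model)
    examples = [InputExample(texts=[t_i, t_j], label=float(sim))
                for t_i, t_j, sim in pairs]
    loader = DataLoader(examples, shuffle=True, batch_size=16)
    # CosineSimilarityLoss regresses cos(f(T_i), f(T_j)) onto the label,
    # matching the squared-error objective sketched above.
    loss = losses.CosineSimilarityLoss(model)
    model.fit(train_objectives=[(loader, loss)], epochs=epochs, warmup_steps=100)
    return model
```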
[5] https://github.com/VHellendoorn/Code-LMs#evaluation
[6] https://github.com/markriedl/WikiPlots
Authors:
(1) Aman Madaan, Language Technologies Institute, Carnegie Mellon University, USA ([email protected]);
(2) Shuyan Zhou, Language Technologies Institute, Carnegie Mellon University, USA ([email protected]);
(3) Uri Alon, Language Technologies Institute, Carnegie Mellon University, USA ([email protected]);
(4) Yiming Yang, Language Technologies Institute, Carnegie Mellon University, USA ([email protected]);
(5) Graham Neubig, Language Technologies Institute, Carnegie Mellon University, USA ([email protected]).