Table of Links
Abstract and 1 Introduction
2 COCOGEN: Representing Commonsense structures with code and 2.1 Converting (T,G) into Python code
2.2 Few-shot prompting for generating G
3 Evaluation and 3.1 Experimental setup
3.2 Script generation: PROSCRIPT
3.3 Entity state tracking: PROPARA
3.4 Argument graph generation: EXPLAGRAPHS
4 Analysis
5 Related work
6 Conclusion, Acknowledgments, Limitations, and References
A Few-shot model size estimates
B Dynamic prompt creation
C Human Evaluation
D Dataset statistics
E Sample outputs
F Prompts
G Designing Python class for a structured task
H Impact of Model size
I Variation in prompts
A Few-shot model size estimates
As OpenAI has not released any details about the size of their few-shot models, we estimate their relative strengths and weaknesses on code and text generation by calculating the average loss per token. To calculate the average loss of each of these models on code, we use the implementation provided by Xu et al. (2022).[5] Perplexity on text was evaluated on 30 random Wikipedia pages from Wikiplots[6] following a similar procedure. The structure and text generation capabilities of the models are apparent from the results in Table 7: DAVINCI outperforms CODEX on text generation but is worse on code generation, and vice versa. CURIE underperforms both DAVINCI and CODEX significantly. Importantly, these results show that CODEX and DAVINCI are of comparable capacities, making their comparison fair.
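Since the OpenAI models are accessible only through an API, the snippet below is a minimal sketch of computing average loss per token with an open causal language model from HuggingFace Transformers; the checkpoint name is a placeholder stand-in, not the models compared in Table 7.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder open model; the OpenAI models compared in Table 7 are API-only.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def avg_loss_per_token(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # `out.loss` is the mean token-level cross-entropy (in nats);
    # exp(loss) gives perplexity.
    return out.loss.item()
```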
B Dynamic prompt creation
As an alternative to constructing prompts from a fixed set of examples, there is now growing interest in customizing the in-context examples for each test example Ttest. Popular techniques typically train a retriever, which is used to fetch the examples in the training set that are closest to Ttest (Liu et al., 2021; Rubin et al., 2021; Poesia et al., 2021).
Specifically, Poesia et al. (2021) train a retriever with a target-similarity tuning (TST) objective over a corpus D of (x, y) examples. TST learns an embedding function f such that for any pair of examples (xi, yi) and (xj, yj), if yi ∼ yj then f(xi) ∼ f(xj). For a new input x, f(x) is used to retrieve the closest examples from D.
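As an illustration, retrieval with such an encoder can be implemented as sketched below using SentenceTransformers; the checkpoint name and the value of k are placeholders, not the configuration used in the experiments.

```python
from sentence_transformers import SentenceTransformer, util

# Assumption: `train_inputs` is a list of training inputs x; an off-the-shelf
# mpnet checkpoint stands in for the learned embedding function f.
encoder = SentenceTransformer("all-mpnet-base-v2")

def closest_examples(t_test, train_inputs, k=15):
    corpus_emb = encoder.encode(train_inputs, convert_to_tensor=True)
    query_emb = encoder.encode(t_test, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, corpus_emb, top_k=k)[0]
    # Return the indices of the k training examples closest to t_test.
    return [hit["corpus_id"] for hit in hits]
```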
We follow Poesia et al. (2021) and train a knowledge-similarity tuner (KST). We use mpnet-base[7] with SentenceTransformers (Reimers and Gurevych, 2019) to fine-tune a retrieval function f by minimizing the following loss:
where fθ is parameterized using a transformer.
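One way to instantiate this objective, following the TST setup of Poesia et al. (2021), is to regress the cosine similarity of the two input embeddings onto a similarity score S(Gi, Gj) computed over the corresponding output graphs; the exact graph-similarity measure S is an assumption here:

$$
\mathcal{L}(\theta) = \sum_{(T_i, G_i),\, (T_j, G_j) \in D} \Big( \cos\big(f_\theta(T_i), f_\theta(T_j)\big) - S(G_i, G_j) \Big)^2
$$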
Results of using KST with PROSCRIPT (Table 8) and EXPLAGRAPHS (Table 9) show that while KST is highly effective for edge prediction, the results are mixed for EXPLAGRAPHS and PROSCRIPT. For PROSCRIPT, KST yields marginal gains. For EXPLAGRAPHS, however, a number of training examples have overlapping themes (Table 10), and thus creating the prompt dynamically reduces the effective information in the prompt.
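For concreteness, the following is a minimal sketch of how such a tuner could be fine-tuned with the SentenceTransformers library; the pair construction, the precomputed graph similarities, and the base checkpoint name are illustrative assumptions rather than the exact training setup.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Assumption: `pairs` holds tuples (T_i, T_j, sim), where sim = S(G_i, G_j)
# is a precomputed similarity between the corresponding output graphs.
def train_kst(pairs, base_model="all-mpnet-base-v2", epochs=1):
    model = SentenceTransformer(base_model)
    examples = [InputExample(texts=[t_i, t_j], label=float(sim))
                for t_i, t_j, sim in pairs]
    loader = DataLoader(examples, shuffle=True, batch_size=16)
    # CosineSimilarityLoss regresses cos(f(T_i), f(T_j)) onto the label,
    # matching the squared-error objective sketched above.
    loss = losses.CosineSimilarityLoss(model)
    model.fit(train_objectives=[(loader, loss)], epochs=epochs, warmup_steps=100)
    return model
```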
[5] https://github.com/VHellendoorn/Code-LMs#evaluation
[6] https://github.com/markriedl/WikiPlots
Authors:
(1) Aman Madaan, Language Technologies Institute, Carnegie Mellon University, USA ([email protected]);
(2) Shuyan Zhou, Language Technologies Institute, Carnegie Mellon University, USA ([email protected]);
(3) Uri Alon, Language Technologies Institute, Carnegie Mellon University, USA ([email protected]);
(4) Yiming Yang, Language Technologies Institute, Carnegie Mellon University, USA ([email protected]);
(5) Graham Neubig, Language Technologies Institute, Carnegie Mellon University, USA ([email protected]).