Think-and-Execute Improves Algorithmic Reasoning: Here’s How


Table of Links

Abstract and 1. Introduction
2 Think-and-Execute
3 Experimental Setup
4 Results
5 Analysis
6 Related Work
7 Limitations and Discussion
8 Conclusion and References
A Experimental Details
B Details of Think-and-Execute
C Prompts Used in Our Experiments
D Human-written Pseudocode Prompts
E Generated Analyses
F Generated Pseudocode Prompts
G Qualitative Analysis

4 Results

4.1 THINK-AND-EXECUTE Improves Algorithmic Reasoning

We start by comparing our framework with direct prompting and zero-shot CoT (Kojima et al., 2022) in Table 1. We find that zero-shot CoT performs better than direct prompting, with an average improvement of 11.1% with GPT-3.5-Turbo, suggesting that zero-shot CoT is a strong baseline. Our THINK-AND-EXECUTE, however, significantly outperforms both of them regardless of model size, which indicates that explicitly generating a plan is a more effective way to improve the LLM’s reasoning capabilities than simply encouraging LLMs to generate their intermediate reasoning steps.
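To make this comparison concrete, the following is a minimal sketch of how the three prompting conditions might be assembled for a single instance. The prompt wording here is an illustrative assumption, not the prompts used in the paper (those are given in Appendix C).

# Illustrative sketch only; the prompt wording is assumed, not taken from the paper.

def direct_prompt(instance: str) -> str:
    # Direct prompting: ask for the answer with no intermediate reasoning.
    return f"{instance}\nAnswer:"

def zero_shot_cot_prompt(instance: str) -> str:
    # Zero-shot CoT (Kojima et al., 2022): elicit free-form step-by-step reasoning.
    return f"{instance}\nLet's think step by step."

def think_and_execute_prompt(task_pseudocode: str, instance: str) -> str:
    # THINK-AND-EXECUTE: a task-level pseudocode prompt, discovered once per task,
    # is prepended, and the model simulates its execution on the instance.
    return (
        "Simulate the execution of the following pseudocode on the given input "
        "and report the final output.\n\n"
        f"{task_pseudocode}\n\nInput: {instance}\nOutput:"
    )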

4.2 Task-level Pseudocode Prompts Benefit a Wider Range of Algorithmic Reasoning Tasks than Instance-specific Python Code

In Table 1, PoT shows performance gains over direct prompting on some tasks (e.g., Navigate; Tracking Shuffled Objects) by generating Python code specifically for each instance and taking the corresponding interpreter output as the answer. However, such improvements are difficult to generalize to all tasks, e.g., 0.4% accuracy on both Dyck Language and Temporal Sequences with GPT-3.5-Turbo. By contrast, THINK-AND-EXECUTE outperforms PoT and direct prompting on all tasks with GPT-3.5-Turbo. This suggests that devising the task-level strategy with pseudocode and applying it to each instance can benefit the LLM’s reasoning on a wider range of algorithmic reasoning tasks than generating instance-specific Python code.

Figure 3: Ablation study of the components of the pseudocode prompt using GPT-3.5-Turbo.

Table 2: Ablation on Step 2 of the THINK phase.
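To illustrate the contrast, the sketch below shows what a task-level pseudocode prompt might look like for the Dyck Language task (completing the closing brackets). It is a toy example written in the style the paper describes, not a prompt taken from the paper; the actual generated prompts are listed in Appendix F.

# Hypothetical task-level pseudocode prompt for Dyck Language (illustration only).
# The same prompt is reused for every instance of the task; the LLM simulates its
# execution step by step instead of sending instance-specific code to an interpreter.
def complete_dyck_sequence(sequence: str) -> str:
    matching = {"(": ")", "[": "]", "{": "}", "<": ">"}
    stack = []                                    # open brackets awaiting closure
    for char in sequence.split():
        if char in matching:                      # opening bracket: remember it
            stack.append(char)
        else:                                     # closing bracket: discharge its partner
            stack.pop()
        print(f"char={char}, stack={stack}")      # expose the intermediate state
    # Close the remaining open brackets in reverse order.
    return " ".join(matching[c] for c in reversed(stack))

print(complete_dyck_sequence("( [ { } ]"))        # expected completion: ")"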

4.3 The Logic Discovered by an LLM can be Transferred to SLMs

We further explore whether the pseudocode prompt written by an LLM (i.e., GPT-3.5-Turbo as the instructor) can be applied to smaller LMs, namely the CodeLlama family, in Table 1. When applying the pseudocode prompts generated by GPT-3.5-Turbo, CodeLlama-7B and -13B significantly outperform direct prompting. Moreover, THINK-AND-EXECUTE with CodeLlama-13B shows performance comparable to GPT-3.5-Turbo with PoT and direct prompting.
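A minimal sketch of this transfer setting is given below. It assumes generic text-generation callables for the instructor and executor models (any API or local inference wrapper would do); the function names and prompt wording are illustrative assumptions.

from typing import Callable

# Sketch of the LLM-to-SLM transfer in Section 4.3: `instructor` would wrap
# GPT-3.5-Turbo and `executor` a CodeLlama model; both are assumed helpers here.

def think(instructor: Callable[[str], str], task_description: str,
          example_instances: list[str]) -> str:
    # THINK: the instructor LLM analyzes the task once and writes a
    # task-level pseudocode prompt that captures the shared logic.
    analysis = instructor(
        "Analyze the reasoning required to solve this task:\n"
        + task_description + "\n" + "\n".join(example_instances)
    )
    return instructor(
        "Based on the analysis below, write pseudocode that solves the task.\n" + analysis
    )

def execute(executor: Callable[[str], str], pseudocode_prompt: str, instance: str) -> str:
    # EXECUTE: the smaller LM reuses the prompt discovered by the instructor
    # and simulates its execution on each individual instance.
    return executor(
        "Simulate the execution of the pseudocode on the input and report the "
        "final output.\n" + pseudocode_prompt + "\nInput: " + instance
    )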

4.4 Pseudocode Better Describes the Logic for Solving a Task than Natural Language

We also compare our approach with NL planning, a variant of ours that uses natural language instead of pseudocode to write the task-level instruction. In practice, we provide human-written NL plans that contain a similar amount of information to P in the meta prompt and use them to generate the task-level NL plan for the given task. Surprisingly, although the LMs are fine-tuned to follow natural language instructions, we find that task-level pseudocode prompts boost their performance more than NL plans (Table 1).

4.5 Ablation Studies

Components of the pseudocode prompt. We conduct an ablation study on each component of the pseudocode prompt. For this, we prepare four types of pseudocode prompts: (1) human-written pseudocode; (2) human-written prompt w/o comments and semantics, obtained by removing the comments that explain the code and replacing variable names with meaningless letters, such as X, Y, and Z; (3) human-written prompt w/ for loop; and (4) w/ intermediate print() statements. The results are shown in Figure 3. Model performance decreases significantly when applying prompts w/o comments and semantics, especially on Temporal Sequences. This implies that semantics play an important role in guiding the LLMs to apply the discovered logic and reason with it accordingly. Also, we find that printing out the intermediate execution steps with print() is crucial for reasoning, which is consistent with the finding of Wei et al. (2022).
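For intuition, here is a toy contrast, in the style of a Tracking Shuffled Objects routine, between a full pseudocode prompt and its "w/o comments and semantics" variant. This is an illustrative example of the ablation, not one of the ablated prompts used in the paper.

# Full prompt style: semantic names, explanatory comments, intermediate print().
def track_swapped_objects(owner_of: dict[str, str], swaps: list[tuple[str, str]]) -> dict[str, str]:
    for person_a, person_b in swaps:              # each swap exchanges the two objects
        owner_of[person_a], owner_of[person_b] = owner_of[person_b], owner_of[person_a]
        print(f"after swap ({person_a}, {person_b}): {owner_of}")  # expose intermediate state
    return owner_of

# "w/o comments and semantics" style: comments stripped, names replaced with
# meaningless letters, no intermediate output; this is the variant whose removal
# of semantics degrades performance in Figure 3.
def F(X, Y):
    for a, b in Y:
        X[a], X[b] = X[b], X[a]
    return X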

Generating the analysis before the pseudocode prompt. Table 2 shows a notable decrease in model performance when generating pseudocode prompts without first conducting the analysis.

Table 3: Left: Comparison of THINK-AND-EXECUTE, Chain-of-Code (Li et al., 2023), and Plan-and-Solve (Wang et al., 2023) using GPT-3.5-Turbo. Right: Comparison of THINK-AND-EXECUTE and Self-Discover (Zhou et al., 2024) using GPT-4. The results of Self-Discover are taken from the original paper, as the code and prompts are not provided.

4.6 Comparison with other Baselines

We further compare THINK-AND-EXECUTE with three additional baselines: (1) Plan-and-Solve (Wang et al., 2023), where an LLM sequentially generates a natural language plan for solving the given instance, step-by-step reasoning according to the plan, and the final answer; (2) Chain-of-Code (Li et al., 2023), where Python code is generated as part of the intermediate reasoning steps specifically for a given instance; and (3) Self-Discover (Zhou et al., 2024), a concurrent work that devises a task-level reasoning structure in JSON format before performing inference on the instance. First, as presented in Table 3 (Left), we find that THINK-AND-EXECUTE largely outperforms Plan-and-Solve and Chain-of-Code, by 10.9 and 32.3 percentage points in accuracy, respectively. Second, while Self-Discover also incorporates task-level instruction, as shown in Table 3 (Right), our THINK-AND-EXECUTE with pseudocode prompts performs better when using GPT-4 (Achiam et al., 2023).[3] These findings indicate that generating (1) task-level instructions with (2) pseudocode can better represent the necessary logic for solving a task and benefit the LLM’s algorithmic reasoning ability.


[3] We use gpt-4-0613 for GPT-4.

Authors:

(1) Hyungjoo Chae, Yonsei University;

(2) Yeonghyeon Kim, Yonsei University;

(3) Seungone Kim, KAIST AI;

(4) Kai Tzu-iunn Ong, Yonsei University;

(5) Beong-woo Kwak, Yonsei University;

(6) Moohyeon Kim, Yonsei University;

(7) Seonghwan Kim, Yonsei University;

(8) Taeyoon Kwon, Yonsei University;

(9) Jiwan Chung, Yonsei University;

(10) Youngjae Yu, Yonsei University;

(11) Jinyoung Yeo, Yonsei University.
