World of Software
Why LLMs Struggle with Arithmetic Puzzles | HackerNoon

News Room. Published 23 August 2025. Last updated: 2025/08/23 at 7:19 PM.

:::info
Authors:

(1) Haolong Li, Tongji University and work done during internship at ByteDance ([email protected]);

(2) Yu Ma, Seed Foundation, ByteDance ([email protected]);

(3) Yinqi Zhang, East China Normal University and work done during internship at ByteDance ([email protected]);

(4) Chen Ye (Corresponding Author), ESSC Lab, Tongji University ([email protected]);

(5) Jie Chen, Seed Foundation, ByteDance and a Project Leader ([email protected]).

:::

Table of Links

Abstract and 1 Introduction

2 Problem Definition

2.1 Arithmetical Puzzle Problem

2.2 Data Synthesizing

2.3 Dataset

3 Model

4 Experiments

4.1 Evaluation

4.2 Results

4.3 Case Studies

5 Conclusion and Acknowledgements

6 Limitations

7 Ethics Statement and References

A Appendix

A.1 Hyperparameter Settings

A.2 Evaluation of the Base Model

A.3 Case Study

A.4 Visualization of the Proposed Puzzle

A.1 Hyperparameter Settings

In the SFT stage, we follow common fine-tuning hyperparameter settings for our model. We set the learning rate to 1e−4 and adopt a cosine learning rate scheduler. We use low-rank adaptation (LoRA) tuning with a rank of 5, α of 32, and dropout of 0.05. We employ the AdamW optimizer with β1 = 0.9, β2 = 0.95, and ϵ = 1e−9. Eight NVIDIA A100-SXM4-80GB GPUs are used to train the model with a batch size of 50 and the maximum number of epochs set to 5. Detailed settings are listed in Table 3.
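
As a rough illustration of the settings above, the following sketch collects the stated values in one place and shows how a cosine learning rate schedule behaves. The `SFTConfig` class, its field names, the `cosine_lr` helper, and the step count are hypothetical and not taken from the paper's code. The schedule assumes no warmup and a decay to zero, neither of which is stated in the text.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SFTConfig:
    """Hypothetical container for the values quoted in A.1."""
    learning_rate: float = 1e-4
    lora_rank: int = 5
    lora_alpha: int = 32
    lora_dropout: float = 0.05
    adam_beta1: float = 0.9
    adam_beta2: float = 0.95
    adam_epsilon: float = 1e-9
    batch_size: int = 50
    max_epochs: int = 5


def cosine_lr(step: int, total_steps: int, base_lr: float) -> float:
    """Cosine decay from base_lr to 0 (assumption: no warmup, floor of 0)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


cfg = SFTConfig()
total_steps = 1000  # hypothetical number of optimizer steps
for step in (0, total_steps // 2, total_steps):
    print(step, cosine_lr(step, total_steps, cfg.learning_rate))
```

Under these assumptions the learning rate starts at the configured 1e−4, passes through roughly half of that at the midpoint of training, and reaches zero at the final step.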

A.2 Evaluation of the Base Model

We evaluate the base model (open-llama-3B) on the proposed arithmetical puzzle problem. As shown in Table 4 and Table 5, the base model performs poorly on the puzzle with either few-shot prompting (2-Shot, 8-Shot) or Chain-of-Thought (CoT) prompting. We hypothesize that this is due to the symbolic form of our prompt: the model needs to understand the underlying pattern in order to solve the arithmetical puzzle. Without fine-tuning on the synthetic data, the model may struggle to comprehend this type of prompt.

Table 4: Evaluation of the base model with few-shot and Chain-of-Thought prompting. As expected, the base model performs poorly across all the prompting techniques.

Table 5: An example of Chain-of-Thought prompting and the generated response of the base model.

We further test several open-source models (Llama-2-7B (Touvron et al., 2023a), Deepseek-Coder-33B (Guo et al., 2024)) and a closed-source model (GPT4 (Achiam et al., 2023)) with few-shot prompting. As shown in Table 6, these models also perform poorly on our benchmarks. In Table 7, we provide an example of the few-shot prompting and the generated responses from these models.

Table 6: Evaluation results of Llama-2-7B, Deepseek-Coder-33B, and GPT4 on our proposed benchmarks.

Table 7: An example of few-shot prompting and the generated responses of GPT4, Llama-2-7B, and Deepseek-Coder-33B. We provide the models with two examples before the puzzle. As shown, all of the models fail to solve the given problem. GPT4 seems to understand the requirement of the puzzle, while the other two do not.

As shown in Table 7, Llama-2-7B fails to understand the requirement of the puzzle and just outputs two meaningless equations. Deepseek-Coder-33B treats the second example in the few-shot prompt as the puzzle and repeats the same calculations three times. GPT4 appears to have understood the prompt: it uses each candidate integer only once, and every calculation in its response is correct, yet the final solution is wrong. This kind of problem is indeed very challenging, as the model needs to infer the requirement of the puzzle from the provided examples and then figure out the correct solution.
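
The distinction drawn above (correct calculations, correct use of the candidates, but a wrong solution) can be made concrete with a small checker. The sketch below is illustrative only: the `diagnose` function, its step format, and the toy puzzle are hypothetical and are not the paper's evaluation code or the content of Table 7. It assumes the four basic operations and that each candidate integer and each intermediate result must be consumed exactly once.

```python
import operator
from collections import Counter
from fractions import Fraction

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}


def diagnose(candidates, target, steps):
    """Check a response given as (left, op, right, claimed_result) tuples."""
    pool = Counter(Fraction(c) for c in candidates)  # numbers still unused
    arithmetic_ok, usage_ok, last = True, True, None
    for left, op, right, claimed in steps:
        left, right, claimed = Fraction(left), Fraction(right), Fraction(claimed)
        # 1) Is the equation itself correct?
        if (op == "/" and right == 0) or OPS[op](left, right) != claimed:
            arithmetic_ok = False
        # 2) Are both operands available (candidate or earlier result)?
        for operand in (left, right):
            if pool[operand] > 0:
                pool[operand] -= 1
            else:
                usage_ok = False
        pool[claimed] += 1
        last = claimed
    leftovers = sum((+pool).values())  # unary plus drops zero counts
    return {
        "arithmetic_ok": arithmetic_ok,
        "all_used_once": usage_ok and leftovers == 1,
        "reaches_target": last == Fraction(target),
    }


# Toy puzzle (hypothetical): candidates 2, 3, 5 and target 11.
good = [(2, "*", 3, 6), (6, "+", 5, 11)]
bad = [(2, "+", 3, 5), (5, "*", 5, 25)]  # valid arithmetic, wrong answer
print(diagnose([2, 3, 5], 11, good))
print(diagnose([2, 3, 5], 11, bad))
```

The second response mirrors the failure mode described for GPT4: every equation is right and every number is used once, but the final value misses the target.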

A.3 Case Study

Figure 4: Cases from the form OOD test dataset. The correct steps are highlighted in green and the incorrect steps in red. Generally speaking, the model fine-tuned with 1M training data performs the worst.

A.4 Visualization of the Proposed Puzzle

Figure 5: Visualization of the proposed arithmetical puzzle. Given the candidate integers 3, 6, 7, 51, 58 and the target integer 4, the answer is 58 − 51 = 7, 6 − 7 = −1, 3 × (−1) = −3, −3 + 7 = 4.
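
To give a feel for the search space behind the Figure 5 instance, here is a minimal brute-force sketch. It is not the paper's data synthesis or solving procedure. The `solve` function is hypothetical, and it assumes that every candidate and intermediate result is used exactly once, that the operations are +, −, ×, and ÷, and that division is allowed only when it is exact.

```python
def solve(numbers, target):
    """Return a list of step strings that reach target, or None if none exists."""
    if len(numbers) == 1:
        return [] if numbers[0] == target else None
    for i in range(len(numbers)):
        for j in range(len(numbers)):
            if i == j:
                continue
            a, b = numbers[i], numbers[j]
            # Numbers not consumed by this step stay in the pool.
            rest = [n for k, n in enumerate(numbers) if k not in (i, j)]
            options = [("+", a + b), ("−", a - b), ("×", a * b)]
            if b != 0 and a % b == 0:  # assumption: exact division only
                options.append(("÷", a // b))
            for symbol, value in options:
                tail = solve(rest + [value], target)
                if tail is not None:
                    return [f"{a} {symbol} {b} = {value}"] + tail
    return None


steps = solve([3, 6, 7, 51, 58], 4)
print(steps)
```

With five candidates, any complete answer has four steps. The search may return a different valid sequence from the one shown in Figure 5, since a puzzle can have more than one solution.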

:::info
This paper is available on arXiv under the CC BY-NC-SA 4.0 Deed (Attribution-Noncommercial-Sharelike 4.0 International) license.

:::
