World of Software

Pinterest Reduces Spark OOM Failures by 96% Through Auto Memory Retries

News Room
Last updated: 2026/04/06 at 11:32 AM
News Room Published 6 April 2026

Pinterest Engineering has significantly improved the reliability of its Apache Spark workloads, cutting out-of-memory (OOM) failures by 96% through a combination of improved observability, configuration tuning, and automatic memory retries. This work addresses persistent job failures that disrupted pipelines, increased on-call load, and threatened timely analytics for memory-heavy workloads powering recommendation systems and large-scale data processing.

For years, OOM errors were a recurring problem. Jobs would fail late in execution, often after hours of computation, forcing engineers to manually adjust memory settings to keep pipelines running. Each late failure delayed downstream processes and pulled teams away from delivering new features. Fixing the problem required both technical and workflow-level changes that reduced failures while minimizing manual effort.

A critical first step was improving visibility into how jobs consumed memory. Engineers built detailed metrics for executor memory usage, shuffle operations, and task execution times. This data helped identify hotspots, skewed partitions, and stages that were unusually resource-hungry. As Pinterest engineers explained in their blog, understanding where memory is consumed within a job is critical to addressing failures effectively. By knowing exactly where problems arose, the team could make precise adjustments rather than simply adding memory across the board.
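
To make the idea of spotting skewed partitions from task metrics concrete, here is a minimal sketch. It is not Pinterest's tooling: the function name `find_skewed_stages`, the 5x threshold, and the sample byte counts are all assumptions chosen for illustration. It flags a stage when its largest task processes far more data than the median task.

```python
from statistics import median


def find_skewed_stages(stage_task_bytes, ratio_threshold=5.0):
    """Return {stage_id: max/median ratio} for stages that look skewed.

    stage_task_bytes maps a stage id to the bytes processed by each task.
    The threshold is a hypothetical default, not a published value.
    """
    skewed = {}
    for stage_id, task_bytes in stage_task_bytes.items():
        if not task_bytes:
            continue  # nothing recorded for this stage
        typical = median(task_bytes)
        if typical <= 0:
            continue  # no meaningful baseline to compare against
        ratio = max(task_bytes) / typical
        if ratio >= ratio_threshold:
            skewed[stage_id] = ratio
    return skewed


# Hypothetical per-task shuffle-read sizes, in bytes.
metrics = {
    "stage-3": [100, 110, 95, 105, 2000],  # one task dominates
    "stage-4": [100, 120, 90, 110, 130],   # evenly spread
}
skewed = find_skewed_stages(metrics)
```

With this sample data only `stage-3` is flagged, which is the kind of signal that points to repartitioning one stage instead of raising memory for the whole job.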

Visualizing executor-level memory usage and Auto Memory Retry in Spark workflows (Source: Pinterest Blog Post)

Configuration tuning complemented these insights. Spark settings for memory allocation, shuffle partitions, and broadcast joins were optimized for workload patterns. Adaptive query execution allowed the system to adjust partitioning dynamically, reducing memory pressure during heavy stages. Additional preprocessing helped smooth out data skew, and validation checks flagged unusually large or anomalous datasets before they could trigger failures. For high-risk jobs, human review remained part of the workflow, ensuring pipelines stayed stable and predictable.
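
The settings described above correspond to standard Spark configuration keys. The sketch below collects a few of them and renders them as `spark-submit` arguments. The keys are real Spark properties, but every value is an illustrative assumption; the article does not publish the values Pinterest uses, and the helper `to_submit_args` is hypothetical.

```python
# Illustrative values only; tune per workload.
spark_conf = {
    # Executor heap and off-heap overhead.
    "spark.executor.memory": "8g",
    "spark.executor.memoryOverhead": "2g",
    # Initial number of shuffle partitions for SQL/DataFrame jobs.
    "spark.sql.shuffle.partitions": "2000",
    # Adaptive query execution: re-plan partitioning at runtime.
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
    # Largest table size (bytes) that Spark will broadcast in a join.
    "spark.sql.autoBroadcastJoinThreshold": str(10 * 1024 * 1024),
}


def to_submit_args(conf):
    """Render a config dict as a flat list of spark-submit --conf flags."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


submit_args = to_submit_args(spark_conf)
```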

Auto Memory Retries represented a major workflow shift. Jobs that previously failed due to memory exhaustion could now automatically restart with updated memory settings. This automation eliminated much of the manual tuning that had been consuming engineering time while letting pipelines finish without changing core job logic.
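
The retry behaviour can be sketched as a loop that resubmits with more memory after an OOM failure. This is a simplified illustration under stated assumptions, not Pinterest's implementation: the `ExecutorOOMError` exception, the `submit` callback, the 1.5x growth factor, the memory cap, and the attempt limit are all hypothetical.

```python
class ExecutorOOMError(Exception):
    """Hypothetical signal that a run failed from memory exhaustion."""


def run_with_memory_retry(submit, initial_mb, growth=1.5,
                          max_mb=65536, max_attempts=3):
    """Call submit(memory_mb), growing memory after each OOM failure.

    Returns (result, memory_mb_used, attempts). Re-raises the OOM error
    once attempts are exhausted or memory cannot grow any further.
    """
    memory_mb = initial_mb
    for attempt in range(1, max_attempts + 1):
        try:
            return submit(memory_mb), memory_mb, attempt
        except ExecutorOOMError:
            next_mb = min(int(memory_mb * growth), max_mb)
            if attempt == max_attempts or next_mb == memory_mb:
                raise
            memory_mb = next_mb


def fake_submit(memory_mb):
    """Stand-in for a real job launch: needs at least 12000 MB."""
    if memory_mb < 12000:
        raise ExecutorOOMError(f"OOM at {memory_mb} MB")
    return "ok"


outcome = run_with_memory_retry(fake_submit, initial_mb=8192)
```

Only memory-related failures trigger a retry here; other exceptions propagate immediately, so unrelated bugs are not masked by repeated resubmission.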

The rollout was staged carefully. Engineers started with ad hoc jobs, ramping from 0% to 100%, and then moved to scheduled jobs, beginning with lower-priority tiers and eventually applying the feature to critical workloads. A dashboard tracked key metrics such as recovered jobs, cost savings, MB-seconds and vCore-seconds saved, and post-retry failures. This staged approach allowed the team to catch issues early, ensure reliability, and fine-tune retries before full deployment.
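
One common way to ramp a feature from 0% to 100% is deterministic hash bucketing, sketched below. The article does not say how Pinterest selected jobs at each stage, so this technique, the function `in_rollout`, and the job ids are assumptions for illustration only.

```python
import hashlib


def in_rollout(job_id, percent):
    """Return True if job_id falls inside the first `percent` of 100 buckets.

    Hashing keeps the decision stable across runs, so a job enabled at 10%
    stays enabled as the percentage increases.
    """
    digest = hashlib.sha256(job_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


# Hypothetical job ids.
jobs = [f"job-{i}" for i in range(1000)]
at_10 = {j for j in jobs if in_rollout(j, 10)}
at_50 = {j for j in jobs if in_rollout(j, 50)}
```

Because each job's bucket never changes, raising the percentage only adds jobs to the enabled set, which keeps comparisons between ramp stages consistent.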

Along the way, teams learned important operational lessons, including improving scheduler performance for large TaskSets, handling custom resource profiles for Apache Gluten compatibility, and adjusting host failure exclusions so OOM failures no longer blocked retries. Future work includes proactive memory increases, where tasks in high-risk stages receive extra memory before failing, further reducing retries and cluster overhead.
