Pinterest Reduces Spark OOM Failures by 96% Through Auto Memory Retries

News Room
Last updated: 2026/04/06 at 11:32 AM
News Room Published 6 April 2026

Pinterest Engineering has significantly improved the reliability of its Apache Spark workloads, cutting out-of-memory (OOM) failures by 96% through a combination of improved observability, configuration tuning, and automatic memory retries. This work addresses persistent job failures that disrupted pipelines, increased on-call load, and threatened timely analytics for memory-heavy workloads powering recommendation systems and large-scale data processing.

For years, OOM errors were a recurring problem. Jobs would fail late in execution, often after hours of computation, forcing engineers to manually adjust memory settings to keep pipelines running. Each failure delayed downstream processes and made it harder for teams to focus on delivering new features. Fixing the problem required both technical and workflow-level solutions to reduce failures while minimizing manual effort.

A critical first step was improving visibility into how jobs consumed memory. Engineers built detailed metrics for executor memory usage, shuffle operations, and task execution times. This data helped identify hotspots, skewed partitions, and stages that were unusually resource-hungry. As Pinterest engineers explained in their blog, understanding where memory is consumed within a job is critical to addressing failures effectively. By knowing exactly where problems arose, the team could make precise adjustments rather than simply adding memory across the board.
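As a rough illustration of how per-task memory metrics can expose skew, the sketch below flags tasks whose peak memory is far above the stage median. The function name, the 3x threshold, and the sample numbers are assumptions made for this example; they are not taken from Pinterest's tooling.

```python
from statistics import median


def find_skewed_tasks(task_peak_mem_mb, ratio=3.0):
    """Return sorted IDs of tasks whose peak memory exceeds ratio x the stage median.

    task_peak_mem_mb maps a (hypothetical) task ID to its peak memory in MB.
    """
    if not task_peak_mem_mb:
        return []
    med = median(task_peak_mem_mb.values())
    if med <= 0:
        return []
    return sorted(t for t, mb in task_peak_mem_mb.items() if mb > ratio * med)


# Illustrative numbers only: three similar tasks and one outlier.
stage_metrics = {"task-0": 900, "task-1": 1100, "task-2": 1000, "task-3": 6200}
skewed = find_skewed_tasks(stage_metrics)
print(skewed)
```

A median-based threshold is used here because a single very large task would inflate a mean and hide the outlier it is supposed to reveal.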

Visualizing executor-level memory usage and Auto Memory Retry in Spark workflows (Source: Pinterest Blog Post)

Configuration tuning complemented these insights. Spark settings for memory allocation, shuffle partitions, and broadcast joins were optimized for workload patterns. Adaptive query execution allowed the system to adjust partitioning dynamically, reducing memory pressure during heavy stages. Additional preprocessing helped smooth out data skew, and validation checks flagged unusually large or anomalous datasets before they could trigger failures. For high-risk jobs, human review remained part of the workflow, ensuring pipelines stayed stable and predictable.
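The kinds of settings described above correspond to standard Spark configuration keys. The sketch below collects a few of them and renders them as `spark-submit` arguments. The keys are real Spark properties, but every value is a placeholder chosen for illustration, and the helper `to_submit_args` is hypothetical; none of this reflects Pinterest's actual configuration.

```python
# Placeholder values; appropriate settings depend on the workload and cluster.
spark_conf = {
    "spark.executor.memory": "8g",
    "spark.executor.memoryOverhead": "2g",
    "spark.sql.shuffle.partitions": "2000",
    # Threshold is in bytes; 64 MiB here is an arbitrary example.
    "spark.sql.autoBroadcastJoinThreshold": str(64 * 1024 * 1024),
    # Adaptive query execution: re-plan partitioning at runtime.
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
}


def to_submit_args(conf):
    """Render a config dict as a flat list of spark-submit --conf arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


submit_args = to_submit_args(spark_conf)
print(" ".join(submit_args))
```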

Auto Memory Retries represented a major workflow shift. Jobs that previously failed due to memory exhaustion could now automatically restart with updated memory settings. This automation eliminated much of the manual tuning that had been consuming engineering time while letting pipelines finish without changing core job logic.
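The general idea of retrying with more memory can be sketched as a simple loop. This is a minimal job-level illustration, assuming a hypothetical `run_job` callable that raises a custom `OutOfMemory` exception; the growth factor, memory cap, and attempt limit are invented for the example. It is not Pinterest's implementation, which the article does not describe in detail.

```python
class OutOfMemory(Exception):
    """Stand-in for however an OOM failure is reported (an assumption)."""


def run_with_memory_retries(run_job, start_mb, growth=1.5, max_mb=32768, max_attempts=4):
    """Run run_job(memory_mb), raising memory after each OOM failure.

    Returns (result, memory_mb_used, attempts). Re-raises OutOfMemory once the
    attempt limit is reached or memory can no longer be increased.
    """
    memory_mb = start_mb
    for attempt in range(1, max_attempts + 1):
        try:
            return run_job(memory_mb), memory_mb, attempt
        except OutOfMemory:
            next_mb = min(int(memory_mb * growth), max_mb)
            if attempt == max_attempts or next_mb == memory_mb:
                raise
            memory_mb = next_mb


def fake_job(memory_mb):
    """Simulated job that needs at least 10000 MB to finish."""
    if memory_mb < 10000:
        raise OutOfMemory(f"needed more than {memory_mb} MB")
    return "ok"


outcome = run_with_memory_retries(fake_job, start_mb=4096)
print(outcome)
```

Capping both the memory and the number of attempts keeps a job that can never fit from consuming cluster resources indefinitely.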

The rollout was staged carefully. Engineers started with ad hoc jobs, ramping from 0% to 100%, and then moved to scheduled jobs, beginning with lower-priority tiers and eventually applying the feature to critical workloads. A dashboard tracked key metrics such as recovered jobs, cost savings, MB and vCore seconds saved, and post-retry failures. This staged approach allowed the team to catch issues early, ensure reliability, and fine-tune retries before full deployment.
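One common way to ramp a feature from 0% to 100% is to hash each job identifier into a stable bucket and enable the feature for buckets below the current percentage. The sketch below shows that general technique; the function name and job IDs are hypothetical, and the article does not say how Pinterest selected jobs at each stage.

```python
import hashlib


def in_rollout(job_id, percent):
    """Return True if job_id falls inside the first `percent` of 100 stable buckets."""
    digest = hashlib.sha256(job_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


# Hypothetical job IDs, used only to show how the enabled set grows.
job_ids = [f"adhoc-job-{i}" for i in range(1000)]
for percent in (0, 25, 50, 100):
    enabled = sum(in_rollout(j, percent) for j in job_ids)
    print(percent, enabled)
```

Because the bucket depends only on the job ID, a job enabled at 25% stays enabled at every higher percentage, which makes regressions easier to attribute during the ramp.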

Along the way, teams learned important operational lessons, including improving scheduler performance for large TaskSets, handling custom resource profiles for Apache Gluten compatibility, and adjusting host failure exclusions so OOM failures no longer blocked retries. Future work includes proactive memory increases, where tasks in high-risk stages receive extra memory before failing, further reducing retries and cluster overhead.
