Kubernetes Adds Predictable Pod Replacement for Jobs in v1.34 Release

News Room · Published 7 December 2025

Kubernetes has become the go-to platform for running not just long-lived services, but also batch workloads like data processing, ETL pipelines, machine learning training, CI/CD pipelines, and scientific simulations. These workloads typically rely on the Job API, which ensures that a specified number of Pods run to completion.

Until now, Kubernetes offered limited control over what happens when a Job’s Pod fails or is evicted. Replacement behavior was often unpredictable: would the replacement Pod be scheduled on the same node, a nearby node, or anywhere in the cluster?

With Kubernetes v1.34, the Pod Replacement Policy for Jobs feature, driven by KEP-3939, graduates to stable. It allows users to explicitly control when replacement Pods are created, improving the reliability, performance, and efficiency of batch workloads.

Why Pod Replacement Matters

When a Pod belonging to a Job fails (e.g., due to a node drain, eviction, OOM kill, or hardware issue), Kubernetes creates a replacement Pod. However:

  • The replacement may land anywhere in the cluster.
  • If the Pod had local data (e.g., cached dataset, scratch disk, node-local SSD), the replacement Pod may not find it.
  • If the Pod had NUMA or GPU locality, the replacement might end up with suboptimal hardware.
  • In multi-zone clusters, scheduling a replacement Pod across zones could increase latency and cross-zone costs.

For workloads that depend on node affinity or cached state, this can be a real problem.
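
To make the locality concern concrete, here is a minimal sketch of a Job whose Pods depend on node-local state; the disktype: local-ssd node label and the /mnt/scratch path are hypothetical:

```yaml
# sketch: a Job whose Pods depend on node-local scratch data
# (the node label and hostPath below are illustrative only)
apiVersion: batch/v1
kind: Job
metadata:
  name: cache-sensitive-job
spec:
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        disktype: local-ssd        # pin Pods to nodes with local SSDs
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "ls /cache; echo processing"]
        volumeMounts:
        - name: scratch
          mountPath: /cache
      volumes:
      - name: scratch
        hostPath:                  # node-local scratch space; lost if the
          path: /mnt/scratch       # replacement Pod lands on another node
          type: DirectoryOrCreate
```

If the replacement Pod lands on a different node, whatever the failed Pod left in /cache is simply gone.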

Current behavior:

By default, the Job controller creates a replacement Pod as soon as the old one starts terminating. This can lead to multiple Pods running for the same task at the same time, especially in Indexed Jobs, which breaks workloads that require exactly one Pod per task, such as certain machine learning frameworks.

Starting replacement Pods before the old ones have fully terminated also wastes cluster resources, since old and new Pods briefly run side by side.

Feature: Pod Replacement Policy

With this feature, Kubernetes Jobs have two Pod replacement policies to choose from:

  • TerminatingOrFailed (default): creates a replacement Pod as soon as the old one starts terminating.

  • Failed: waits until the old Pod is fully terminated and reaches the Failed phase before creating a new one. Using podReplacementPolicy: Failed ensures that only one Pod runs for a given task at a time.
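
Alongside the policy field, the Job status gained a terminating count (also part of KEP-3939), which makes the difference between the two policies directly observable. A quick check, assuming a Job named worker-job:

```shell
# number of this Job's Pods that are terminating but not yet fully removed
kubectl get job worker-job -o jsonpath='{.status.terminating}'
```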

Quick Demo: We will demo the Pod Replacement Policy for Jobs feature in both scenarios.

Scenario 1: Default Behavior (TerminatingOrFailed)

  1. Set up a local Kubernetes cluster (with minikube):

```shell
brew install minikube

# start local cluster
minikube start --kubernetes-version=v1.34.0
```

![fig: start kubernetes minikube server](https://miro.medium.com/v2/resize:fit:1400/1*NnzqlUegwVbO8gWy8eZBNA.png)

```shell
# verify cluster is running and reports kubernetes version v1.34.0
kubectl get nodes
```

![fig: check kubernetes nodes & version](https://miro.medium.com/v2/resize:fit:770/1*HaNAGHJ1gYfez8SO8mJk3w.png)

  2. Define the Job config with podReplacementPolicy: TerminatingOrFailed, apply it, and monitor the Pods:

```yaml
# worker-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: worker-job
spec:
  completions: 2
  parallelism: 1
  podReplacementPolicy: TerminatingOrFailed
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "echo Running; sleep 30"]
```

```shell
kubectl apply -f worker-job.yaml

# monitor pods are running
kubectl get pods -l job-name=worker-job
```

fig: monitor pods are running

  3. Delete the Job’s Pods manually and observe the behavior:

```shell
# delete pods associated with job: worker-job
kubectl delete pod -l job-name=worker-job
```

Behavior: the replacement Pod worker-job-qsnmp is created as soon as the old Pod worker-job-4vmtz starts terminating.
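
To watch the overlap happen live, you can follow the Pods from a second terminal before deleting; under TerminatingOrFailed the Terminating Pod and its replacement briefly appear together:

```shell
# in a second terminal: watch pod churn for this Job
kubectl get pods -l job-name=worker-job -w

# Job-level view: how many of its Pods are still terminating
kubectl get job worker-job -o jsonpath='{.status.terminating}'
```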

Scenario 2: Delayed Replacement with the Failed Policy

  1. Define the Job config with podReplacementPolicy: Failed, apply it, and monitor the Pods:

```yaml
# worker-job-failed.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: worker-job-failed
spec:
  completions: 2
  parallelism: 1
  podReplacementPolicy: Failed
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "echo Running; sleep 1000"]
```

```shell
kubectl apply -f worker-job-failed.yaml

# monitor pods are running
kubectl get pods -l job-name=worker-job-failed
```

fig: monitor pods are running

  2. Delete the Job’s Pods manually and observe the behavior:

```shell
# delete pods associated with job: worker-job-failed
kubectl delete pod -l job-name=worker-job-failed
```

fig: pod worker-job-failed-sg42q associated with job worker-job-failed deleted

Behavior: the replacement Pod worker-job-failed-q98qx is created only after the old Pod worker-job-failed-sg42q fully terminates; there is no overlap between the old and new Pods.
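
You can verify this from the Job itself while the old Pod is still shutting down; a minimal check against the Job above:

```shell
# during termination: the Job reports one terminating Pod ...
kubectl get job worker-job-failed -o jsonpath='{.status.terminating}'

# ... and only the Terminating Pod is listed, with no replacement yet
kubectl get pods -l job-name=worker-job-failed
```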

Benefits

  1. Improved Reliability: Jobs become more self-healing. A single Pod failure no longer risks halting an entire workload, which makes Kubernetes Jobs more trustworthy for critical processes.
  2. Reduced Operational Burden: Previously, operators often had to monitor Jobs manually or write custom controllers and scripts to handle Pod replacement. With this built-in capability, operational overhead is significantly reduced.
  3. Efficient Resource Utilization: Failed Pods that linger without progress waste CPU and memory. Controlled replacement ensures resources are recycled effectively.
  4. Better User Experience: For developers, running Jobs becomes less error-prone. Teams can focus on business logic instead of constantly monitoring for Pod failures.

Best Practices

  1. Tune restart policies: Use Never or OnFailure appropriately depending on workload characteristics.
  2. Monitor metrics: Use Prometheus/Grafana to track Pod replacement events (see the sketch after this list).
  3. Set resource requests/limits: Prevent unnecessary failures by properly sizing Pods.
  4. Validate failure thresholds: Configure the Job’s backoffLimit so replacements don’t turn into endless restart loops.
  5. Test in staging: Before deploying to production, simulate Pod failures in a staging cluster to observe replacement behavior.
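
For item 2, a lightweight starting point before wiring up Prometheus is to follow the Job’s events; KEP-3939 also added a job_pods_creation_total metric to the kube-controller-manager whose reason label distinguishes the two policies, if your monitoring stack scrapes it. A sketch, assuming the worker-job Job from the demo:

```shell
# follow create/delete events for the demo Job to see replacements happen
kubectl get events \
  --field-selector involvedObject.kind=Job,involvedObject.name=worker-job \
  --watch
```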

Use Cases

  1. Machine Learning Workloads: Training models can take hours or days, and Pod failures are inevitable. Automatic replacement ensures training Jobs continue without manual restarts, making ML pipelines more resilient (see the sketch after this list).
  2. Data Pipelines: ETL jobs or distributed data processing tasks often involve multiple pods running in parallel. Replacing failed pods ensures the pipeline completes successfully without operator intervention.
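
For the machine learning case, the Failed policy pairs naturally with Indexed Jobs, where each index must be held by exactly one worker at a time. A minimal sketch; the image and training command are placeholders:

```yaml
# sketch: an Indexed Job for distributed training where each index must be
# held by exactly one Pod at a time (image and command are placeholders)
apiVersion: batch/v1
kind: Job
metadata:
  name: trainer
spec:
  completionMode: Indexed        # one stable index per worker
  completions: 4
  parallelism: 4
  podReplacementPolicy: Failed   # never run two Pods for the same index
  backoffLimit: 8
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: example.com/trainer:latest   # placeholder image
        # the Job controller injects JOB_COMPLETION_INDEX for Indexed Jobs
        command: ["sh", "-c", "python train.py --rank=$JOB_COMPLETION_INDEX"]
```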

Takeaways

Pod Replacement Policy gives you control over Pod creation timing to avoid overlaps, optimizes cluster resources by preventing temporary extra Pods, and offers the flexibility to choose the right policy for your Job workloads based on your requirements and resource constraints.

Reference(s)

  • Kubernetes v1.34 release announcement: https://kubernetes.io/blog/2025/08/27/kubernetes-v1-34-release/
