Key Takeaways
- The Kubernetes Horizontal Pod Autoscaler (HPA) reacts to lagging metrics, which can hurt edge performance; a custom autoscaler can achieve more stable scale-up and scale-down behavior by evaluating multiple, domain-specific signals.
- Pod startup time should be factored into autoscaling logic: reacting only after CPU spikes occur delays scale-up and degrades performance.
- Safe scale-down policies and a cooldown window are necessary to prevent replica oscillations, especially when high-frequency metric signals are being used.
- Engineers should maintain CPU headroom when autoscaling edge workloads to absorb unpredictable bursts without latency impact.
- Latency SLOs (p95 or p99) are powerful early indicators of overload and should be incorporated into autoscaling decisions alongside CPU.
Over the past ten years, Kubernetes has evolved into one of the foundational platforms underlying modern IT infrastructure. Its extensible architecture and declarative model for defining resources let organizations manage large-scale, highly distributed, container-based workloads while automating a wide variety of operational tasks.
However, although many organizations run their Kubernetes deployments in cloud environments with effectively unlimited processing power, a significant shift toward edge computing is creating new operational requirements for Kubernetes users.
Edge computing involves running applications on devices or servers located close to where data is generated, rather than in a centralized cloud. Applications running at the edge must meet extremely low-latency requirements, be highly elastic, and perform predictably when subjected to large and unpredictable spikes in workload volume.
Because edge applications have limited processing capacity, memory, and network bandwidth, it is critical to use these resources efficiently and to scale edge applications rapidly in order to maintain both the quality of experience for end users and the reliability of services.
Kubernetes includes the Horizontal Pod Autoscaler (HPA) capability to dynamically adjust the number of pods within a deployment based on the current level of usage, including CPU, memory, and custom-defined metrics.
HPA is effective at reacting to observed traffic patterns in cloud environments; however, it is significantly less effective in managing the dynamic, bursty nature of edge workloads, where pod-scaling alternatives such as KEDA or custom autoscalers may be more suitable.
Figure 1: Horizontal pod autoscaler working
HPA’s rigidity, its dependency on lagging metrics, and its lack of contextual awareness commonly cause pod counts to scale too much, scale too little, or oscillate repeatedly. These behaviors can be expensive, and even dangerous, in resource-constrained environments.
I built an autoscaler using Custom Pod Autoscaler (CPA) for edge computing to address some of the limitations of HPA. CPA provides engineers the ability to develop their own algorithms, use combinations of multiple metrics, react quickly to changes in system state, and adjust the way they scale depending on the characteristics of the applications running on their clusters.
The Limits of Kubernetes HPA in Edge Scenarios
The HPA determines the desired number of replicas for an application using a simple proportional formula:
desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)
This formula is hard-coded into Kubernetes. Engineers can neither override it, tune its aggressiveness, nor add domain-specific logic without building an entirely new autoscaler for their environment.
While this is suitable for many cloud-based applications, it can be problematic for edge applications, where latency sensitivity and variable traffic require more adaptive behavior.
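As a minimal sketch, the proportional rule can be reproduced in a few lines of Python (the metric values here are illustrative):

```python
import math

def hpa_desired_replicas(current_replicas: int,
                         current_metric: float,
                         desired_metric: float) -> int:
    """HPA's proportional rule: scale replicas by the ratio of
    observed to target metric value, rounded up."""
    return math.ceil(current_replicas * current_metric / desired_metric)

# Example: 4 replicas averaging 90% CPU against a 60% target
print(hpa_desired_replicas(4, 90, 60))  # → 6
```

Because the ratio is applied directly, a transient metric spike translates immediately into a proportional jump in replicas, which is exactly the behavior that causes trouble with bursty edge traffic.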
Lack of Algorithm Flexibility
Workloads on Internet of Things (IoT) gateways and gaming edge servers do not typically grow in proportion to resource usage. An IoT gateway may see a tenfold increase in load for a short period (e.g., when it receives a flood of sensor events), whereas a gaming edge server requires additional instances before users join a game, not after CPU rises.
Time-aware and predictive logic cannot be encoded in HPA. Although autoscaling/v2 adds stabilization windows and scaling policies, HPA still evaluates each configured metric independently and takes the largest result; it cannot combine multiple signals into a single domain-specific decision.
Edge workloads commonly produce short-lived spikes that HPA handles poorly, such as:
- Registration bursts from devices
- Floods of user connections
- Spikes in media transcoding
- API surges resulting from gateway churn
HPA treats every burst as sustained load, scaling up far faster and further than a short spike requires.
Figure 2: Short spike indicating a burst
This overreaction wastes pods, consuming much-needed CPU and memory on smaller nodes and ultimately adding node stress, or even triggering eviction storms.
Operational Overhead of Custom Metrics
Kubernetes supports custom metrics as part of autoscaling/v2, but using them requires:
- A metrics server
- A custom metrics API
- Prometheus
- Exporters
- Adapters
While this architecture is well-established and widely used, it introduces additional resource overhead and operational complexity that could be challenging in some resource-constrained environments.
Figure 3: HPA scaling with custom metrics and metrics server
The figure depicts the autoscaling/v2 version of HPA with support for external metrics. In this example, the cluster is configured with the following dependencies:
- The custom metrics adapter (Istio) is deployed to query the metrics.
- The adapter registers itself as the “custom.metrics.k8s.io” API.
- The Prometheus community chart is then installed into the cluster.
- The Prometheus pod scrapes the pod endpoints to collect the metrics required for our workload.
- The Istio adapter deployed in the first step is configured to query the Prometheus instance running in the cluster.
- The query logic is written into the HPA manifest itself; in our case, it targets the number of requests per pod.
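As an illustration of such a manifest (the metric name and target value here are hypothetical, not the exact ones used in the cluster), an HPA targeting a per-pod request rate exposed through the custom metrics API might look like this:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: requests-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: testcpa
  minReplicas: 1
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # hypothetical metric name
        target:
          type: AverageValue
          averageValue: "100"              # target requests per pod
```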
By applying load, we can observe the scraped request counts in a Grafana dashboard and watch the scaling mechanism in action.
Edge Architecture Context
The general structure of an edge computing environment will typically be built around several autonomous edge worker nodes located physically near the end users, with a central controller in a remote cloud or data center.
Each edge node operates independently, handles a high volume of local traffic, and performs many traffic-management functions, so it needs scaling logic that reacts quickly, efficiently, and based on current conditions. Because backhaul bandwidth from the edge to the cloud is typically limited, round-trip communication can become very costly.
Examples of applications that may be run on an edge computing platform include:
- Gaming engine platforms
- Live video processing
- Low-latency computing for AR/VR
- Aggregation of IoT gateway data
- Inference for machine learning
- Caching/proxy services for localized content
Each of these applications exhibits a different scalability pattern than the one optimized by HPA’s traditional CPU-based algorithm.
Designing a Custom Pod Autoscaler
To overcome HPA’s small set of rigid scaling behaviors, the CPA was developed to address the following requirements:
- Allow for arbitrary metrics to be monitored by the system (e.g., CPU, latency, queue depth, custom KPIs).
- Decouple metric collection from the scaling algorithm.
- Allow for proactive scaling utilizing predictions or compensations.
- Prevent thrashing by enforcing safe down-scaling policies.
- Keep enough CPU headroom to absorb the bursty variability of edge workloads.
- Be able to respond quicker than HPA while maintaining stability.
Figure 4: CPA’s architecture
The CPA removes the constraints imposed by Kubernetes’ native autoscalers, allowing developers to define their own scaling logic while still relying on a scalable, Kubernetes-native controller.
The CPA Evaluation Algorithm
The evaluation step takes the metrics gathered during collection and uses them to make a scaling decision. An early prototype scaled replicas by fixed amounts as fixed CPU thresholds were crossed. While easy to implement, this approach neither represents how real-world autoscaling systems operate nor meets the operational requirements of edge workloads.
In contrast, the latest implementation includes an autoscaling model based on best practices from cloud service providers, game networking backends, and SRE teams with experience running large-scale, latency-sensitive platforms. The new model has replaced the rigid numeric thresholds with three primary workload condition signals: CPU headroom, latency SLO awareness, and pod startup compensation.
CPU Headroom
With enough CPU headroom, spikes in edge workloads can be absorbed without queuing or latency penalties. The autoscaler targets a utilization safety zone, usually between seventy and eighty percent, to maintain that buffer. If the average CPU usage across all pods stays above the headroom threshold, the autoscaler calculates how many additional replicas are needed to restore the margin.
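The headroom calculation can be sketched in a few lines (the 75% target is illustrative): the replica count needed to restore the margin follows directly from the utilization ratio.

```python
import math

CPU_HEADROOM_TARGET = 0.75  # keep average utilization near 75%

def replicas_for_headroom(current_replicas: int, avg_cpu_pct: float) -> int:
    """Return the replica count needed to bring average CPU
    back under the headroom target."""
    cpu_ratio = avg_cpu_pct / 100.0
    if cpu_ratio <= CPU_HEADROOM_TARGET:
        return current_replicas  # enough headroom already
    # Scale replicas by how far utilization exceeds the target
    return math.ceil(current_replicas * cpu_ratio / CPU_HEADROOM_TARGET)

# Example: 5 replicas at 90% average CPU → 6 replicas restores ~75%
print(replicas_for_headroom(5, 90))  # → 6
```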
Latency SLO Awareness
If the application is exporting latency data, such as the ninety-fifth percentile (p95) response time, the CPA will use that latency data to make scaling decisions. In cases where the p95 latency is approaching or exceeding the service-level objective (SLO) for latency (e.g., sixty milliseconds for an interactive edge workload), the autoscaler will increase the number of replicas proportionate to the degree of violation of the latency SLO. Thus, the autoscaler avoids using CPU as the sole metric for performance, which is especially critical in IO-intensive and mixed workloads.
Pod Startup Compensation
In contrast to the fast container startups typical of central cloud environments, edge nodes often exhibit longer startup times due to lower disk throughput or cold image caches. To address this, the autoscaler takes into account the estimated pod startup time (derived from local observations) and scales in anticipation of impending load. If CPU usage is rising quickly enough that demand will likely outstrip available capacity before new pods finish booting, the autoscaler triggers proactive scaling.
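A simplified sketch of this compensation (the growth-rate input and constants are assumptions, not the exact production logic): extrapolate CPU over the observed startup window, and if the projection crosses capacity, add replicas now so they are ready in time.

```python
import math

POD_STARTUP_SECONDS = 10  # locally observed cold-start time (assumed)

def proactive_replicas(current_replicas: int,
                       avg_cpu_pct: float,
                       cpu_growth_pct_per_sec: float) -> int:
    """Extrapolate CPU over the pod startup window; if the projection
    crosses 100%, size the fleet for the projected demand instead."""
    projected = avg_cpu_pct + cpu_growth_pct_per_sec * POD_STARTUP_SECONDS
    if projected <= 100:
        return current_replicas  # capacity will hold until pods could boot
    # Scale for where demand will be when the new pods become ready
    return math.ceil(current_replicas * projected / 100.0)

# Example: 4 replicas at 80% CPU, growing 3%/s → projected 110% → 5 replicas
print(proactive_replicas(4, 80, 3))  # → 5
```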
Together, these three signals produce a composite scaling recommendation that makes the CPA more context-aware than Kubernetes’ default HPA algorithm:
- If all signals are healthy, no scaling occurs.
- If one signal exceeds its threshold, the autoscaler performs a moderate scale-up.
- If two or more signals exceed their thresholds, the CPA performs a larger scale-up proportional to the severity of the signals.
Scaling down occurs slowly and requires a window of stability, avoiding the oscillations typical of HPA behavior.
Thus, the CPA is transformed from a simple “threshold monitoring” reactive autoscaler to a context-aware autoscaling engine that can effectively manage real-world edge workloads.
Implementation and Load Generation
The CPA is built on the open-source Custom Pod Autoscaler framework, a Kubernetes-native controller that lets you implement custom pod-scaling logic in Python.
The Custom Pod Autoscaler framework will handle communication with Kubernetes. Developers will provide two Python scripts:
- metric.py – collects the metrics
- evaluate.py – calculates the desired number of replicas
At a user-defined interval (default: every fifteen seconds), the CPA controller runs the metric script, pipes its JSON output into the evaluation script, and scales the target workload based on the result.
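The handoff between the two scripts is just JSON over stdin/stdout. A self-contained sketch of that pipeline, with an in-memory payload standing in for the real metric script, looks like this:

```python
import json

# What metric.py would print: a JSON document whose "value" field
# is itself a JSON-encoded string carrying the raw signals.
metric_stdout = json.dumps({
    "resource": "testcpa",
    "runType": "api",
    "metrics": [{
        "resource": "testcpa",
        "value": json.dumps({
            "current_replicas": 3,
            "avgcpu_utilization": 82.0,
            "p95_latency_ms": 45.0,
        })
    }]
})

# What evaluate.py does first: parse the outer document,
# then decode the inner value string.
spec = json.loads(metric_stdout)
signals = json.loads(spec["metrics"][0]["value"])
print(signals["current_replicas"], signals["avgcpu_utilization"])  # → 3 82.0
```

The double encoding (a JSON string inside a JSON document) is why the evaluation script below calls `json.loads` twice.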
CPA Config File
Each custom autoscaler is configured with a config.yaml that defines the metric source, evaluation logic, target workload, scaling limits, and execution interval:
```yaml
name: cpautilization
namespace: default
interval: 10000
metricSource:
  type: python
  pythonScript: /cpa/metric.py
evaluation:
  type: python
  pythonScript: /cpa/evaluate.py
target:
  kind: Deployment
  name: testcpa
limits:
  minReplicas: 1
  maxReplicas: 20
```
Metric Gathering Script
The metric script retrieves metrics such as CPU usage, latency, or custom signals. In our implementation, Prometheus collects CPU and p95 latency.
```python
from cpa import metrics
import json

def main():
    cpu = metrics.get_average_cpu_utilization("testcpa")
    replicas = metrics.get_current_replicas("testcpa")
    latency = metrics.get_custom_metric("service_latency_p95_ms")

    output = {
        "resource": "testcpa",
        "runType": "api",
        "metrics": [{
            "resource": "testcpa",
            "value": json.dumps({
                "current_replicas": replicas,
                "avgcpu_utilization": cpu,
                "p95_latency_ms": latency
            })
        }]
    }
    print(json.dumps(output))

if __name__ == "__main__":
    main()
```
Evaluation Script
The evaluation script calculates the desired number of replicas based on CPU headroom, latency SLOs, pod startup time, and safe scale-up and scale-down constraints.
```python
import json
import math
import sys
import time

# PARAMETERS
CPU_HEADROOM_TARGET = 0.75   # Keep CPU at ~75% average usage
LATENCY_SLO_MS = 60          # Example latency SLO for interactive workloads
SCALE_UP_FACTOR = 1.3        # Increase replicas by up to 30% when overloaded
MAX_SCALE_UP_STEP = 4        # Never add more than 4 pods at once
SCALE_DOWN_FACTOR = 0.8      # Reduce replicas slowly
MIN_STABLE_SECONDS = 30      # 30 seconds of stable metrics before scaling down
POD_STARTUP_SECONDS = 10     # Expected cold-start time

last_scale_time = 0
SCALE_COOLDOWN = 15

def output(replicas):
    # Emit the decision as JSON for the CPA controller
    print(json.dumps({"targetReplicas": replicas}))

def main():
    spec = json.loads(sys.stdin.read())
    evaluate(spec)

def evaluate(spec):
    global last_scale_time
    if len(spec["metrics"]) != 1:
        sys.stderr.write("Expected 1 metric")
        exit(1)

    eval_metric = json.loads(spec["metrics"][0]["value"])
    current_replicas = eval_metric.get("current_replicas", 1)
    avg_cpu = eval_metric.get("avgcpu_utilization", 0)
    p95_latency = eval_metric.get("p95_latency_ms", None)  # Optional metric

    now = time.time()

    # Cooldown protection (avoid thrashing)
    if now - last_scale_time < SCALE_COOLDOWN:
        output(current_replicas)
        return

    # Always start with current replicas
    target_replicas = current_replicas

    # CPU HEADROOM LOGIC
    # Convert utilization to a ratio
    cpu_ratio = avg_cpu / 100.0
    if cpu_ratio > CPU_HEADROOM_TARGET:
        # Example: if CPU = 120%, scale by 120/75 = 1.6 ⇒ +60% replicas
        scale_multiplier = min(cpu_ratio / CPU_HEADROOM_TARGET, SCALE_UP_FACTOR)
        proposed = math.ceil(current_replicas * scale_multiplier)
        # Cap the step size
        step = min(proposed - current_replicas, MAX_SCALE_UP_STEP)
        target_replicas = current_replicas + max(step, 1)

    # LATENCY SLO LOGIC
    if p95_latency and p95_latency > LATENCY_SLO_MS:
        # Scale proportionally to the SLO violation
        violation_ratio = p95_latency / LATENCY_SLO_MS
        proposed = math.ceil(current_replicas * violation_ratio)
        step = min(proposed - current_replicas, MAX_SCALE_UP_STEP)
        target_replicas = max(target_replicas, current_replicas + step)

    # POD STARTUP COMPENSATION
    # Scale ahead of predicted load.
    if avg_cpu > 90 and POD_STARTUP_SECONDS > 0:
        predicted_load = current_replicas * (avg_cpu / 50)
        predicted_replicas = math.ceil(predicted_load)
        step = min(predicted_replicas - current_replicas, MAX_SCALE_UP_STEP)
        target_replicas = max(target_replicas, current_replicas + max(step, 1))

    # SCALE DOWN SAFELY
    # Only scale down when all metrics are well below their thresholds.
    if cpu_ratio < 0.40 and (not p95_latency or p95_latency < LATENCY_SLO_MS * 0.7):
        proposed = math.floor(current_replicas * SCALE_DOWN_FACTOR)
        target_replicas = max(1, proposed)

    if target_replicas != current_replicas:
        last_scale_time = now

    output(target_replicas)

if __name__ == "__main__":
    main()
```
Validation & Evaluation
The enhanced autoscaling logic improved behavior across all test scenarios. The headroom-based scaling logic proved more sophisticated and stable than the earlier prototype’s fixed CPU thresholds.
Stress over time
Under continuous load, the CPU headroom logic let the CPA scale smoothly, maintaining predictable utilization while avoiding unnecessary replica growth.
Short spikes
The CPA’s pod startup compensation and conservative scale-down rules prevented the overreactions to brief spikes that are typical of HPA.
Gradual load increase
Latency-aware scaling caused the autoscaler to detect performance degradations before hard CPU limits were reached, resulting in faster and more precise responses.
Random load patterns were generated by varying headroom levels, dynamically adjusting latency SLO thresholds, and applying cooldown intervals to simulate irregular, real-world traffic behavior.
Figure 5: Random load generated
Figure 6: Number of pods deployed
Compared to HPA, the CPA exhibited characteristics including:
- Lower amplitude of oscillation
- Fewer replicas launched
- Faster return to steady state
- More stable average latency
- Reduced CPU waste
Figure 7: HPA vs. CPA
Lessons Learned
The new auto-scale logic provides the following benefits that are particularly relevant to edge applications:
- One metric does not fit all. CPU is only one signal; latency and pod startup time also determine how a workload performs.
- Predictive scaling helps reduce unstable behavior. Using pod startup compensation to predict when a cluster may become saturated by future requests has helped reduce the number of sudden saturation events.
- Delaying scale-down pays off. Sudden scale-down can create oscillations in the system, which lead to a poor user experience; a slow scale-down produces the smooth behavior users expect.
- In edge environments, where compute and memory resources are limited, aggressive replica scaling by the Horizontal Pod Autoscaler (HPA) can have unintended side effects, such as memory pressure, pod evictions, and throttling.
- There is greater flexibility. The CPA’s architecture design separates the collection of metrics from the logic that uses those metrics. As applications continue to produce more advanced telemetry, this separation will allow the scaling logic to evolve independently of the metrics collected.
Conclusion
By replacing fixed threshold-based scaling with CPU headroom targets, latency-aware evaluations, and pod startup compensation, the Custom Pod Autoscaler (CPA) becomes a flexible and scalable solution. CPA gives engineers the freedom to build an autoscaling strategy that meets their applications’ performance requirements while addressing the limitations of Kubernetes HPA at the edge. A CPA-based strategy offers many benefits when implemented properly, but it also demands careful tuning, a disciplined operating culture, and high-quality metrics.
Therefore, while CPA can be used alongside autoscaling as provided natively by Kubernetes, a CPA-based strategy is most appropriate for applications where scaling affects both performance and user experience.
Recent developments in Kubernetes Event-Driven Autoscaling (KEDA) paired with Horizontal Pod Autoscaling (HPA) have greatly increased the options to scale based on events and external signals for a wide variety of workloads.
CPA is designed to operate within use cases where startup delays, resource limits, and multiple performance metrics are all factors in scaling decisions, which makes CPA particularly well-suited for edge environments. With proper tuning and monitoring, CPA represents an operational, scalable means for implementing predictable, efficient, and performance-sensitive autoscaling at the edge.
