In a recent blog post, Pinterest Engineering detailed its approach to addressing network throttling challenges encountered while operating on Amazon EC2 instances. For a platform serving over 550 million monthly active users, consistent performance is paramount, especially for critical services such as its machine learning feature store, KVStore.
Pinterest observed increased latency and occasional service disruptions in KVStore, particularly during periods of high traffic. These issues often led to application timeouts and cascading failures, adversely affecting user engagement on features like the Homefeed. The root cause was traced to network performance limitations inherent in certain EC2 instance types, which offer “up to” a specified bandwidth. For example, an instance labeled “up to 12.5 Gbps” might have a significantly lower baseline bandwidth and rely on burst capacity that is not guaranteed. When network usage exceeded the baseline, packets were delayed or dropped, degrading application performance.
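The gap between advertised and baseline bandwidth can be made concrete with a little arithmetic. The following is a minimal sketch, not Pinterest's tooling: the function names are hypothetical, and the 3.0 Gbps baseline is an assumed figure chosen for illustration rather than a published value for any instance type.

```python
def throughput_gbps(bytes_start: int, bytes_end: int, interval_s: float) -> float:
    """Average throughput in Gbps between two samples of an interface byte counter."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    bits = (bytes_end - bytes_start) * 8
    return bits / interval_s / 1e9


def exceeds_baseline(observed_gbps: float, baseline_gbps: float) -> bool:
    """True when sustained traffic depends on burst capacity rather than the baseline."""
    return observed_gbps > baseline_gbps


# Hypothetical sample: 30 GB moved in 60 seconds on an instance advertised as
# "up to 12.5 Gbps" whose baseline is ASSUMED to be 3.0 Gbps.
observed = throughput_gbps(0, 30_000_000_000, 60.0)
print(f"{observed:.1f} Gbps, above baseline: {exceeds_baseline(observed, 3.0)}")
```

In this example the traffic is well under the advertised figure yet still above the assumed baseline, which is the situation in which throttling can begin once burst credits run out.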
In 2024, Pinterest began migrating to AWS’s Nitro-based instance families, for example moving from i3 to i4i instances, aiming for improved performance. However, this shift introduced new challenges. During bulk data uploads from Amazon S3 to their wide-column databases, they observed significant performance degradation, particularly in read latencies, resulting in application timeouts. These findings prompted a temporary halt to the migration of over 20,000 instances.
With improved visibility into their network performance, Pinterest implemented several key strategies to mitigate EC2 network throttling. One of the primary approaches was selecting EC2 instances with higher baseline network bandwidth to better support their workloads, moving away from instances that only promised burstable performance. They also introduced traffic shaping techniques to regulate data flow and ensure network usage stayed within optimal thresholds.
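The article does not say which traffic shaping mechanism Pinterest used. One common technique is a token bucket, sketched below under that assumption: the class name, the parameters, and the injectable clock are all hypothetical choices made for illustration, and this is not a description of Pinterest's implementation.

```python
import time


class TokenBucket:
    """Token-bucket shaper: permits bursts up to `capacity` bytes,
    with tokens refilled at `rate` bytes per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate          # sustained bytes per second (assumed baseline)
        self.capacity = capacity  # largest burst allowed, in bytes
        self.clock = clock        # injectable so the bucket can be tested
        self.tokens = capacity    # start with a full bucket
        self.last = clock()

    def try_send(self, nbytes: int) -> bool:
        """Consume tokens and return True if `nbytes` may be sent now."""
        now = self.clock()
        # Refill for the elapsed time, never beyond the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False
```

A sender would call `try_send` before each write and delay or queue the data when it returns `False`, which keeps sustained throughput at or below `rate` while still allowing short bursts.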
In addition, Pinterest distributed workloads more evenly across multiple instances, reducing the risk of overloading any single resource. These combined efforts significantly enhanced the reliability and stability of their systems, effectively minimizing latency spikes and preventing the kind of service disruptions that had previously impacted user experience.
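How the workloads were rebalanced is not described in the article. As a rough illustration of the idea, the sketch below uses a greedy least-loaded assignment: each task, largest first, goes to whichever instance currently carries the least load. The function name and the task sizes are hypothetical.

```python
import heapq


def assign_least_loaded(task_sizes, instance_ids):
    """Greedy balancing: give each task (largest first) to the least-loaded instance."""
    heap = [(0, inst) for inst in instance_ids]  # entries are (current load, instance id)
    heapq.heapify(heap)
    assignment = {inst: [] for inst in instance_ids}
    for size in sorted(task_sizes, reverse=True):
        load, inst = heapq.heappop(heap)          # instance with the smallest load
        assignment[inst].append(size)
        heapq.heappush(heap, (load + size, inst))
    return assignment


# Hypothetical example: five upload tasks spread over two instances.
plan = assign_least_loaded([8, 7, 6, 5, 4], ["a", "b"])
print({inst: sum(sizes) for inst, sizes in plan.items()})
```

The greedy approach does not guarantee a perfectly even split, but it avoids concentrating the largest transfers on a single instance.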
Pinterest’s experience underscores the importance of understanding the nuances of cloud infrastructure, particularly the implications of network bandwidth limitations on EC2 instances. By proactively monitoring and adjusting their infrastructure, they successfully navigated the challenges of network throttling, ensuring a smoother experience for their vast user base.