Netflix recently described how it uses eBPF to accurately attribute flow IP addresses to their corresponding workload identities. After rolling out the new attribution method, Netflix examined the flow logs of Zuul, its cloud gateway, and found no misattribution over a two-week window.
Cheng Xie, Bryan Shultz, and Christine Xu from the Netflix engineering team elaborated on how they eliminated the misattribution issue in a blog post. To set the context of the challenge, in cloud environments like Netflix’s, IP addresses are frequently reassigned as services start and shut down. Initially, Netflix used Sonar, which sent IP address change notifications to a backend service called FlowCollector.
However, at the scale of Netflix’s distributed systems, these notifications were sometimes delayed or lost, leading to incorrect service identification.
FlowCollector collects flow logs from FlowExporter, a sidecar that runs alongside all workloads in the AWS Cloud. It then attributes the IP addresses in those flows and sends the attributed flows to Netflix’s Data Mesh for further stream and batch processing. Even after a 15-minute hold was added before attribution to allow for delayed IP address change events, misattribution persisted and produced incorrect workload dependencies.
Over the last year, Netflix has developed a new method for dealing with attribution issues related to local, remote, cross-regional, and non-workload IP addresses.
Regarding local IP address attribution, with EC2 instances, FlowExporter reads service identity information directly from local disks. However, when it comes to containerized applications running on Titus (Netflix’s container platform), Netflix’s engineering team utilized a tool called IPMan.
IPMan creates mappings between IP addresses and service IDs in a specific data structure that FlowExporter can access. For the translation issue between IP address versions (IPv6 and IPv4), the team modified their platform to track combinations of local addresses and ports to maintain correct service identification.
Source: How Netflix Accurately Attributes eBPF Flow Logs
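The address-and-port tracking described above can be sketched roughly as follows. This is a minimal illustration, not Netflix's implementation: the function name, the two dictionaries, and the sample addresses are all hypothetical, and the sketch assumes that an (address, port) pair is consulted first, with a fallback to the address alone.

```python
from typing import Dict, Optional, Tuple


def resolve_local_identity(
    ip: str,
    port: int,
    by_ip_port: Dict[Tuple[str, int], str],
    by_ip: Dict[str, str],
) -> Optional[str]:
    """Return the workload identity for a local socket, or None if unknown.

    Hypothetical lookup order: the (ip, port) pair is checked first, because
    an address alone may be ambiguous after IPv6-to-IPv4 translation; the
    plain address is used only as a fallback.
    """
    identity = by_ip_port.get((ip, port))
    if identity is not None:
        return identity
    return by_ip.get(ip)


# Illustrative data only: two workloads behind the same translated address.
by_ip_port = {
    ("100.64.0.1", 8080): "workload-a",
    ("100.64.0.1", 9090): "workload-b",
}
by_ip = {"10.1.2.3": "workload-c"}
```

Keying on the pair lets two workloads that appear behind the same address still resolve to different identities, while unambiguous addresses continue to resolve by address alone.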
For remote IP attribution, FlowCollector uses flows whose local IPs have already been identified to learn when each service owns a specific IP address. Each FlowCollector node maintains an in-memory lookup table that maps IP addresses to time periods with associated service identities.
These time periods are shared between FlowCollector nodes using Kafka. This approach requires regular confirmations of IP address ownership, making it resilient to temporary issues. Incoming flow data is held for one minute before remote IP identification so that the latest ownership information is available.
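The lookup table of IP addresses and time periods can be sketched as below. This is a simplified, single-process illustration under stated assumptions, not the FlowCollector code: the class and field names are hypothetical, the 60-second `max_gap` is an invented value, observations for a given IP are assumed to arrive in roughly increasing time order, and the Kafka sharing and one-minute buffering are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Ownership:
    workload: str
    start: float  # first time the workload was seen using the IP (epoch seconds)
    end: float    # latest time the workload was confirmed using the IP


@dataclass
class IpOwnershipTable:
    # Hypothetical threshold: a confirmation arriving later than this after
    # the previous one starts a new time range instead of extending the old one.
    max_gap: float = 60.0
    ranges: Dict[str, List[Ownership]] = field(default_factory=dict)

    def observe(self, ip: str, workload: str, ts: float) -> None:
        """Record that `workload` was seen using `ip` at time `ts`."""
        history = self.ranges.setdefault(ip, [])
        if history:
            last = history[-1]
            same_owner = last.workload == workload
            if same_owner and ts >= last.start and ts - last.end <= self.max_gap:
                last.end = max(last.end, ts)  # extend the current range
                return
        history.append(Ownership(workload, ts, ts))

    def lookup(self, ip: str, ts: float) -> Optional[str]:
        """Return the workload that owned `ip` at `ts`, or None if unconfirmed."""
        for r in reversed(self.ranges.get(ip, [])):
            if r.start <= ts <= r.end:
                return r.workload
        return None


table = IpOwnershipTable(max_gap=60.0)
table.observe("10.0.0.5", "service-a", 100.0)
table.observe("10.0.0.5", "service-a", 130.0)
table.observe("10.0.0.5", "service-b", 400.0)
table.observe("10.0.0.5", "service-b", 420.0)
```

In this sketch, a flow timestamped between the two ranges (for example at 250) is left unattributed rather than assigned to the last known owner, which mirrors the idea that ownership must be regularly confirmed.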
The tech community on Hacker News took notice of this approach. This thread had responses from HN users suggesting other relevant network monitoring and log management tools such as Kubenetmon, Coroot-node-agent, and Retina.
We also saw this interesting thread on the same HN post, where the classic trade-off between using managed services for convenience, such as AWS CloudWatch, and building custom solutions for cost optimization was debated.
In conclusion, the Netflix team mentioned that the previous approach had a misattribution rate of approximately 40%, which the new approach eliminated. The effectiveness was confirmed by analysing the flow logs of Netflix’s cloud gateway, Zuul.