Yellowbrick, a SQL data platform provider, has significantly reduced costs by moving workloads from the public cloud to its own private Kubernetes-based infrastructure. It has reported an annual saving of $3.9 million by moving its development and testing environments away from AWS, Azure, and Google Cloud Platform.
According to Neil Carson of Yellowbrick, the company had been spending about $6 million per year across the three major cloud providers when the repatriation project began in 2022. Yellowbrick built its new private cloud from hardware that had previously served another purpose within the company. The company traditionally sells appliances on which to run its database product, and it realised it could reuse the older servers that customers returned when they upgraded.
We thought elasticity in the cloud had to be cheaper than building appliances, but we found out the hard way, it wasn’t cheaper. It was much more expensive.
– Neil Carson, Yellowbrick
The private cloud, named EC3 (Emerald City), uses two types of racks: compute racks and object storage racks. It employs MinIO for object storage and LINSTOR for persistent block storage, all running on a complex network built on InfiniBand.
The current EC3 deployment consists of over 200 servers returned by Yellowbrick’s customers, providing over 8,000 vCPUs and about 2 petabytes of object storage. The primary ongoing cost is $50,000 per month in colocation facility fees in Utah. In comparison, Carson estimates that equivalent capacity on AWS would cost around $375,000 per month.
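The monthly figures above can be checked against the reported $3.9 million annual saving with a short calculation. This is a minimal sketch that uses only the two numbers quoted in the article; the constant and function names are illustrative, and staffing and hardware replacement costs are deliberately left out.

```python
# Monthly figures quoted in the article, in US dollars.
COLOCATION_MONTHLY = 50_000        # EC3 colocation fees in Utah
AWS_EQUIVALENT_MONTHLY = 375_000   # Carson's estimate for equivalent AWS capacity


def annual_saving(cloud_monthly: int, private_monthly: int) -> int:
    """Return the yearly difference between two monthly run rates."""
    return (cloud_monthly - private_monthly) * 12


def cost_ratio(cloud_monthly: int, private_monthly: int) -> float:
    """Return how many times more expensive the cloud run rate is."""
    return cloud_monthly / private_monthly


if __name__ == "__main__":
    saving = annual_saving(AWS_EQUIVALENT_MONTHLY, COLOCATION_MONTHLY)
    ratio = cost_ratio(AWS_EQUIVALENT_MONTHLY, COLOCATION_MONTHLY)
    print(f"Annual saving: ${saving:,}")
    print(f"Cloud cost is {ratio:.1f}x the colocation fee")
```

On these inputs the annual difference comes to $3,900,000, which matches the saving Yellowbrick reports.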
Image courtesy of Neil Carson, Yellowbrick
The transition wasn’t without its challenges. The initial implementation needed dedicated focus, with the equivalent of two full-time engineers working on it for six months. Now that the cloud is live, ongoing administration takes a couple of developers a few hours of maintenance each week. The company reports that hardware failures occur roughly every couple of weeks, and it plans to implement automated problem detection in response.
While Yellowbrick’s situation is unusual due to the availability of returned paid-off hardware at effectively zero capital cost, Carson suggests that companies starting from scratch could still see substantial savings. He estimates that the initial capital expenditure for new equipment would be approximately $1.65 million, including $1.3 million for compute, $80,000 for switching and cables, and $270,000 for SSD storage.
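To put the capital figure in context, the sketch below sums the three line items and estimates how long the monthly saving would take to cover them. The payback estimate is an assumption of this example and not a calculation Carson makes: it supposes a company starting from scratch would see the same $50,000 colocation fee and $375,000 AWS-equivalent cost quoted for EC3, and it ignores staffing, financing, and hardware replacement.

```python
# Capital expenditure estimates quoted in the article, in US dollars.
CAPEX = {
    "compute": 1_300_000,
    "switching_and_cables": 80_000,
    "ssd_storage": 270_000,
}

# Assumption: a new build sees the same monthly figures reported for EC3.
ASSUMED_CLOUD_MONTHLY = 375_000
ASSUMED_COLOCATION_MONTHLY = 50_000


def total_capex(items: dict[str, int]) -> int:
    """Sum the capital line items."""
    return sum(items.values())


def payback_months(capex: int, cloud_monthly: int, private_monthly: int) -> float:
    """Months of run-rate savings needed to cover the up-front spend."""
    monthly_saving = cloud_monthly - private_monthly
    if monthly_saving <= 0:
        raise ValueError("No monthly saving, so the spend is never recovered")
    return capex / monthly_saving


if __name__ == "__main__":
    capex = total_capex(CAPEX)
    months = payback_months(capex, ASSUMED_CLOUD_MONTHLY, ASSUMED_COLOCATION_MONTHLY)
    print(f"Total capex: ${capex:,}")
    print(f"Estimated payback: {months:.1f} months")
```

The line items add up to the $1.65 million total in the article. Under the stated assumptions the payback lands at roughly five months, though a real business case would need to include the engineering time and other costs omitted here.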
The success of EC3 has influenced Yellowbrick’s product development. Its third-generation appliances, codenamed Griffin, will be powered by Red Hat OpenShift and incorporate lessons learned from building the private cloud infrastructure. The company’s experience suggests that Kubernetes has become a game-changer in the infrastructure space. As Carson explains, “All the elasticity, scale up/down, and flexibility that used to require the public cloud can now be done on equipment we own, too.”
Carson acknowledges that advocating for on-premises solutions remains controversial in the tech industry. “When I’ve heard others trying to share these viewpoints, they have been accused of lying, not calling out hidden costs, being backward, etc.,” he writes. However, he maintains that for compute-intensive workloads, the financial benefits of repatriating are clear.
The cloud repatriation trend came to wider attention with the well-publicised move of David Heinemeier Hansson’s HEY and Basecamp infrastructure out of the cloud in 2022, which demonstrated that consistent, non-spiky, CPU-intensive workloads can often be run at a significant discount on a company’s own equipment.
A recent blog post from Puppet details why some organisations choose to move some workloads away from the public cloud. Cost is still a primary driver, with organisations seeing increased costs across computing, storage, and data transfer. Security and compliance are also big factors, with some choosing to repatriate workloads to avoid having to answer complex questions about data storage locations, access controls, and regulatory compliance across multiple cloud environments.
Image courtesy of Puppet
Puppet also explains how performance limitations have prompted some companies to look away from public clouds, especially for workloads needing low latency. In addition, some organisations are concerned about the risk of vendor lock-in and don’t want to depend on specific providers for their software, platform, or infrastructure needs.
The Puppet article also raises technical challenges unique to the public cloud – such as accidental misconfigurations potentially having drastic consequences (citing a 2022 incident where a single AWS S3 bucket misconfiguration exposed millions of sensitive files).
Although Yellowbrick had the significant advantage of a pool of returned servers to repurpose, the success of its private cloud implementation adds to a growing body of evidence that some companies, particularly those with predictable compute-intensive workloads, may benefit from evaluating alternatives to public cloud services.