The software industry largely relies on integration tests and traditional alerting for issue prevention and detection. This lags behind the internal standard set by top tech companies, which use sophisticated techniques to detect issues long before they can impact a large set of customers.
Detecting an issue earlier means you minimize or even eliminate customer impact. In this article, you will learn about some of the most effective issue-detection techniques and when seasoned tech companies deploy them.
Synthetic Monitoring
If you only have a few customers, it’s hard to measure whether the product is working correctly. Suppose your website receives only a few transactions per hour. If transactions drop to zero, you can’t tell whether the product is broken or whether customers simply aren’t active at that time.
Synthetic monitoring is the solution to this problem. Synthetic monitoring is the usage of a simulated customer to probe a system and measure the results. A probe should execute all steps just like a real customer and should run at a fixed frequency. Your monitoring system should alert you if synthetic failures breach a certain threshold.
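As a rough illustration, the alerting side of a synthetic monitor might look like the sketch below. The class name, the window size, and the 30% failure threshold are all assumptions made for the example, and the probe is any function you supply that walks through the customer journey and returns whether it succeeded.

```python
from collections import deque
from typing import Callable, Deque


class SyntheticMonitor:
    """Tracks recent probe outcomes and decides when to alert (illustrative sketch)."""

    def __init__(
        self,
        probe: Callable[[], bool],
        window: int = 10,               # assumed: number of recent runs to consider
        max_failure_rate: float = 0.3,  # assumed: alert above 30% failures
    ) -> None:
        self.probe = probe
        self.results: Deque[bool] = deque(maxlen=window)
        self.max_failure_rate = max_failure_rate

    def run_once(self) -> bool:
        """Run the probe once; an exception counts as a failed run."""
        try:
            ok = bool(self.probe())
        except Exception:
            ok = False
        self.results.append(ok)
        return ok

    def failure_rate(self) -> float:
        """Fraction of failed runs within the current window."""
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def should_alert(self) -> bool:
        """True when failures in the window breach the threshold."""
        return self.failure_rate() > self.max_failure_rate
```

In practice a scheduler (a cron job, for example) would call run_once at the fixed frequency and page someone when should_alert returns True.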
Synthetic monitoring is arguably the single most important issue detection and prevention mechanism you can have – more essential than even integration or end-to-end tests. This is because the best way to validate the end experience of real customers is by setting up synthetic customers.
Canary Testing
Canaries are more sensitive to carbon monoxide than humans are, so they were used in coal mines to detect whether the gas was present. If a canary became sick or died, the miners would evacuate the mine. For an internet service, this is paralleled by the practice of exposing a new feature or change to a small set of users (a ‘canary’), and monitoring whether those users are adversely affected by the change.
Canary testing requires you to measure the experience of the canary users for deviations in latency and availability. If the metrics degrade for the canary, you should automatically abort and roll back the change. If no degradation is observed, the change should automatically be rolled out to a progressively larger set of customers.
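The decision at each step of a canary rollout can be sketched as follows. This is a minimal example rather than how any particular rollout tool works: the stage percentages, the tolerance values, and the convention of returning 0 to mean "abort and roll back" are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    p99_latency_ms: float
    availability: float  # fraction of successful requests, 0.0 to 1.0


# Hypothetical rollout stages, as a percentage of users exposed to the change.
ROLLOUT_STAGES = [1, 5, 25, 50, 100]


def canary_healthy(
    baseline: Metrics,
    canary: Metrics,
    max_latency_increase: float = 0.10,    # assumed: allow up to 10% slower
    max_availability_drop: float = 0.001,  # assumed: allow up to 0.1 points lower
) -> bool:
    """Compare canary users against everyone else on latency and availability."""
    latency_ok = canary.p99_latency_ms <= baseline.p99_latency_ms * (1 + max_latency_increase)
    availability_ok = canary.availability >= baseline.availability - max_availability_drop
    return latency_ok and availability_ok


def next_stage(current_percent: int, baseline: Metrics, canary: Metrics) -> int:
    """Return the next rollout percentage, or 0 to signal abort and roll back."""
    if not canary_healthy(baseline, canary):
        return 0
    for stage in ROLLOUT_STAGES:
        if stage > current_percent:
            return stage
    return 100  # already fully rolled out
```

Comparing the canary against the rest of the fleet over the same time window, rather than against a fixed number, helps avoid false alarms when traffic patterns shift for unrelated reasons.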
Canary testing should be one of the pillars of your rollout process, but it may not be particularly helpful when your customer traffic is very low. That said, most modern rollout tools support canary testing, so it’s still a good idea to set it up early on.
Shadow Testing
Imagine you’re a food delivery app and are launching a new algorithm for driver selection that improves delivery times. This is a business-critical functionality, so you want to be very cautious.
In shadow testing, the new algorithm runs alongside the original one for a subset of traffic, but the original algorithm continues to be used for driver selection. You log the results of both algorithms and compare them. If the two algorithms disagree a majority of the time, you should probably investigate whether the new algorithm is selecting appropriate delivery drivers.
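A minimal sketch of this pattern is shown below. The function and parameter names are hypothetical, and treating a crash in the new algorithm as a disagreement is a design choice made for the example, not a rule.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

Request = TypeVar("Request")
Result = TypeVar("Result")


def shadow_run(
    requests: Iterable[Request],
    primary: Callable[[Request], Result],
    candidate: Callable[[Request], Result],
) -> Tuple[List[Result], float]:
    """Serve every request from `primary` while running `candidate` in the shadow.

    Returns the results actually served and the fraction of requests on which
    the two algorithms disagreed.
    """
    served: List[Result] = []
    disagreements = 0
    for request in requests:
        result = primary(request)  # only this result is ever used for real
        served.append(result)
        try:
            shadow_result = candidate(request)
        except Exception:
            # Assumption: a failure in the shadow path counts as a disagreement
            # and must never affect what the customer sees.
            disagreements += 1
            continue
        if shadow_result != result:
            disagreements += 1
    rate = disagreements / len(served) if served else 0.0
    return served, rate
```

A real system would usually run the candidate asynchronously and log each pair of results for later analysis, so that the shadow path adds no latency to the request being served.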
Shadow testing is a great tool to use when your product has so many usage permutations that validating them through conventional tests is impossible. In our delivery example, it is not possible to replicate all the nuances of the real world, and so shadow testing comes to the rescue.
Notice that shadow testing doesn’t prove that the new algorithm has better delivery times. Once you’ve validated that the new algorithm is giving ‘reasonable’ outputs, you should do an A/B test to confirm that it actually improves delivery times.
Automated Load Testing
The worst moment to find out that you can’t handle scale… is when you need to handle scale. A standard practice for high-traffic products is an automated load test that runs before any change goes to production.
This involves configuring a load test environment that can mimic production and generate synthetic traffic that pushes the system to its limits. You then need to define the success parameters for the load test. A principled way to do this is to baseline the resultant metrics (latency, availability) against those of an older, successful load test. This will also help you catch changes that cause measurable degradation in system performance.
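One way to express that baseline comparison is sketched below. The data layout, the nearest-rank percentile method, and the tolerance values are assumptions for illustration; a real load test tool will report these metrics in its own format.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty set of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


@dataclass
class LoadTestRun:
    latencies_ms: List[float]  # latency of each successful request
    failed_requests: int

    @property
    def availability(self) -> float:
        total = len(self.latencies_ms) + self.failed_requests
        return len(self.latencies_ms) / total if total else 0.0


def regressions(
    baseline: LoadTestRun,
    current: LoadTestRun,
    latency_tolerance: float = 0.10,       # assumed: up to 10% slower is acceptable
    availability_tolerance: float = 0.001,  # assumed: up to 0.1 points lower is acceptable
) -> List[str]:
    """List the ways `current` is measurably worse than the baseline run."""
    problems: List[str] = []
    base_p99 = percentile(baseline.latencies_ms, 99)
    cur_p99 = percentile(current.latencies_ms, 99)
    if cur_p99 > base_p99 * (1 + latency_tolerance):
        problems.append(f"p99 latency rose from {base_p99} ms to {cur_p99} ms")
    if current.availability < baseline.availability - availability_tolerance:
        problems.append(
            f"availability fell from {baseline.availability:.4f} to {current.availability:.4f}"
        )
    return problems
```

An empty list means the change passes; anything else would block the rollout until someone investigates.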
Conclusion
The software industry has long relied upon the ‘test pyramid’, which has unit tests at the bottom and integration or end-to-end tests at the apex. The test pyramid is hopelessly outdated in a world where most systems are built out of microservices, and most software is accessed through the internet. In the modern context, the practical utility of techniques such as synthetic monitoring or canary testing is far greater than that of integration tests.
In this article, we’ve explored some of the most effective techniques for catching issues early, though it’s certainly not an exhaustive list. Depending on the nature of your system, you may also wish to evaluate anomaly detection, failure injection, or spike testing. Most importantly, leverage every outage to investigate what detection mechanisms you’re lacking, and fix the gaps you find.