The key for solving Rumsfeld problems in modern software
For a long time, ‘testing in production’ was a fancy way of saying that someone was irresponsible. It was a slur for organisations whose testing practices were so bad that critical bugs constantly bit actual users. As our industry is rapidly transforming into an API supermarket glued together by cloud platforms, testing in production is no longer derogatory; it’s becoming necessary. Teams that don’t have an effective way to test in production are now the irresponsible ones. And a key piece in making it all work is a hugely under-utilised gold mine of information that most people already have: production errors.
According to Forbes, more than 80% of enterprise workloads will run on some form of cloud by 2020. Of course, we can dispute the number, but the trend is indisputable. More and more of our work is running on someone else’s platform, integrating with someone else’s APIs. For software quality, this means that there is a much higher risk of someone else changing something completely outside of our control, even after our software goes live.
A few years ago, Netflix’s Chaos Monkey became famous for actively causing and exploring infrastructure risks in production environments. But the moving parts affect more than just infrastructure. The real fun starts when we consider impacts on key business workflows from components outside our control. For example, a recent Chrome release changed how the browser handled cookies from Google’s own authentication service for third parties, and blocked lots of people trying to use MindMup with Google Drive. We had deployed no changes to production for a few days, so this sudden surge of problems caught us a bit by surprise. Luckily, we had a tool in place to log and investigate client-side production errors, which at least gave us an early warning that something horrible had happened outside our control, and gave us enough information to quickly provide a workaround.
To contain problems outside of our control, it is critical to spot them quickly. For a major outage, people will start screaming at you via social media and sending angry emails. But smaller-scale problems are even trickier. If a similar problem affects just a small percentage of users, it can fly under the radar for a long time, and lead to a lot of user frustration and lost customers. The smaller the edge case, the more difficult it may be to reproduce. Because such problems happen in components outside of our zone of control, without any influence from us, they fall into the tricky category that Donald Rumsfeld famously called unknown-unknowns. Those are problems that we don’t know about, where we’re not even aware of the knowledge gap. The key to dealing with Rumsfeld problems is observability, so we can at least eliminate one category of not knowing. That’s where effective production tracing and error logging come in.
A recent survey of AWS customers published by Cloudability claims that AWS container usage grew almost 250% in a year, and that serverless adoption grew almost 700% in the same period. The more our software ends up depending on other people’s platforms and components, the more important it becomes to shine a light on unknown-unknown problems. It’s no wonder that there is a surge in monitoring, tracing and logging tools for cloud deployments today. Amazon X-Ray became available in April last year, followed by IOPipe in August. Something new pops up in that space almost weekly. For example, Thundra was announced to address a similar problem just last week. This whole space is rapidly evolving.
I recently ran a survey of testing practices for JavaScript apps. Roughly half of the respondents had no idea that automated tools for production error tracking even exist. Just for context, 88% of the people from that group used at least one test automation tool, so this is within a self-selected group of teams who take testing seriously. I would imagine that, looking at a wider industry spectrum, automated production error tracking is even less well known. Yet it’s an unexplored gold mine of information that could hugely improve the operability and observability of modern applications. Ironically, most web apps already have access to a basic error mining tool; they are just not using it. A little-known fact about Google Analytics is that it also supports exception tracking, which makes it easy to at least collect some basic aggregate information about errors.
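As a rough illustration of how little it takes to get started, here is a minimal TypeScript sketch that forwards uncaught errors to Google Analytics as `exception` events, using the `description` and `fatal` parameters from the gtag.js documentation. It assumes the page already loads the gtag.js snippet. The names `reportException`, `describeError` and `SendEvent`, and the 150-character cut-off, are made up for this example and are not part of any Google API.

```typescript
// Parameters of the gtag.js 'exception' event.
interface ExceptionParams {
  description: string;
  fatal: boolean;
}

// Hypothetical sender type: lets the sketch run without a browser or gtag.js.
type SendEvent = (
  command: "event",
  name: "exception",
  params: ExceptionParams
) => void;

// Hypothetical helper: turn any thrown value into a short description.
// The 150-character limit is an arbitrary assumption to keep reports compact.
function describeError(error: unknown, maxLength = 150): string {
  const text =
    error instanceof Error ? `${error.name}: ${error.message}` : String(error);
  return text.length > maxLength ? text.slice(0, maxLength) : text;
}

// Send a single exception event through the supplied sender.
function reportException(send: SendEvent, error: unknown, fatal = false): void {
  send("event", "exception", { description: describeError(error), fatal });
}

// In a browser with gtag.js loaded, the wiring might look like this:
// window.addEventListener("error", (e) => reportException(gtag, e.error ?? e.message));
// window.addEventListener("unhandledrejection", (e) => reportException(gtag, e.reason));
```

Passing the sender in as an argument keeps the reporting logic testable outside a browser; in production the sender would simply be the global `gtag` function.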
As an industry, we need to get better at understanding production errors, and we need to do that quickly. Teams need to learn how to mine production errors for insights. This is the modern equivalent of tailing the log file while running an exploratory testing session. The context is different because the events happen in production, and they are generated by users instead of a tester, but the purpose is the same. Error tracking helps us watch out for strange and unexpected events, and investigate them to discover unknown-unknowns. The big difference between a tester’s log file and production errors is the ratio of signal to noise.
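One common way to improve that ratio is to group similar errors and rank the groups by how many users they affect. The TypeScript sketch below shows the idea under simplifying assumptions: the `LoggedError` shape, the `fingerprint` rules (replacing URLs and numbers with placeholders) and the function names are all hypothetical, and real error tracking tools use far more elaborate grouping.

```typescript
// Hypothetical shape of a collected production error.
interface LoggedError {
  message: string;
  userId: string;
}

interface ErrorGroup {
  fingerprint: string;
  count: number; // total occurrences
  users: number; // distinct users affected
}

// Replace variable parts of a message so similar errors share one fingerprint.
function fingerprint(message: string): string {
  return message
    .replace(/https?:\/\/\S+/g, "<url>")
    .replace(/\d+/g, "<n>")
    .trim()
    .toLowerCase();
}

// Group errors by fingerprint, most widely felt problems first.
function groupErrors(errors: LoggedError[]): ErrorGroup[] {
  const groups = new Map<string, { count: number; users: Set<string> }>();
  for (const { message, userId } of errors) {
    const key = fingerprint(message);
    const group = groups.get(key) ?? { count: 0, users: new Set<string>() };
    group.count += 1;
    group.users.add(userId);
    groups.set(key, group);
  }
  return Array.from(groups.entries())
    .map(([key, g]) => ({ fingerprint: key, count: g.count, users: g.users.size }))
    .sort((a, b) => b.users - a.users || b.count - a.count);
}
```

Sorting by distinct users rather than raw counts is a deliberate choice here: a single user stuck in a retry loop can generate a lot of noise, while an error that quietly hits many users is usually the stronger signal.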
https://gojko.net/