Bonus post - Noisy Monitors
I write a version of this letter internally at Microsoft each week. This is the one from this week - enough people internally felt it would be useful to external audiences that I decided to repost it here.
One of the most pernicious, and common, patterns that engineering teams fall prey to is the "noisy monitor". This is any kind of signal - a pager, compiler warning, or bug database - that accumulates noise relative to useful signal over time. This pattern is often the cause of outages and product failures. The problem with noise is that we are pretty good as humans at weeding it out, and we are very good at making up reasons why the warnings we are getting "aren't important". Sometimes we build tools for this, like silencing a pager, or scripts that filter out the "noisy" errors, and sometimes we just...acclimate to it.
There are lots of ways that noisy monitors cause problems. I worked with a team once whose pager went off about 2000 times in a "normal" day (ugh, right?). No problem, they would just "stfu 10" every once in a while to quiet it down - except when they had an actual outage and had to dig through about 1000 new pages a minute to see what the issue was. There was so much noise that the signal was entirely lost.
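One way to start cleaning up a pager like that is to find out which alerts produce most of the volume, so they can be fixed or deleted rather than silenced. Here is a minimal sketch, assuming each page is a dict with hypothetical "service" and "alert" fields (the field names and sample data are made up for illustration, not taken from any real paging system):

```python
from collections import Counter


def noisiest_alerts(pages, top=3):
    """Return the `top` most frequent (service, alert) pairs with their counts."""
    counts = Counter((page["service"], page["alert"]) for page in pages)
    return counts.most_common(top)


# Hypothetical sample: a handful of pages from a "normal" day.
pages = (
    [{"service": "cache", "alert": "latency_high"}] * 5
    + [{"service": "db", "alert": "disk_80pct"}] * 3
    + [{"service": "api", "alert": "error_rate"}]
)

for (service, alert), count in noisiest_alerts(pages, top=2):
    print(f"{service}/{alert}: {count} pages")
```

The point of a ranking like this is not to build a better filter. It is to produce a short list of alerts that each need a decision: fix the underlying problem, retune the threshold, or remove the alert entirely.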
But there are more subtle ways that noise in our monitors causes problems. Bug databases are another example of this phenomenon - most teams gradually accrete a large pile of old, low-ish priority bugs over time. Or even if they're not low priority, there's a temptation to say "well, we haven't fixed that in two years, it must not matter". But this obscures both newer quality issues and the overall drift of the product. It's far better to resolve each bug into the right bucket, or mark it "will not fix", to keep the tool clean for the things that matter.
Compiler warnings are another great example where this shows up - the warning is there for a reason, but usually we decide it doesn't matter enough to fix the code to not throw it. So, we silence the warning - and then it doesn't fire when we hit a case that matters. This is a case where zero tolerance is the right approach.
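Most toolchains have a switch that enforces zero tolerance by turning warnings into hard failures. As an illustration, here is a small sketch using Python's standard `warnings` module. `parse_port` and `run_strict` are hypothetical helpers invented for this example; the only real API used is `warnings.warn`, `warnings.catch_warnings`, and `warnings.simplefilter`:

```python
import warnings


def parse_port(value):
    """Hypothetical helper that warns, rather than fails, on sloppy input."""
    if isinstance(value, str):
        warnings.warn("port passed as str; converting", DeprecationWarning, stacklevel=2)
        value = int(value)
    return value


def run_strict(func, *args):
    """Call func with every warning promoted to an exception."""
    with warnings.catch_warnings():
        warnings.simplefilter("error")  # zero tolerance: any warning raises
        return func(*args)


print(run_strict(parse_port, 8080))  # clean call, no warning is issued

try:
    run_strict(parse_port, "8080")
except DeprecationWarning as exc:
    print(f"refused to ignore: {exc}")
```

With a setup like this, a warning cannot quietly pile up. The code that triggers it has to be fixed before anything runs, which keeps the signal meaningful for the case that matters.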
I read a story once about how Thomas Keller, chef of the French Laundry, would come in first thing in the morning and clean and organize all the bottles and small tools. Every day, no matter what, he would reset the kitchen to its base state as cleanly and consistently as possible. I've never seen a software team be that disciplined (at any company), but we ought to be. Keeping things that well organized requires both clarity about the problem being solved (and what's important about it) and the discipline not to be distracted...or let monitors get dirty.