An Open Letter To Monitoring/Metrics/Alerting Companies

I’d like to open up a dialogue with companies who are selling X-As-A-Service products that are focused on assisting operations and development teams in tracking the health and performance of their software systems.

Note: It’s likely that my suggestions below are already understood and embraced by many companies. I know a number of them who are paying attention to all the areas I would want them to, and/or making sure they’re not making claims about their product that aren’t genuine.

Anomaly detection is important. It can’t be overlooked. We as a discipline need to pay attention to it, and continually get better at it.

But for the companies who rely on value-add selling points such as:

  • “our product will tell you when things are going wrong” and/or
  • “our product will automatically fix things when it finds something is wrong”

the implication is that these things will somehow relieve the engineer from thinking about or doing anything about those activities, so they can focus on more ‘important’ things. “Well-designed automation will keep people from having to do tedious work”, the cartoon-like salesman says.

Please stop doing this. It’s a lie in the form of marketing material, and it’s a huge boondoggle that distracts us from focusing on what we should work on, which is to augment and assist people in solving problems.

Anomaly detection in software is, and always will be, an unsolved problem. Your company will not solve it. Your software will not solve it. Our people will improvise around it and adapt their work to cope with the fact that we will not always know what and how something is wrong at the exact time we need to know.

My suggestion is to first acknowledge this (that detecting anomalies perfectly, at the right time, is not possible) when you talk to potential customers. Want my business? Say this up front, so we can then move on to talking about how your software will assist my team of expert humans who will always be smarter than your code.

In other words, your monitoring software should take the Tony Stark approach, not the WOPR/HAL9000 approach.

These are things I’d like to know about how you thought about your product:

  • Tell me about how you used qualitative research in developing your product.
  • Tell me about how you observed actual engineers in their natural habitat, in the real world, as they detected and responded to anomalies that arose.
  • Show me your findings from when you had actual UX/UI professionals consider carefully how the interfaces of your product should be designed.
  • Demonstrate to me that the people designing your product have actually been on-call and have experience with the scenario where they needed to understand what the hell was going on, had no idea where to start looking, all under time and consequence pressure.
  • Show me that the people building your product take as a first design principle that outages and other “untoward” events are handled not by a lone engineer, but more often than not by a team of engineers, each with their own expertise and focus of attention. Successful response depends not just on anomaly detection, but on how the team shares the observations they are making with each other in order to come up with actions to take.


Stop thinking you’re trying to solve a troubleshooting problem; you’re not.


The world you’re trying to sell to is in the business of dynamic fault management. This means that quite often you can’t just take a component out of service and investigate what’s wrong with it. It means that diagnosis involves testing hypotheses that could actually make things a lot worse than they already are. It means that the phases of responding to issues have overlapping concerns, all at the same time. Things like:

  • I don’t know what is going on.
  • I have a guess about what is going on, but I’m not sure, and I don’t know how to confirm it.
  • Because of what Sue and Alice said, and what I see, I think what is going on is X.
  • Since we think X is happening, I think we should do Y.
  • Is there a chance that Y will make things worse?
  • If we don’t know what’s happening with N, can we do M so things don’t get worse, or so we can buy time to figure out what to do about N?
  • Do we think this thing (that we have no clue about) is changing for the better or the worse?
  • etc.

Instead of telling me about how your software will solve problems, show me that you’re trying to build a product that is going to join my team as an awesome team member, because I’m going to think about using/buying your service in the same way I think about hiring.

Sincerely,

John Allspaw


17 Comments

  1. Joao   •  

    That’s why I like Appdynamics so much…

  2. Micheal   •  

    Looks like vendors have already started to comment and brag about their niche solutions. Anyway, good post. My experience is… if you hear a vendor say “we do end-to-end transaction monitoring”, my advice is to simply run away from them. You will keep circling around the UI of that vendor’s tool, which will say “…our map says the problem lies there.” Go figure.

  3. NorthernDBA   •  

    Great article!
    Finally, someone has told the truth for once!

    And if someone does decide to buy/use such a tool, someone from the DBA team has to write all the rules and actions in case of any failure…

  4. Nik   •  

    Hi John. First of all, I wish to congratulate you on this commendable and unconventional take on monitoring solutions. I completely agree with your success criteria for finding the right solution. I have come across several tools as a monitoring-solutions consultant; however, very few actually follow this ideology, which is essential for standing up a real monitoring solution. I like how you start by saying that anomaly detection is just not truly possible beyond the marketing, as I have first-hand experience with such problems and I know it’s the real truth behind all the fad. Thanks for the nicely carved-out comments and suggestions.

  5. Steven Acreman   •  

    Hi, I probably work for one of the companies being addressed. We don’t have one of the flashy marketing sites that fill the internet promising instant fixes through machine learning, anomaly detection and a bunch of other snake oil, but I’ve seen them and to some extent I can see why it happens.

    I think the marketing hype is fuelled by how investment works. Having sat in numerous VC meetings now, presenting on ‘how we are different’, I can say that it’s a real struggle to convey how important it is to solve the basics. Monitoring is too complex, takes too long and often gets in the way of humans. Yet that story is weak compared to ‘we solve big data issues through the use of patented algorithms’. Nobody wants to invest in going back to the beginning to fix the fundamental problems, so what you end up with is funded companies that are a feature on top of what existed before.

    A lot of companies therefore have marketing websites that far exceed the quality of their product. Luckily, with SaaS you get to try before you buy, and it’s usually a technical sale to fairly knowledgeable people who try a few products before settling (or, like many, use quite a few tools).

  6. Jason Simpson   •  

    Hey John, overall great post. I especially liked the bullet points under “These are things I’d like to know about how you thought about your product”. I got a chance during the early release stage of our product to sit in several large data centers and observe how end users interacted with our product. I completely agree this type of feedback is critical to help deliver a quality product. Working with end users should be the golden rule for any product owner.

  7. Pingback: Self-Repairing-Monitoring Solution oder was - SQL aus Hamburg

  8. Pingback: Ghosts in the machines - O'Reilly Radar

  9. Great letter, John.

    I read it at an interesting time, as I just got off a call with a customer who said he compares indeni to FTEs (full-time equivalents). For those unfamiliar with this comparison – the customer asked himself how many people he would need to hire to get the same work done that indeni does. He said that no other product can do what we do, and humans are generally bad at doing it; therefore, replacing humans with software for the task of finding misconfigurations in their infrastructure made sense. However, it is humans who need to digest the output from our software – an observation of his that I agree with.

    The questions you ask are great; however, I’ve noticed that most customers don’t go that far. They ask simpler questions:
    1. What can your product find that others can’t? (we share a list of sample issues)
    2. How much work do I need to invest to get it up and running? (45 minutes – they don’t believe us, and then they try it and find out it’s true)
    3. How much do I need to invest to keep it running on a weekly basis? (generally, a few minutes a day to review the daily reports)
    4. How much does it cost? (an arm and a leg 🙂 )

    This has led us to focus our product on being extremely easy to set up and use, and to assume that 99.99% of the time the UI isn’t even open – the users rely on the emails the system sends them.

    And you’re right – watching people in action, fighting issues regularly, is the best way to arrive at this conclusion.

  10. Richard Cook, MD   •  

    Great post & revealing comments. Adding layers of automation to balky, awkward automation is rarely productive, even if those layers are graceful.

    This was described in the past as “nosocomial automation”. Nosocomial means ‘from those who tend the sick’ and refers to infections from bugs that, through constant exposure to antibiotics, have developed resistance to those antibiotics. The traditional approach to nosocomial infectious diseases has been to create even more powerful antibiotics, the use of which leads to even more resistance.

    Nosocomial automation is put in place to overcome the difficulties with automation — more specifically, the difficulties we have making sense of what the automation that is already in place is actually doing. The result isn’t elegant or even pretty. The processes of functional accretion and ever-extending size pile up again and again.

    There is a deeper problem here. Gathering and presenting data — even in sophisticated ways — are a fraction of the problem confronting the operator. From these data the operator must develop and test notions of what has happened, what is going on, and — most importantly — what can be done that will both fix the current problem and not spawn new ones.

    When we watch operators we discover that hypothesis generation and hypothesis testing are often combined with corrective action. Indeed, the way that things get diagnosed is often by fixing them, i.e. doing things in the reverse order of our canonical notion of what ‘must’ be happening in ops. John Flach wrote about this in a funny paper entitled “Ready, Fire, Aim” [Ready, fire, aim: Toward a theory of meaning processing systems. Attention & Performance XVII, Cambridge: MIT Press, 1999] and David Woods described the joining together of all these activities in his paper “The Alarm Problem and Directed Attention in Dynamic Fault Management” [Ergonomics 38(11), 2371-2393, 1995].

    All of us are guilty of the “systems fallacy”. This fallacy is the idea that there is a system over there somewhere and that we are, ourselves, ‘outside’ the system, i.e. independent of it. The central (but often ignored) notion of systems thinking is that the system includes the operators and the managers and everyone and everything else that influences the processes at the center of our attention. In your setting, ‘the system’ includes those things with the blinking lights and the stuff monitoring them and the people monitoring the stuff monitoring them and so forth.

    Perhaps the most stressful question facing the operator during an anomaly is, “What will happen if I push that button?” The underlying automation is so complicated and indeterminate that the operator can only guess at the source of the anomaly. If pushing the button fixes the problem we will later conclude that he or she had the correct understanding. But at the moment, pushing the button is both a possible fix for the problem and a hypothesis about what the source is. In a funny way, the therapy becomes a diagnostic test. Failure of the therapy is an indication that the diagnosis was probably wrong. It is for this reason that operators usually proceed from pushing small buttons to pushing progressively larger ones before finally pulling down on the big red switch handle in the corner.

    Nosocomial automation is identifiable by its inward, technological ‘view’. It seeks to make up for the underlying complexity of the automation. But the real challenge today is to find ways of aiding the operator, i.e. taking an outward, user-centered view. Is this all about UX? Well, better to say that there is a huge, unexamined area out there called OX.

    Nosocomial automation tends to make easy things easier while the hard stuff remains hard. The supreme irony of nosocomial automation is that it makes routine stuff seem easy to do. Because easy stuff is routine, it’s entirely possible to apply nosocomial automation and feel that progress has been made. The existence of orderly call rotas and escalation processes is testament to this. They work — mostly — and so long as they do, only one person doesn’t sleep soundly. But does anyone think that this is going to remain the case forever?

    My point (sorry about taking so long to get to it) is that monitoring is only one part of resilience. Its value depends a lot on the reacting, anticipating and learning that is happening elsewhere in the ‘system’ — mostly in the operators. Until our systems analysis includes them, it’s just another systems fallacy.

  11. Peco   •  

    John, thank you for sharing your thoughts (which I largely agree with and find encouraging). Our industry is stuck in the vicious circle of managing incidents vs. systematically eliminating the defects/flaws that could turn into incidents. Monitoring data is almost always looked at after the fact (when shit breaks I get alerted and I go investigate). What needs to happen to get past that? A question to both vendors AND operations ninjas.

  12. Matt   •  

    Hi John, it looks like the link to “dynamic fault management” is broken. I was able to find the paper by searching for https://duckduckgo.com/?q=“dynamic+fault+management”

    Thought you would want to know.

  13. allspaw   •     Author

    Thanks Matt! The URL is fixed now. 🙂

  14. Pingback: The Art of Structuring Alerts: The Pain of False Positives

  15. Denilson Nastacio   •  

    A few randomly related lessons learned in back-to-back roles in operations engineering and machine learning (long story):

    1. Automatic identification of cause(s) for outages requires access to all variables driving the outcome (outage on component A or service B).

    2. If you don’t have all the variables, neither machines nor humans can identify the correlation, but humans are just brilliant at iterating towards missing variables.

    3. What humans do in #2 is virtually impossible to automate from the outside, without complete knowledge of the bale-of-hay system at the customer and modeling of that system into nicely digestible variables.

    4. Current ML is great at looking at a set of variables and outcomes and telling people whether there is a correlation and, if so, predicting outcomes (a toy sketch of this follows the list). Elasticsearch, as one example, is doing great in that space.

    5. Most importantly, anyone who starts selling you AI before asking whether you have an established Business Intelligence practice (numeric analysis, dashboards, data collection/usage integrated into day-to-day activities) just wants your money and has no interest in your success.
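
    Here is that toy sketch for point 4: a minimal illustration, with made-up data and a hypothetical “queue_depth” metric, of checking whether a variable correlates with an outcome and then using that correlation to predict. It is not real ML, just the correlate-then-predict idea in miniature.

    ```python
    # Toy sketch: is a metric correlated with outages, and can it predict them?
    # The data and the "queue_depth" metric are made up for illustration.
    import numpy as np

    # Hypothetical history: metric samples and whether an outage followed (1/0)
    queue_depth = np.array([3, 5, 4, 40, 6, 55, 4, 60, 5, 48])
    outage = np.array([0, 0, 0, 1, 0, 1, 0, 1, 0, 1])

    # 1) Is there a correlation at all?
    r = np.corrcoef(queue_depth, outage)[0, 1]
    print(f"correlation: {r:.2f}")  # strongly positive for this toy data

    # 2) If so, a crude predictor: flag readings above the midpoint between the
    #    average "healthy" depth and the average "pre-outage" depth.
    threshold = (queue_depth[outage == 0].mean() + queue_depth[outage == 1].mean()) / 2

    def likely_outage(depth):
        """Predict 1 (likely trouble) or 0 (looks fine) for a new reading."""
        return int(depth > threshold)

    print(likely_outage(50), likely_outage(4))  # -> 1 0
    ```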

  16. Wael Altaqi   •  

    I work for a SaaS vendor. I do agree with the overall premise that perfect anomaly detection may not be possible; however, the challenge is that oftentimes the traditional metric warning/critical thresholds are unknown or impossible to quantify. Ultimately, the application developer / infra architect should specify thresholds for every related metric. The reality, though, is that they leave this task to the operator, who does not have the required expertise for such a task.

    Anomaly detection backed by proven statistical models, like standard deviation against time-series databases, is used all the time in various applications; metric data from infra and applications is not much different.
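
    For example – a minimal sketch of that kind of standard-deviation check, with an assumed window of recent samples and a 3-sigma threshold (both are illustrative choices, not recommendations):

    ```python
    # Minimal sketch: flag a metric sample that falls more than `threshold`
    # standard deviations from the mean of a recent window of samples.
    # Real metric data usually also needs seasonality handling, robust
    # statistics, and per-metric tuning.
    from statistics import mean, stdev

    def is_anomalous(window, latest, threshold=3.0):
        if len(window) < 2:
            return False  # not enough history to say anything
        mu, sigma = mean(window), stdev(window)
        if sigma == 0:
            return latest != mu  # flat series: any change stands out
        return abs(latest - mu) / sigma > threshold

    # Example: response-time samples (ms) pulled from a time-series store
    window = [102, 98, 105, 99, 101, 97, 103, 100]
    print(is_anomalous(window, 180))  # True  -- far outside 3 sigma
    print(is_anomalous(window, 104))  # False -- within normal variation
    ```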

    I would say that a vendor who cannot explicitly describe how they reach an anomaly conclusion is not worth its salt. ‘Black box’ anomaly detection is suspicious 🙂
