
The only reason we should trigger alerts

There’s only one reason we should ever trigger alerts: There is a failure or risk of failure we should act on.

Everything else is just noise.

It’s common to trigger alerts for non-failure situations or errors that require no action:

  • An API is down and will be retried later.
  • There was an exception, and an acceptable alternative path was used.
  • The API was called with bad data.

In each of these situations, the alerts are just noise because there’s no action required. A small change to the alerting criteria captures the situations that do require action:

  • An API is down and all retries have been exhausted.
  • There was an unexpected exception, and no acceptable alternative path was available.
  • The API was called with bad data at a higher frequency than expected.
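
As a rough illustration of the first case, here is a minimal Python sketch that alerts only once all retries have been exhausted. The names `call_with_retries` and `alert` are hypothetical, the alert channel is assumed to be any callable that accepts a message, and backoff between attempts is left out to keep the sketch short.

```python
import logging
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger("alerting_sketch")  # hypothetical logger name


def call_with_retries(
    operation: Callable[[], T],
    alert: Callable[[str], None],  # assumed alert channel: any callable taking a message
    max_attempts: int = 3,
) -> T:
    """Run `operation`, retrying on failure; alert only when every attempt has failed."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            last_error = exc
            # A failed attempt is not actionable on its own: log it, don't alert.
            logger.warning("attempt %d/%d failed: %s", attempt, max_attempts, exc)

    # Every attempt failed, so this is a failure someone should act on.
    alert(f"operation failed after {max_attempts} attempts: {last_error}")
    assert last_error is not None
    raise last_error
```

A call that succeeds on a later attempt produces log lines but no alert. Only the exhausted case reaches the alert channel, and the original exception is still raised for the caller to handle.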

The last case is subtly different. For internal APIs, we might never expect bad data, so we could choose to alert any time we get data that doesn’t conform to spec. For a public-facing API, we might set some threshold that alerts us to confused users, but it’s unlikely we want developer intervention every time the API gets called with bad data.
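
One way to express such a threshold is a sliding window over recent requests. The sketch below is only an illustration: the class name `BadRequestRateMonitor`, the window size, and the threshold values are all assumptions, and a real system would more likely lean on its monitoring tool's own rate-based alert rules.

```python
from collections import deque
from typing import Deque


class BadRequestRateMonitor:
    """Signal an alert when the share of bad requests in a recent window exceeds a threshold."""

    def __init__(self, threshold: float, window_size: int = 100) -> None:
        self.threshold = threshold  # e.g. 0.2 means "more than 20% bad requests"
        self.window_size = window_size
        # Holds True for each bad request, False for each good one, oldest dropped first.
        self.outcomes: Deque[bool] = deque(maxlen=window_size)

    def record(self, is_bad: bool) -> bool:
        """Record one request and return True if an alert should fire."""
        self.outcomes.append(is_bad)
        if len(self.outcomes) < self.window_size:
            return False  # not enough data yet to judge the rate
        bad_rate = sum(self.outcomes) / len(self.outcomes)
        return bad_rate > self.threshold


# Public-facing API (assumed values): tolerate some bad requests from confused users.
public_monitor = BadRequestRateMonitor(threshold=0.2, window_size=100)

# Internal API (assumed values): any single bad request is unexpected, so alert on it.
internal_monitor = BadRequestRateMonitor(threshold=0.0, window_size=1)
```

With a window of one and a threshold of zero, the same class covers the internal case, where every non-conforming request is worth an alert.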

We should limit our alerts to only failures that require action. If we do so, we’ll remove noise and ensure failures get the attention they deserve.