When building fault-tolerant services, handling errors reliably and consistently is a top priority. Errors will always happen in production and are usually transient, so your service should be able to survive an error spike; defining an SLO gives you the error budget you need to work with them.

But how should you handle errors, really? More specifically, when a method in your code returns an error or throws an exception, what is the “correct” way to handle it? I’ve worked in places where every error was logged and the team relied on alerts to get the developer’s attention. Aborting (crashing) the service on startup if any of the required dependencies aren’t available is also a possibility, though arguing for this in a job interview was once cited as a reason I didn’t get an offer.

Should you ever abort then? Isn’t it better to just log errors and rely on alerts so that your service is always up? Well, the “senior engineer” answer is “it depends”. The actual answer is “it depends, but here are some guidelines.”

Abort if possible, return error otherwise.

Aborting a service is actually good. It gives a clear signal that something is wrong with the application, and most teams set alarms for any abort, usually with detailed stack traces, which brings the developers’ attention to it. But crashing a service means reduced availability and restart penalties: depending on the service, it could take a while for it to go back to a usable state.

Errors don’t crash the service and also send a clear signal, but they can become invisible: if they only occur on a specific code path with a specific traffic pattern, they become blips in your error logs [1]. By the time you eventually look at them, it might not be possible to reproduce the scenario easily.

Aborting brings the problem to the forefront but brings your service down. Erroring could mask it until it’s too late. When should you use either?

Before that, let’s make a few assumptions:

  • You have at least 2 replicas of your service in production (pods for the k8s crowd)
  • Updating a service will not remove the replica running the older version until the new replica is healthy.
  • You have good test coverage and automated testing so any aborts / error paths are tested.
  • You have alerts for any abort and for X% of errors (doesn’t really matter which percentage as long as it is low, say less than 5%).
    • Healthy, as described above, means that the service is responding with less than the SLO error rate.
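The "healthy" definition above comes down to a small piece of arithmetic. Here is a minimal sketch of it; the `window` type, its fields and the `healthy` function are hypothetical names chosen for illustration, not part of any real health-check API.

```go
package main

// window holds request counts observed over one evaluation period.
// The type and its field names are hypothetical, chosen for this sketch.
type window struct {
	total  int // all requests served in the period
	failed int // requests that ended in an error
}

// errorRate returns the fraction of failed requests, or 0 for an idle window.
func (w window) errorRate() float64 {
	if w.total == 0 {
		return 0
	}
	return float64(w.failed) / float64(w.total)
}

// healthy reports whether the replica stays under the SLO error rate.
// sloErrorRate is a fraction, for example 0.05 for 5%.
func healthy(w window, sloErrorRate float64) bool {
	return w.errorRate() < sloErrorRate
}
```

With a 5% threshold, a window with 10 failures out of 1000 requests counts as healthy, while 80 failures out of 1000 does not.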

Abort on startup

Any error at startup that would make any part of your service unusable should result in an abort. It could be argued that starting the service with degraded performance is a better option, and it may well be in your case. But that makes maintenance and production troubleshooting harder.

Unless you are part of the team or the team has very good documentation on the degraded scenario (and let’s be real: most teams would not have documented it), the degraded state would at best be a distraction when trying to solve a real production issue and at worst could completely derail you from solving it.
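As a rough sketch of what aborting on startup could look like in Go: check everything the service cannot run without before serving traffic, and exit if anything is missing. The setting names and the `checkStartup` helper are assumptions made for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// requiredEnv lists settings the service cannot run without.
// These names are hypothetical.
var requiredEnv = []string{"DATABASE_URL", "CACHE_ADDR"}

// checkStartup returns an error naming every missing required setting.
// lookup has the same shape as os.LookupEnv so it can be faked in tests.
func checkStartup(lookup func(string) (string, bool)) error {
	var missing []string
	for _, key := range requiredEnv {
		if v, ok := lookup(key); !ok || v == "" {
			missing = append(missing, key)
		}
	}
	if len(missing) > 0 {
		return fmt.Errorf("missing required settings: %s", strings.Join(missing, ", "))
	}
	return nil
}

// In main, the check would run before serving any traffic:
//
//	if err := checkStartup(os.LookupEnv); err != nil {
//		log.Fatalf("startup failed: %v", err) // exits with a non-zero status
//	}
```

Keeping the check in a function that returns an error makes the abort path easy to test, while the decision to abort stays in one place in main.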

Never abort on user input

User input is an obvious case where you should always return an error, even if it could stay invisible for a while. Aborting on user input is a denial of service (DoS) attack vector, rendering your service unusable by anyone until you fix the code. Returning an error would affect only that code path, leaving you with a degraded but still available service.
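A minimal sketch of the "return an error" side, assuming an HTTP endpoint with a hypothetical `limit` query parameter. The `parseLimit` and `listHandler` names and the 1 to 100 range are made up for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
)

// parseLimit validates a user-supplied page size.
// Bad input yields an error, never a panic.
func parseLimit(raw string) (int, error) {
	n, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("limit must be an integer: %w", err)
	}
	if n < 1 || n > 100 {
		return 0, fmt.Errorf("limit must be between 1 and 100, got %d", n)
	}
	return n, nil
}

// listHandler rejects invalid input with a 400 and keeps serving other requests.
func listHandler(w http.ResponseWriter, r *http.Request) {
	limit, err := parseLimit(r.URL.Query().Get("limit"))
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	fmt.Fprintf(w, "returning up to %d items\n", limit)
}
```

Only the request carrying the bad value fails; every other caller is unaffected.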

Abort on programming errors

This one is a bit more contentious, especially since it can be ambiguous whether something is a programming error or not. If you are in doubt, then by all means return an error. But if you can confidently identify it as one, aborting is the correct strategy.

To illustrate, let me show you an example in Go. Go has a sync.Map type, which is a concurrency-safe hash table. It was created before generics, so both the key and the value are typed as the empty interface (any). Callers are supposed to type-assert to the correct types when needed.

Suppose I use this map in a database sharding context: the key is the shard name and the value is the database connection (*sql.DB) for the shard. Suppose also that this map can only be updated by your team, through tooling you built. There is no user input involved in deciding the shard name nor in updating it; it is 100% under your control.

In this case, it is reasonable to write code like this:

// Conn returns the database connection for the given shard.
func (s *Service) Conn(shard string) *sql.DB {
  value, ok := s.conns.Load(shard)
  if !ok {
    panic("invalid shard")
  }
  return value.(*sql.DB)
}

// UpdateConn updates that shard's database connection to newConn.
func (s *Service) UpdateConn(shard string, newConn *sql.DB) error {
  prev, existed := s.conns.Swap(shard, newConn)
  if existed {
    // Handle closing the connection in a safe way
    db := prev.(*sql.DB)
    ...
  }
  return nil
}

In both Conn and UpdateConn, the stored value is converted to *sql.DB with a type assertion. If the assertion fails, it panics and aborts the service. To be safe, it should have actually been written as:

db, ok := value.(*sql.DB)
if !ok {
  // This is a programming error, as it should always be an *sql.DB
  return fmt.Errorf("invalid database")
}
...

But that is not really needed because s.conns should always hold *sql.DB values. If it doesn’t, then a programming error occurred at some point and the service is no longer correct. The same goes for the shard name: if which shard to use is 100% under the programmer’s control, then panicking as in the example is an acceptable solution.
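To make the difference between the two assertion forms concrete, here is a small self-contained sketch. It uses a stand-in `conn` type instead of *sql.DB so it runs without a database driver; `conn`, `demoMap`, `mustConn`, `tryConn` and `panics` are all hypothetical names for this example.

```go
package main

import "sync"

// conn stands in for *sql.DB so the sketch runs without a database driver.
type conn struct{ dsn string }

// demoMap builds a map with one valid entry and one deliberately corrupted
// entry, simulating the programming error discussed in the text.
func demoMap() *sync.Map {
	m := &sync.Map{}
	m.Store("shard-a", &conn{dsn: "a"})
	m.Store("shard-bad", "not a conn")
	return m
}

// mustConn uses the single-value assertion: a wrong type or a missing shard panics.
func mustConn(m *sync.Map, shard string) *conn {
	value, ok := m.Load(shard)
	if !ok {
		panic("invalid shard")
	}
	return value.(*conn)
}

// tryConn uses the comma-ok assertion and reports failure without panicking.
func tryConn(m *sync.Map, shard string) (*conn, bool) {
	value, ok := m.Load(shard)
	if !ok {
		return nil, false
	}
	c, ok := value.(*conn)
	return c, ok
}

// panics reports whether f panicked; it exists only to demonstrate the difference.
func panics(f func()) (did bool) {
	defer func() {
		if recover() != nil {
			did = true
		}
	}()
	f()
	return false
}
```

In this sketch, mustConn on "shard-bad" panics, while tryConn on the same key returns false and leaves the decision to the caller.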

Aborting in this case is actually the better option because it forces the server to go back to startup mode. If the error was at startup, it will crash-loop. If the error was in some tooling that corrupted the map, chances are it would be corrected on startup.

Also, the API for Conn would have to change to return (*sql.DB, error) instead of just *sql.DB, forcing callers to check for the error and making the call more complex than really needed.

Aborting makes the API cleaner

This is a clear advantage of being able to abort: your API is simpler to use since you don’t need to check for an error or exceptions. In Go, for example, I wouldn’t have been able to chain calls with Conn if it returned an error. Instead of writing

rows, err := s.Conn(shard).Query("your database query")

I would have to write

db, err := s.Conn(shard)
if err != nil {
  return err
}

if db == nil {
  return fmt.Errorf("invalid shard")
}

rows, err := db.Query("your database query")

Which is clearly more verbose and, in this context, unnecessary.

So if you can, abort your service. There are both reliability and API advantages to it. Otherwise, return an error and make sure you have good monitoring and alerts in place.

Notes


  1. Unless you define your error SLO to be 0.0% errors, which you should never do.