Service Discovery: One, Some, or All

Service discovery, at its most basic, is the mechanism by which services residing at non-static addresses can be found on a computer network. There are, in a nutshell, three types of service discovery, distinguished by how they offer service endpoints to clients: one, some, or all. In most real-world scenarios this set reduces to two.

Service discovery systems are designed to return endpoints: a list of one or more logical hosts, host:port pairs, or even URLs. These endpoints generally belong to a group colloquially known as a serverset. Endpoints are also considered ephemeral - they can be de-registered at any time because of failing health checks or other network events, including intentional de-registration when a service shuts down.
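To make the vocabulary concrete, here is a minimal sketch of a serverset as a data structure. The names Endpoint and ServerSet, and their methods, are made up for illustration and do not come from any particular discovery system.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Endpoint:
    """One discoverable address; host and port are illustrative fields."""
    host: str
    port: int


@dataclass
class ServerSet:
    """A named group of endpoints that may grow or shrink at any moment."""
    name: str
    _members: set = field(default_factory=set)

    def register(self, endpoint: Endpoint) -> None:
        self._members.add(endpoint)

    def deregister(self, endpoint: Endpoint) -> None:
        # Covers failed health checks, network events, and clean shutdowns.
        # Removing an endpoint that is already gone is not an error.
        self._members.discard(endpoint)

    def endpoints(self) -> list:
        # A snapshot only: it may already be stale by the time it is used.
        return sorted(self._members, key=lambda e: (e.host, e.port))
```

The point of the snapshot comment is the ephemeral part: whatever a client reads from a serverset is a view of the past, not a promise.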


A service discovery system of the “one” type is concerned with discovering a single logical endpoint, and will generally not attempt to use more than one endpoint even if several are returned.

A classic case of the “one” discovery type is DNS (though the type is not limited to this use case). Many service discovery frameworks and schemes, including some built at a bird website company, do not provide good first-class support for this model, or do not encourage its use in a production system relying on manual service registration in a front-end server. This leads to hacks, frustration, and general grief. Scripts, developers, and operations really need to be able to run curl on the command line - don’t break this mode of operation for them.
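A “one” client can be sketched in a few lines on top of the standard resolver. The function name resolve_one and its injectable resolver parameter are assumptions for illustration; the default is Python's socket.getaddrinfo.

```python
import socket


def resolve_one(name, port, resolver=socket.getaddrinfo):
    """Return a single (address, port) pair for name, ignoring any extras."""
    results = resolver(name, port, type=socket.SOCK_STREAM)
    if not results:
        raise LookupError(f"no endpoint found for {name}")
    # getaddrinfo yields (family, type, proto, canonname, sockaddr) tuples.
    # A "one" client takes the first and does not look at the rest.
    sockaddr = results[0][4]
    return sockaddr[0], sockaddr[1]
```

This is roughly what curl and most scripts do implicitly: ask for a name, get an address, connect.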


The “some” use case assumes that a client wants a partial view of the serverset. This could be a small handful (more than one, for redundancy), or a large number (for instance, to load balance a huge volume of requests). This is the resilient use case - you accept less than the total population. It is also the efficient one. The alternative, a large number of clients each enumerating thousands of available endpoints, is slow and chatty, and has on more than one occasion saturated the backends used for discovery. As the world changes under them, such as when services move because of a deploy or fail-over, all of those clients strain to keep their endpoint lists up to date on every change. This isn’t likely to hurt you even with hundreds of clients, but once the 1,000-client mark is crossed things go south quickly.
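One way to picture the “some” model is a client that asks for a bounded number of endpoints and is content with fewer. This is only a sketch: pick_some and its parameters are hypothetical names, and random sampling is one selection policy among many.

```python
import random


def pick_some(endpoints, want, rng=random):
    """Return up to `want` endpoints chosen at random from the known set."""
    pool = list(endpoints)
    if want <= 0 or not pool:
        return []
    # Accept fewer than requested: a short list is still usable.
    return rng.sample(pool, min(want, len(pool)))
```

Because each client holds only a small slice, a deploy that moves a few endpoints forces a refresh in only the clients that happened to hold them.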

DNS is not precluded from serving in this role either: SRV records, and even multiple DNS RR results for a single name, give clients that need more than one endpoint something to query and enumerate.
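As a rough illustration of how a client might walk SRV results, the sketch below orders records that have already been parsed: lowest priority first, with a weighted shuffle inside each priority tier. No DNS library is called here; SrvRecord and order_srv are hypothetical names, and the weight handling is a simplification of the selection rules in the SRV specification.

```python
import random
from collections import namedtuple

SrvRecord = namedtuple("SrvRecord", "priority weight port target")


def order_srv(records, rng=random):
    """Order SRV records: lowest priority first, weighted shuffle per tier."""
    ordered = []
    for priority in sorted({r.priority for r in records}):
        tier = [r for r in records if r.priority == priority]
        while tier:
            # weight + 1 keeps zero-weight records selectable (a simplification).
            weights = [r.weight + 1 for r in tier]
            choice = rng.choices(tier, weights=weights, k=1)[0]
            tier.remove(choice)
            ordered.append(choice)
    return ordered
```

A client that needs “some” endpoints can take the first few entries of the ordered list and stop there.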


“All” is the mode of fallacy, and it is often the mode many libraries and systems first operate in. The workflow for discovery in non-tolerant systems generally involves:

  1. Find serverset
  2. Start enumerating large serverset
  3. Fail for any reason
  4. Return no endpoints

This isn’t resilient code, but in my experience that didn’t stop more than one service discovery agent from following this model. The “all” model does offer one compelling use case, client-side sharding of data: if all endpoints are registered, clients can skip proxies and access data directly on the origin node.
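The four-step failure workflow, and a more tolerant alternative, can be sketched side by side. Both functions are illustrative; fetch stands in for whatever call reads one member of the serverset and is assumed to raise on failure.

```python
def enumerate_strict(names, fetch):
    """The fragile workflow: any single failure discards everything."""
    try:
        return [fetch(name) for name in names]
    except Exception:
        return []


def enumerate_tolerant(names, fetch):
    """Keep whatever could be read, and report what could not."""
    found, failed = [], []
    for name in names:
        try:
            found.append(fetch(name))
        except Exception:
            failed.append(name)
    return found, failed
```

With three members and one unreadable entry, the strict version hands back nothing while the tolerant one still returns two usable endpoints.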

“All”, in practice, never exists. Because of network partitions and software failures, you can never guarantee that every endpoint can be enumerated, or even that every endpoint that wants to be registered is. If you depend on “all” for shard detection, you need a more resilient model that does not depend on ephemeral registration: a node is in the system, and holds a given shard of the data, according to a more static intent map. You can use serversets as a gauge of health, but not of underlying registration.
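A small sketch of that split, with ownership coming from a static intent map and the serverset used only as a health signal. SHARD_OWNERS, the node names, and both functions are hypothetical, and the hash-modulo scheme is just one simple way to assign keys to shards.

```python
import hashlib

# Hypothetical static intent map: shard number -> the node meant to own it.
SHARD_OWNERS = {0: "db-a:5432", 1: "db-b:5432", 2: "db-c:5432"}


def shard_for(key, shard_count=len(SHARD_OWNERS)):
    """Map a key to a shard with a stable hash (not Python's salted hash())."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def locate(key, registered):
    """Return (owner, healthy): ownership from intent, health from discovery."""
    owner = SHARD_OWNERS[shard_for(key)]
    return owner, owner in registered
```

If a node drops out of the serverset, the key still maps to the same owner; the client simply learns that the owner looks unhealthy instead of silently re-sharding.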

The failures of the “all” model are especially important to monitoring systems: a lot of designs assume a globally correct discovery model to find metrics and health checks. While you can provide a mostly globally correct intent model (“I expect these things to be running in this number”), you cannot depend on a “discover everything” model to locate running services.
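For monitoring, the intent model can be as plain as a table of expected counts checked against whatever discovery currently shows. The function coverage_report and the shape of its inputs are assumptions made for this sketch.

```python
def coverage_report(expected, discovered):
    """Compare intended instance counts with what discovery currently shows."""
    report = {}
    for service, want in expected.items():
        seen = len(discovered.get(service, []))
        report[service] = {
            "expected": want,
            "seen": seen,
            "missing": max(want - seen, 0),
        }
    # Services that are running but absent from the intent deserve a flag too.
    unexpected = sorted(set(discovered) - set(expected))
    return report, unexpected
```

The difference from “discover everything” is that a service which vanishes from discovery shows up as missing, instead of quietly disappearing from the dashboards.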

One of the larger changes to the bird website monitoring stack involved breaking this assumption. While the assumption was serviceable at small scale, it broke down further and further as scale increased. Replacing the discovery model with a “one” analogue (which still required reading “some” of the endpoint nodes in ZooKeeper to provide a semblance of load balancing) brought an immediate improvement in data quality, as only a very small set of endpoints needed to be resolved in order to operate.

And more

With cloud computing and containerization, service discovery will remain a space full of innovation. Yet it is also rife with incompatibilities - no two systems are the same. The systems that get this right go back to basics. For example, Kubernetes provides DNS-based discovery with load balancers. This allows both the curl use case and the large-request-volume use case to work transparently (something commonly missing in these systems). The state of the industry is improving - let’s not all fall into the same traps.