22. April 2021
5 min

Kubernetes Service Meshes - the What and the Why

This is the first in a small series of blog posts about Kubernetes Service Meshes. While later posts will deal with some specific implementations and their respective details, here I first start out by introducing the general picture and the basic concepts.

All contents presented here will be covered in more detail as part of the Novatec Cloud Platform Journey.

Background

In a cloud-native environment, e.g. in a Kubernetes cluster, each software component of an application is free to be implemented as a microservice using any technology that seems fit for the task (i.e. altogether running a polyglot stack), as long as they are able to communicate with each other in a reliable and secure way. For example, such a stack might then look like this:

Microservices with Ingress

Even in such a simple setup the microservices rely heavily on the network, and all the more so as the interaction between the individual components grows. And that is just where a Service Mesh comes into play.

What I do not have yet in that setup are network and security policies, nor any detailed insight into what is going on at the network layer within the cluster. So far each Pod is able to communicate with each Service, and their data exchange is not encrypted in transit. I also have no means of selectively splitting the traffic within the cluster, e.g. for smart load distribution in canary or phased deployments, or of rate-limiting (or retrying) certain requests, and so on. Of course, I could start to implement appropriate measures in the software components themselves to achieve these features, but that introduces considerable development overhead, the more so the more diverse the software stack is. I would rather utilize a standardized and encompassing set of tools to tackle this at the platform level.
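To give a taste of what such platform-level traffic control looks like, here is a sketch of a canary split using Istio's VirtualService resource (one of the implementations discussed below). The service name and subset names are hypothetical, and the subsets would additionally have to be defined in a matching DestinationRule:

```yaml
# Route 90% of in-cluster traffic for "my-service" to subset v1
# and 10% to the canary subset v2 (all names are illustrative).
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-service
spec:
  hosts:
    - my-service
  http:
    - route:
        - destination:
            host: my-service
            subset: v1
          weight: 90
        - destination:
            host: my-service
            subset: v2
          weight: 10
```

Note that no application code is involved here at all: the split is declared as cluster configuration and enforced by the mesh.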

A Service Mesh

  • allows me to separate the application's business logic from observability concerns
  • provides network and security policies
  • allows me to connect, secure, and monitor microservices

Just as containers abstract away the operating system from the application and its direct software dependencies, a Service Mesh abstracts away how inter-process communications are handled. In general, a Service Mesh layers on top of the Kubernetes infrastructure and makes communication between services over the network safe and reliable, allowing one both to observe and to control the traffic flow:

Traffic Control

Overall this sounds a bit similar to what I have shown with Ingress, but Ingress focuses on calls from the outside world into the cluster as a whole, whereas a Service Mesh (mostly) deals with the communication inside a cluster.

How does it work?

In a typical Service Mesh, service Deployments are modified to include a dedicated sidecar proxy container, following the ambassador pattern for multi-container Pods. Instead of calling other Services directly over the network, each affected Pod calls its local sidecar proxy, which in turn encapsulates the complexities of the service-to-service exchange at the Pod level, providing all the features hinted at above. The interconnected set of such proxies in a Service Mesh implements what is called the data plane. The sidecar proxies act as both forward and reverse proxies, handling outgoing and incoming calls alike.
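Conceptually, a Pod after sidecar injection looks roughly like the following sketch; the image names and ports are illustrative placeholders, not those of a specific mesh implementation:

```yaml
# The application container is unchanged; a proxy container is
# added alongside it and transparently handles all Pod traffic.
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
    - name: app                # the actual microservice
      image: registry.example.com/my-app:1.0
      ports:
        - containerPort: 8080
    - name: sidecar-proxy      # intercepts inbound and outbound calls
      image: registry.example.com/some-proxy:latest
      ports:
        - containerPort: 15001
```

In practice the mesh also rewires the Pod's networking (e.g. via iptables rules set up by an init container) so that traffic to and from the application is routed through the proxy without the application noticing.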

So, this

Inter-container communication without data plane

becomes this

Inter-container communication with data plane

On the other hand, the set of control structures that steers proxy behavior across the Service Mesh is referred to as its control plane. The control plane is where policies are specified and the data plane as a whole is configured: it’s a set of components that provide whatever machinery the data plane needs to act in a coordinated fashion, including service discovery, TLS certificate issuing, metrics aggregation, and so on. The data plane calls the control plane to inform its behavior; the control plane in turn provides an API to allow the user to modify and inspect the behavior of the data plane as a whole:
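As a concrete example of this API, most implementations ship a CLI that talks to the control plane. Assuming Istio or Linkerd is installed in the cluster, the following commands inspect whether the data plane proxies are in sync and healthy:

```shell
# Istio: list all sidecar proxies and their configuration
# sync status with the control plane (istiod)
istioctl proxy-status

# Linkerd: verify that the data plane proxies are healthy
linkerd check --proxy
```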

Control Plane and Data Plane of a Service Mesh

Both a data plane and a control plane are needed to implement a Service Mesh. The actual details, however, depend on the chosen implementation, as I will show in future blog articles.

The process of adding sidecars to deployment artifacts and registering them with the Service Mesh control plane is called sidecar injection. All common Service Mesh implementations support manual and automatic sidecar injection.
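To illustrate, with the two implementations covered later this looks roughly as follows; the namespace and file names are placeholders:

```shell
# Automatic injection (Istio): label a namespace so that newly
# created Pods get the sidecar added by a mutating webhook
kubectl label namespace my-namespace istio-injection=enabled

# Automatic injection (Linkerd): annotate the namespace instead
kubectl annotate namespace my-namespace linkerd.io/inject=enabled

# Manual injection (Linkerd): rewrite a manifest before applying it
linkerd inject deployment.yml | kubectl apply -f -
```

In both automatic cases, already running Pods are not modified retroactively; they pick up the sidecar on their next restart.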

And while at first glance it might seem troubling to introduce two additional hops into each inter-Pod communication, each increasing latency and consuming cluster resources, plus the resources spent on the control plane itself, it does have its merits, as I will show later on.

Implementations

A Service Mesh is not automatically part of a Kubernetes cluster. Choosing the implementation that best fits a cluster can be a difficult task, as several are available: some are more focused in scope and thus easier to deploy with a lower overhead (e.g. Linkerd), some take a more general approach and are thus more featureful but come with considerable resource demands (e.g. Istio), and some are offered by specific cloud service providers (e.g. AWS App Mesh or Microsoft's Open Service Mesh (OSM)) or rooted in a specific ecosystem (e.g. HashiCorp's Consul Connect). Refer to this overview and to these Selection Criteria for more details, and to the latest CNCF survey for their relative popularity.

One may even deploy any number of Service Meshes within a cluster and add resource annotations or labels to select the desired one ad hoc, or install and configure each to serve only a specific namespace. This allows one to always pick the best tool for the task at hand, of course at the price of additional complexity.

While the value of the Service Mesh may be compelling, one of the fundamental questions any would-be adopter must ask is: what's the cost? Cost comes in many forms, not the least of which is the human cost of learning any new technology, but also the resource cost and performance impact of the Service Mesh at scale. Generally, Service Meshes incur an acceptable overhead under regular operating conditions compared to a baseline without a mesh, but there are considerable differences between implementations. For instance, see this benchmark analysis of Linkerd and Istio.

Kubernetes Service Mesh implementations covered in future posts

I will focus on the most commonly used Service Mesh implementations as per the most recent CNCF survey. However, as they differ greatly in their implementation details, each will be dealt with in a separate article. So, in future posts I will provide further information on

  • Linkerd, a CNCF incubating project. Linkerd is the very first Service Mesh project and the one that gave birth to the term "service mesh" itself.
  • Istio, an open platform-independent service mesh that provides traffic management, policy enforcement, and telemetry collection.
  • Consul Connect, service-to-service connection authorization, encryption and observability rooted in the HashiCorp ecosystem.
Image Sources: (c) Samuel Zeller on Unsplash
