What is a Service Mesh?
If you already know what a Service Mesh is, you can skip this section. For everybody else, I want to give a brief overview of the concept and explain the problems it solves. For more details, you can also read this blog post by one of my colleagues.
Imagine the architecture of your application looks similar to the simplified one below. Sooner or later during development, you will probably face some problems: There is this team member who re-deployed a service, and the service's address changed. Your database crashes because a service floods it with constant retry requests, and your service cannot come back up. You need to debug a problem, but you are missing metrics. Or your project now requires that the microservices communicate with each other via TLS.
These are just a few examples of infrastructure problems that distract developers from implementing business logic. And this is where the Service Mesh comes to the rescue: Why not just separate all this from the business logic?
The Service Mesh pattern introduces a sidecar proxy for each microservice. All communication now takes place via these proxies. This makes it possible to add many features that are handled by the sidecar instead of the microservice. Examples are service discovery, circuit breaking, authentication and authorization, encryption, and load balancing. Together, the proxies are also referred to as the data plane.
In addition, the control plane manages all sidecars. It places the sidecars next to the microservices and controls their configuration.
As the Service Mesh is only a pattern, we still need an implementation or a framework: Linkerd and Istio are the most popular Service Mesh implementations. Istio injects containers of the Envoy proxy as sidecars.
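In Istio, for example, the sidecar injection can be enabled per namespace via a label; the control plane then automatically adds an Envoy container to every pod deployed there. A minimal sketch (the namespace name `my-app` is just a placeholder):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: my-app
  labels:
    # Tells Istio's control plane to inject an Envoy
    # sidecar into every pod created in this namespace.
    istio-injection: enabled
```

The same effect can be achieved imperatively with `kubectl label namespace my-app istio-injection=enabled`.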
The Service Mesh and Asynchronous Communication
The Service Mesh sounds like a good solution, at least if your microservices communicate synchronously.
Maybe you read about the Service Mesh pattern and wanted to apply it to your event-driven architecture. Maybe you also saw Gwen Shapira's talk on Service Meshes and Kafka, or you read Kai Waehner's blog post about it. And now you think: Amazing! I want this!
Here comes the disillusionment: This is currently not possible, or only with many limitations and not in production environments. Service Mesh frameworks currently mainly support HTTP, but not the protocols usually used in asynchronous systems, e.g., Kafka, AMQP, or MQTT.
There was a GitHub issue in 2018 that collected ideas on what could be implemented if Envoy supported the Kafka protocol. The suggestions were promising, but unfortunately, development has not moved forward much since then.
To take advantage of all the benefits offered by the Service Mesh, the proxy must understand the corresponding layer 7 protocol. Envoy only fully supports HTTP, gRPC, MongoDB, DynamoDB, and Redis on layer 7. There is also an experimental contrib image for other layer 7 protocols like Kafka, RocketMQ, or MySQL. However, even though this image includes Kafka, the functionality is still limited. And other protocols like AMQP, MQTT, or JMS are not supported at all.
For Kafka, Adam Kotwasinski developed two Envoy filters which enable at least a few features. The Kafka Broker filter makes it possible to collect Kafka metrics. The Kafka Mesh filter handles the dynamic routing of events depending on the topic name. Currently, the latter cannot be used with Istio because it collides with the TCP proxy filter that is required for every sidecar in the Service Mesh.
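As a rough sketch of how the Kafka Broker filter is wired up (based on Envoy's documentation; the port, stat prefix, and cluster name are placeholders), it sits as a network filter in front of the TCP proxy that forwards the raw traffic to the broker:

```yaml
# Envoy listener fragment: parse Kafka requests/responses for
# metrics, then hand the bytes to a plain TCP proxy.
filter_chains:
- filters:
  - name: envoy.filters.network.kafka_broker
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.network.kafka_broker.v3.KafkaBroker
      stat_prefix: kafka        # metrics appear under this prefix
  - name: envoy.filters.network.tcp_proxy
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
      stat_prefix: tcp
      cluster: kafka_backend    # upstream cluster pointing at the broker
```

Note that the filter only observes the protocol; the actual proxying is still done by the TCP proxy filter behind it.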
Besides these layer 7 functionalities, it is possible to use some basic but essential features at the layer 4/TCP level:
- authentication and authorization using mTLS (Istio enables easy configuration via automatic mTLS)
- encrypted communication using TLS
- basic TCP proxy functionalities, e.g., routing of requests
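With Istio, for example, the mTLS behavior mentioned above can be enforced with a small `PeerAuthentication` resource; traffic between sidecars is then encrypted and mutually authenticated even though the payload is an opaque TCP stream (the namespace name is a placeholder):

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: my-namespace
spec:
  mtls:
    # STRICT: only accept mTLS traffic between sidecars;
    # plaintext connections to the workloads are rejected.
    mode: STRICT
```

This works for Kafka clients and brokers as well, precisely because it operates below layer 7.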
Alternatives to the Service Mesh
As the idea of a Service Mesh for asynchronous communication seems very popular, it is surprising that development has made only small steps so far. Confluent also published a white paper stating that the benefits of Istio for Kafka currently apply only to specific requirements. As the functionality of the Envoy filters for Kafka evolves, this will probably change.
Requirements like those listed in Envoy's GitHub issue could also be addressed differently, not only by using proxies in a Service Mesh. Instead of implementing additional filters for Envoy, it is also possible (and probably easier?) to develop a protocol-specific proxy directly. AsyncAPI is working on an Event Gateway that is intended to realize many of the proposed ideas, e.g., schema validation of events, filtering, manipulation, aggregation, and more.
Another option is to use a tool that only addresses the requirements you actually have. For example, if protocol translation is needed, Strimzi's Kafka Bridge allows clients to use AMQP 1.0, HTTP, or the Kafka protocol. As another example, this Kafka proxy terminates TLS requests and handles communication with Kafka clusters. It can also translate Kafka's advertised listeners or add authentication and authorization using SASL and LDAP.
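To illustrate the protocol-translation idea, here is a small sketch of how an HTTP-only client could publish an event through Strimzi's Kafka Bridge REST API (the bridge URL and topic name are placeholders; the helper function is my own, not part of any library):

```python
import json

def build_bridge_request(topic, records, bridge_url="http://my-bridge:8080"):
    """Build the parts of an HTTP request for Strimzi's Kafka Bridge.

    The bridge exposes POST /topics/{topic}; records are wrapped in a
    {"records": [...]} envelope with a Kafka-specific content type.
    """
    return {
        "url": f"{bridge_url}/topics/{topic}",
        "headers": {"Content-Type": "application/vnd.kafka.json.v2+json"},
        "body": json.dumps({"records": [{"value": r} for r in records]}),
    }

request = build_bridge_request("orders", [{"id": 1, "status": "created"}])
# Sending it, e.g. with the `requests` library (not executed here):
# requests.post(request["url"], headers=request["headers"], data=request["body"])
```

The client needs no Kafka library at all; the bridge translates the HTTP call into the Kafka protocol.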
The downside of these approaches is that if multiple proxies are needed, you have to manage them yourself; they are not controlled by anything like the control plane of a Service Mesh. Additionally, if new requirements arise, the proxy may not meet them, and yet another component will be required.
Event Mesh to the rescue?
When researching the topic "Service Mesh and asynchronous communication", the term Event Mesh might also pop up at some point. Especially Solace and Red Hat are using this term on their websites, in white papers, and in YouTube videos. The name suggests that an Event Mesh is just the same as a Service Mesh, but for asynchronous and event-driven communication. Isn't this what we required in the section above?
Event Mesh and Service Mesh solve problems that originate from the same problem space, but they do not answer the same questions. Both patterns have in common that they may become relevant when the architecture of a system grows and more systems and services need to be connected. All this needs to be managed reliably.
So, what are the characteristics of an Event Mesh? When analyzing the white papers and website articles of Red Hat and Solace, the following aspects become clear: An Event Mesh…
- … consists of a cluster of event brokers which store the events durably. This way, events can be processed as they occur or retrospectively.
- … connects and supports diverse deployment environments: Cloud, non-cloud, multi-cloud, and hybrid-cloud.
- … enables an independent technology stack: protocol compatibility and translation as well as multiple client APIs.
- … meets basic requirements: Scalability, reliability, and security.
The architecture then looks like the one below:
If you are familiar with different event brokers, you might notice that most of these requirements can already be met. In contrast to the Service Mesh for event-driven architectures, the Event Mesh can already be implemented for the most part.
For example, Apache Kafka makes it possible to set up a scalable cluster of Kafka brokers across multiple cloud regions and data centers. It is also possible to connect an on-premises cluster with a cloud cluster. Additionally, thanks to an active community, there are client libraries for every widely used programming language.
The only critical aspect is protocol compatibility and translation, which is still limited with Kafka. Nevertheless, it is possible to build custom protocol translation solutions, as mentioned before. Besides that, there are other event brokers which natively support multiple protocols, e.g., Pulsar, which enables the usage of Kafka, MQTT, AMQP, and RocketMQ. Solace and Red Hat also explicitly advertise that an Event Mesh can be implemented with their products.
To summarize, Event Mesh and Service Mesh are not the same. As Solace also states, both can complement each other. Nevertheless, for event-driven microservices, the problems that are addressed by a Service Mesh are still not solved by an Event Mesh. While sidecars are the central aspect of a Service Mesh, they play no role in an Event Mesh. Event-driven microservices still implement non-business logic that could be handled by a proxy. Therefore, solutions like Envoy are still needed to address asynchronous communication.
The Service Mesh pattern provides advantages not only for synchronous but also for asynchronous and event-driven communication. The discussion on GitHub resulted in many valuable ideas on how a proxy could be used in an event-driven architecture. However, the implementation of these suggestions progresses only slowly. In the meantime, other projects try to close the gap and implement protocol-specific proxies.
Unfortunately, in the current state of development, a Service Mesh can only provide limited benefits for an event-driven architecture, e.g., authentication, metrics collection, and routing.
On the other hand, the term Event Mesh is used by companies like Red Hat and Solace. The name suggests that the Event Mesh and the Service Mesh are similar paradigms. But in fact, they solve different kinds of problems. The requirements that are addressed by a Service Mesh cannot be met by an Event Mesh for asynchronous systems. Therefore, we can only hope that new Envoy filters will be developed or that other projects address these problems.
What are your use cases for the Service Mesh pattern and event-driven services? Feel free to leave a comment!