Microservice Communication Design Patterns
The goal of the microservices is to sufficiently decompose/decouple the application into loosely coupled services organized around business capabilities. The distributed micro-units collectively serve application purposes.
Transactions (Read and Write) spanning over multiple services becomes inevitable after breaking a single application into microservices. Then communication across microservices boundaries — workflow management — data storage mechanism becomes challenging. The system should adhere to a canon known as the Fallacies of distributed computing. ACIDity guaranteed by the database system cannot be ensured when the transaction is handled across multiple services (each with its own business logic & database). CAP theorem dictates that you will have trade-offs between consistency(C) and availability (A) since Partition tolerance (P) is an undesirable reality in a distributed system. In this blog post, we will explore the solutions to these challenges and design patterns.
Coordinating Inter-service Communication
Clients & services targetting a different context & goals can communicate through different mechanisms. Depending upon protocol, it can be synchronous or asynchronous.
Synchronous Communication — Request Response Approach
In synchronous communication, a predefined source service address required, where exactly to send the request, and BOTH the service (caller and callee) should be up and running at the moment. Though Protocol may be synchronous, I/O operation can be asynchronous where the client need not necessarily wait for the response. This is a difference in I/O and Protocol. The common request-response approach common to web API includes REST, GraphQL, and gRPC.
Asynchronous Communication
In the case of asynchronous communication, callers need not have the specific destination of the callee. Handling multiple consumers at a time becomes relatively easy (as services may add up consumers). Moreover, the message queues up if the receiving service is down & proceeds later when they are up. This is particularly important from the perspective of loose coupling, multi-service communication, and coping up with partial server failure. These are determining factors for inclining microservices towards Async communication. Asynchronous protocols like MQTT, STOMP, AMQP are handled by platforms like Apache Kafka Stream, RabbitMQ.
Understanding where and when to use synchronous model versus asynchronous model is a foundational decision to designing effective microservice communication. You can analyze the REST-based vs asynchronous communication in microservices here.
Message & Event
In asynchronous communication, the common mechanism is messaging & event streaming.
Message
A message is an item of data that is sent to a specific destination that encapsulates the intention/action (what has to happen) and distributed through channels such as messaging. Queues store messages until they are processed and deleted. In a message-driven system, addressable recipients await the arrival of messages and react to them, otherwise lying dormant.
Event
The event encapsulates the change in the state (what has happened) and listeners are attached to the sources of events such that they are invoked when the event is emitted.
Domain Events — Event associated with the business domain generated by the application (OrderRequested, CreditReserved, InventoryReserved in the following diagram). These events are a concern for Event Sourcing.
Change Events — Event generated from database indicating state transition. These events are a concern for Change Data Capture.
Microservice Principle — Smart consumer Dumb Pipe
The microservice community promotes the philosophy of smart endpoints and dumb pipes. Martin Fowler advocates what he calls smart endpoints and dumb pipes for microservices communication. ESB that ruled the SOA universe has multiple problems associated with complexity, cost, and troubleshooting.
Protocols of Asynchronous Communication
MQTT — Message Queue Telemetry Transport (MQTT) is an ISO standard pub-sub based lightweight messaging protocol used widely in the Internet Of Things.
AMQP — Advanced Message Queuing Protocol (AMQP) is an open standard application layer protocol for message-oriented middleware.
STOMP — Simple Text Oriented Messaging Protocol, (STOMP), is a text-based protocol modeled on HTTP for interchanging data between services.
For an in-depth comparison of these protocols, refer here.
Common Messaging / Streaming Platform
Some of the common baseline for evaluation criteria include availability, persistence/durability, durability, pull/push model, scalability & consumer capability. You can refer here for a detailed comparison of these platforms.
Microservices Design Pattern
Microservices are built on the principle of independent and autonomous services, scalability, high cohesion with loose coupling, and fault tolerance. This will introduce challenges including complex administration and configuration. A design pattern is about describing a reusable solution to a problem in a given specific context. We will discuss these patterns to address the challenges to provide proven solutions to make architecture more efficient.
Saga Pattern — Maintaining Atomicity Across Multiple Services
A single transaction is likely to span across multiple services. For example, in an e-commerce application, a new order (linked with order service) should not exceed the customer credit limit (linked with customer service) and the item (linked with inventory service) should be available. This transaction simply cannot use a local ACID transaction.
A saga is a sequence of local transactions that updates each service and publishes a message/event to trigger the next local transaction. In case of failure of any of the local transactions, saga executes series of compensating transactions that undo changes made by preceding local transactions thereby preserving atomicity.
Choreography Based saga — participants exchange events without a centralized point of control.
Orchestration Based saga — a centralized controller tells the saga participants what local transactions to execute.
Choosing among these two patterns depends upon workflow complexity, participants number, coupling, and other factors explained in detail here.
Two-Phase Commit
Similar to the saga, a transaction occurs in two-phase: Prepare & Commit phase. In the prepare phase, all participants are asked to prepare data & in the commit phase, actual changes are made. However, being synchronous with unwanted side effects and performance issues, it is considered impractical within microservice architecture.
Event Sourcing — Alternative to State Oriented Persistence
The traditional way to persist the data is to keep the latest version of the entity state by updating existing data. Suppose, if we have to change the name of a user entity, we mutate the present state with a new user name. What if we need a state rebuild at any point in time or a time travel? In such cases, we need to consider the alternatives to this persistence strategy.
In contrast to this state-oriented persistence, Event Sourcing stores each state mutation as a separate event called event and the application state is stored as a sequence/logs of immutable events instead of modifying the data. By selectively replaying the events, we can know the application state at any point in time. The application persists in the append-only event log called event store. A well-known example is the transaction log of transactional database systems.
Event sourcing depends upon three service layers:
Command: request for state change handled by a command handler.
Event: immutable representation of state change.
Aggregate: aggregated representation of the current state of Domain Model.
Event sourcing is beneficial in terms of providing accurate audit logging, state rebuild — any point of time, easy temporal queries, time travel, performance & scalability factors. Netflix addressed offline download features with event sourcing. The implementation details with a typical example are discussed here.
CQRS — Command Query Responsibility Segregation
What if we design CRUD operation in such a way that it can be handled by two independent reads & write models? It obviously adds complexity to the system but what are the benefits & when do we need it? This segregation facilitates adding another layer of scalability, performance, and flexibility allowing granular read-write optimization in addressing sophisticated domain models.
Event Sourcing and CQRS
These are often cited as complementary patterns.
“You can use CQRS without Event Sourcing, but with Event Sourcing, you must use CQRS” - Greg Young — CQRS and Event Sourcing — Code on the Beach 2014.
As mentioned earlier, the event store consists of a sequence of immutable events. Oftentimes business requirements want to perform complex queries, that can’t be answered by a single aggregate. Replaying the sequence of events each and every time will be computationally costly (and will not be practical in huge data sets). In such a case, segregation will prove beneficial.
Transactional Outbox Pattern
In some contexts, we need to make updates in the database and invoke another action typically on the external system. For e.g, in an e-commerce application, we need to save orders and send an email to the customer. If either of the transaction fails, it could leave the system inconsistent.
In such a case, outbox and message relay can work together to reliably persist state and invoke another action. An “outbox” table resides in the service’s database. Along with the primary changes (for e.g. creating order in order table), the record representing the event(orderPlaced) is also introduced to the outbox table in the same database transaction. In the non-relational database, it is usually implemented by storing events inside the document.
Change Data Capture (CDC)
Application states are persisted in the database. Change Data Capture tracks changes in a source database and forwards those changes to the target destination to synchronize with the same incremental changes. CDC can be Log Based ( transactional databases store all changes in a transaction log ) or Query Based (regularly checking the source database with the query as transaction log may not be available in databases like Teradata).
Considerations for Microservice Design
We will briefly introduce some miscellaneous ideas/principle required while designing microservices.
Idempotent Transactions
Idempotent transactions are those transactions making multiple identical requests that have the same effect as making a single request. In a REST API, the GET method is Idempotent (can be called repeatedly guaranteeing the result same as processing the method once) whereas the POST method is not Idempotent (item keeps adding on each request).
Within the context of a distributed system, you cannot have exactly-once message delivery. Message broker, such as Apache Kafka or RabbitMQ implements at-least-once delivery that creates the possibility of multiple invocations for the same transaction. Thus, in a distributed system consumer needs to be idempotent. If a consumer is not idempotent, multiple invocations can lead to bugs & inconsistencies.
Eventual Consistency
In a distributed system, consistency defines whether & how the updates made to one node/services are propagated to all services. Also referred to as Optimistic Replication, Eventual consistency is simply an acknowledgment that there is an unbounded delay in propagating the change made on one machine to all the other copies.
Network Partition is an undesirable reality of distributed systems that networks can fail. Since Partition tolerance (P) is inevitable, CAP theorem dictates that you will have trade-offs between consistency and availability. If you pick availability, you cannot have strong consistency, but still, you can provide eventual consistency in your system.
Many business systems are more tolerant of data inconsistencies than usually believed favoring availability over consistency. The BASE (Basically Available, Soft state, and Eventual Consistency) system is prized over the ACID system.
“Maintaining strong consistency is extremely difficult for a distributed system, which means everyone has to manage eventual consistency.”
— Martin Fowler
Distributed Tracing
In microservices, metadata associated with the request (that may span over multiple services) will be helpful for different reasons: monitoring, log aggregation, troubleshooting, latency and performance optimization, service dependency analysis, and distributed context propagation.
Distributed tracing is the process of capturing metadata of requests starting from start to end ensuring logging overhead is kept minimum. A unique transaction ID is assigned to external requests & passed through the call chain of each transaction in a distributed topology & included in all messages (along with timestamp and metadata).
Unique Identifiers can be generated by using Database Ticket Server (as used by Flickr), UUID, or Twitter SnowFlake. Common Distributed Tracing tools include OpenTracing, Jaeger, Zipkin, and AppDash.
Service Mesh
Service mesh in microservices is a configurable network infrastructure layer that handles interprocess communication. This is akin to what is often termed as sidecar proxy or sidecar gateway. It provides functionalities such as:
Load Balancing
Service Discovery
Health Checks
Security
Envoy is the popular open-source proxy designed for cloud-native applications. Istio is an open platform to connect, manage, and secure microservices popular in the Kubernetes community.
In order to expose your microservices API towards client application, refer to my blog post: Microservices Design — API Gateway Pattern.
References
Microservices From Design to Deployment — Chris Richardson with Floyd Smith
Uber-microservices-distributed-tracing — https://www.infoq.com
Building Microservices — Sam Newman
Last updated