Compensation: The Often Neglected, Under-Tested, Under-Designed-For Challenge of Microservices

Compensation is the tech debt of microservices that usually gets dealt with only once the tickets start piling up. Most of us who build microservices use some or even all of a basic set of patterns, and each of those patterns has compensation challenges.

Sources of Need for Compensation

There are a variety of sources of failures that can lead to the need for compensation:

Availability of a resource
Performance of a resource
Business validation
Third-party API results
Communication failures
System errors

These same failures can also happen while compensation is executing, so they have to be considered as part of the compensation design.

Non-Reversible Actions

We also need to consider actions in an application or a microservices architecture that are difficult, or in some cases impossible, to reverse. A few examples:

Payments (Authorized and Captured)
Emails sent
Text messages sent

Once these happen in our application, we cannot undo them; we have to perform explicit refunds or send follow-up emails and text messages. We need to explicitly consider these non-reversible actions as part of our overall compensation design.

Microservices and Compensation

Engineering leaders and architects often rattle off the microservices design patterns they use and how “modern” their architecture is. A microservices architecture is not free. Its flexibility is often purchased only through the work of designing adequate compensation mechanisms. Compensation is rarely discussed as a first-class challenge for microservices architectures, but it is often the source of tickets and tech debt. What follows is a review of several microservices design patterns and the challenges each one creates for building robust compensation mechanisms that survive the various sources of failure:

Saga
Decomposition (Domain Specific API)
Asynchronous Messaging
Sidecar
Ambassador
Command Query Responsibility Segregation (CQRS)
Database Per Instance

Sagas

Architects often refer to the “compensation” functions of our favorite saga framework as the answer to the compensation challenge. But how often does compensation actually get implemented? How often is it explicitly tested? We know we should be designing and building robust compensation, but how often do we do it?

If we use datastores with ACID transactions that support two-phase commit, we could initiate the transaction as step one of the saga and commit at the end of the saga. But we might run into the following problems:

Read or update locks on data can block other transactions
If the saga is stuck on some other step, the transaction can time out or block indefinitely

Sagas whose state changes aren’t persisted also cannot be rehydrated and completed. And in some cases, if the saga was initiated by a user, completing an incomplete saga or running its compensation later may not result in the desired state unless it’s specifically designed to handle those scenarios.

What happens if our saga compensation step fails? What is our fallback? What if it times out, gets stuck, or is blocked by another transaction?

Decomposition

We can make this even more challenging if we use another design pattern such as Decomposition. Let’s say I have designed a system that has:

Customer API
Order API
Inventory API
Payment API

My Saga in my Backend for Front End API is set to work as follows:

Step 1 — Customer API
- GetOrCreateCustomer
Step 2 — Inventory API
- ReserveInventory
Step 3 — Order API
- CreateOrderWithInventoryReservation
Step 4 — Billing
- BillForOrder
Step 5 — Notification
- SendOrderConfirmation

If we fail after step 2, we need to compensate for step 2, and if we fail in step 4, we need to compensate for steps 2 and 3. But what happens if the reason for failure is that the saga itself fails? Or the system it’s a part of is rebooted or shut down in the middle of the saga?
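As a rough illustration of the ordering described above, here is a minimal in-memory sketch in Python. The `SagaStep` and `run_saga` names are hypothetical and do not come from any particular saga framework; the step names are taken from the list above. Note that the record of completed steps lives only in memory, so this sketch does nothing about the reboot question raised above.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]
    # Steps with nothing to undo (e.g. GetOrCreateCustomer) leave this as None.
    compensate: Optional[Callable[[], None]] = None


def run_saga(steps: List[SagaStep]) -> List[str]:
    """Run steps in order; on failure, undo completed steps in reverse order."""
    log: List[str] = []
    completed: List[SagaStep] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            log.append(f"failed:{step.name}")
            # Only steps that already completed are compensated, newest first.
            for done in reversed(completed):
                if done.compensate is not None:
                    done.compensate()
                    log.append(f"compensated:{done.name}")
            return log
        completed.append(step)
        log.append(f"done:{step.name}")
    return log


def _decline() -> None:
    raise RuntimeError("billing declined")


def demo_checkout_log() -> List[str]:
    """Simulate a failure in step 4 (BillForOrder) of the checkout saga."""
    def noop() -> None:
        return None

    return run_saga([
        SagaStep("GetOrCreateCustomer", noop),
        SagaStep("ReserveInventory", noop, compensate=noop),
        SagaStep("CreateOrderWithInventoryReservation", noop, compensate=noop),
        SagaStep("BillForOrder", _decline, compensate=noop),
        SagaStep("SendOrderConfirmation", noop),
    ])
```

Calling `demo_checkout_log()` shows step 3 being compensated before step 2. If a compensation function itself raises, this sketch simply propagates the error, which is exactly the gap the rest of this article is concerned with.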

Asynchronous Messaging

Let’s increase the level of difficulty and take the long-running transaction saga described earlier and make part of it asynchronous. If you have ever checked out with Amazon and later found that your payment didn’t go through, you are familiar with the fact that Amazon doesn’t process payments synchronously as part of its transaction flow.

To oversimplify, the Amazon checkout experience flows a bit more like this:

Customer Checkout Saga
- Step 1 — Customer API
  - GetOrCreateCustomer
- Step 2 — Order API
  - CreateOrder
- Step 3 — Payment API
  - QueuePaymentForOrder

Payment Authorization Saga
- Step 1 — Payment Queue Listener
  - AttemptPaymentAuthorization
- Step 2 — Fulfillment API
  - QueueOrderForFulfillment

Fulfillment Saga
- Step 1 — Fulfillment Queue Listener
  - AttemptFulfillment
- Step 2 — Inventory API
  - ReserveInventoryForOrder
- Step 3 — Picklist API
  - CreatePickListBOM
- Step 4 — Shipment API
  - CreateShipmentForBOM

Now that we have asynchronous communication patterns for all of this, the easy compensation mechanism for duplicate messages in the asynchronous portion is to use the order id to make the payment authorization message idempotent.
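A minimal sketch of that idea in Python, assuming a hypothetical `PaymentAuthorizationListener` whose gateway call is a stand-in and whose results are kept in an in-memory dictionary (a real system would persist them):

```python
from typing import Dict


class PaymentAuthorizationListener:
    """Hypothetical queue listener that authorizes each order at most once."""

    def __init__(self) -> None:
        # order_id -> authorization result; a real system would persist this.
        self._results: Dict[str, str] = {}
        self.gateway_calls = 0

    def _authorize_with_gateway(self, order_id: str, amount_cents: int) -> str:
        # Stand-in for the real payment gateway call (an assumption).
        self.gateway_calls += 1
        return f"auth-{order_id}"

    def handle(self, order_id: str, amount_cents: int) -> str:
        # A duplicate delivery returns the stored result instead of
        # authorizing the same order a second time.
        if order_id in self._results:
            return self._results[order_id]
        result = self._authorize_with_gateway(order_id, amount_cents)
        self._results[order_id] = result
        return result
```

The check and the write are not atomic here, so concurrent consumers would need something stronger, such as a unique constraint on the order id in the backing store.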

But what if the message never reaches the payment queue listener, or it’s dropped without being placed in a dead letter queue? We need to have some compensation mechanism outside of the customer checkout saga that is constantly checking for orders without payment authorizations and taking the appropriate action.

That same problem exists as we move forward with fulfillment upon completion of the saga related to payment authorization.

Asynchronous messaging introduces new, layered compensation requirements: we have to handle transmission failures through state management and rehydration of asynchronous messages.

Sidecar

To be honest, I am completely biased against sidecars as a microservices pattern. Most of the concerns that sidecars address can be handled using other patterns such as sagas, application middleware, or existing packaged software libraries. The reason I am biased against sidecars, except as a transition strategy, is the compensation challenges the sidecar design pattern adds.

The sidecar is another container, deployed alongside the application container in a Kubernetes cluster, that is added to handle cross-cutting concerns such as discovery, security, observability, and data access. While it separates concerns, it increases compensation complexity and makes root cause analysis harder.

Failure Ambiguity and Root Cause Analysis

When a failure happens in a system using sidecars, where did the failure occur?

The proxy?
The service mesh?
The application container?

We try to improve this with observability, slugs, and scopes, but what if the failure is in the observability sidecar?

Multi-Layer Retry Storms

Let’s return to our checkout saga example and say we have added a service mesh control plane with retry and a proxy with retry, and our application saga also has retry. We now have three layers of retry, and the compensation in our saga can get caught in a retry storm driven by the service mesh and the proxy that wrap our application container. These retry storms create resource exhaustion and further cascading failures that make compensation even more difficult.
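To get a feel for how quickly three layers of retry add up, here is a small worked calculation in Python. It assumes the layers are nested, so that every attempt at an outer layer triggers a full set of attempts at the layer beneath it; the attempt counts are made-up numbers, not defaults of any specific mesh or proxy.

```python
from math import prod
from typing import Sequence


def worst_case_calls(attempts_per_layer: Sequence[int]) -> int:
    """Worst-case downstream calls when every layer exhausts its attempts.

    Each entry is the total number of attempts (first try plus retries)
    that one layer makes. Nested layers multiply rather than add.
    """
    return prod(attempts_per_layer)


# Saga, proxy, and mesh each allowed 3 attempts (hypothetical settings).
CHECKOUT_WORST_CASE = worst_case_calls([3, 3, 3])
```

Under these assumptions a single failing saga step can hit the downstream service 27 times, and any compensation call routed through the same layers is amplified in the same way.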

Ambassador

The ambassador is a specialized proxy that handles external communication, often to third-party APIs. It handles concerns such as authentication, retry, circuit breaking, and translation to make the third-party API easier for the rest of the application to use. In our checkout example with the saga pattern, the payment gateway might be what we use an ambassador for.

Transaction Ambiguity

In our scenario, the payment ambassador processes the payment. But what happens if we have a network issue and the saga calling the ambassador fumbles the authorization, so the payment still exists but is now untied to the order? What if retry logic in our sidecar or in our saga causes the payment to process again?

Idempotency

Ambassadors, such as one for a payment gateway, need to translate between systems, and we need some method to ensure that translation is idempotent. In our example, we need to ensure we attempt and confirm authorization for an order once, not more than once. This gets even messier from a compensation perspective when the ambassador itself fails mid-request.

CQRS

This pattern offloads reads from transactional stores, which often improves throughput and response time for reads, but it creates a synchronization problem between the transactional stores and the query stores. Let’s revisit our simple example and add in CQRS.

Customer Checkout Saga
- Step 1 — Customer API
  - GetOrCreateCustomer
    - QueueReadSystemUpdate
- Step 2 — Order API
  - CreateOrder
    - QueueReadSystemUpdate
- Step 3 — Payment API
  - QueuePaymentForOrder

Payment Authorization Saga
- Step 1 — Payment Queue Listener
  - AttemptPaymentAuthorization
    - QueueReadSystemUpdate
- Step 2 — Fulfillment API
  - QueueOrderForFulfillment

Fulfillment Saga
- Step 1 — Fulfillment Queue Listener
  - AttemptFulfillment
- Step 2 — Inventory API
  - ReserveInventoryForOrder
- Step 3 — Picklist API
  - CreatePickListBOM
- Step 4 — Shipment API
  - CreateShipmentForBOM

Non-Existent or Out of Date Query Failures

Within our domain APIs, we can use either an asynchronous or a synchronous update to our query system. If we use the synchronous process and the command system (the system of record) completes successfully but the query system update fails, how do we compensate?

Error: reverse the process, compensate in the microservice, and indicate an error to the calling saga so it can compensate?
Synchronous retry: retry within the microservice synchronously until it succeeds or the retries are exhausted?
Asynchronous retry: queue the retry as the default, or perhaps only when synchronous retry fails?

If we use the asynchronous retry, what other systems or functions are impacted by a query system that cannot respond correctly while the command system is already in another state?
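One way to combine the synchronous and asynchronous retry options is sketched below in Python. The `update_read_model` function, the `deque` standing in for a retry queue, and the attempt limit are all assumptions made for illustration; the point is only that the caller learns whether the read side may be stale.

```python
from collections import deque
from typing import Callable, Deque


def update_read_model(
    apply_update: Callable[[], None],
    retry_queue: Deque[str],
    record_id: str,
    max_sync_attempts: int = 3,
) -> str:
    """Try the read-side update synchronously, then fall back to a queue.

    Returns "synced" or "queued" so the caller knows whether the query
    system may lag behind the command system for this record.
    """
    for _ in range(max_sync_attempts):
        try:
            apply_update()
            return "synced"
        except Exception:
            # Swallow the error and try again; a real service would log it.
            continue
    # Synchronous attempts are exhausted: hand the work to the async path.
    retry_queue.append(record_id)
    return "queued"
```

A "queued" result is precisely the window in which other services can read stale or missing data, so downstream callers still have to tolerate it.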

Zombie Sagas

What if, in our saga, the payment API needs to call the customer API, but the customer API’s read system update failed? We now have to create compensation in the payment API to handle the case where the customer API completed successfully but its read system update failed, or where its asynchronous retry has not yet succeeded.

Databases Per Instance

Databases per instance give each microservice exclusive ownership of its data through a dedicated data store, which makes it easy for microservices to have strong boundaries and data encapsulation. While this creates ownership and technology flexibility, it also creates significant compensation challenges.

Compensating Transaction Failures

In our simple example, we could use a compensating API/transaction pair that would allow us to reverse a previously successful step in a saga, but sometimes those compensating transactions can fail too. In fact, most implementations of this pattern don’t have compensating transactions.

In addition, the underlying cause for needing to run the compensation transaction can be that the system failed to the point where the compensating transaction can’t occur. This is another blind spot in the design of most compensation approaches.

Designing for Compensation

A few design decisions can help ensure you have a more robust microservices architecture. They are not tied to any particular microservices “pattern”; they are compensation related.

Assign Identity Early

Whether you are using a NoSQL store or a datastore that supports ACID, always assign an identity to objects that will have persistence and meaning at the moment they are created. By assigning identity early, we can use those ids throughout the flow of data through our microservices architecture to help us:

Determine if it was stored
Use it to determine if the id was processed to help us with idempotency
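A small Python sketch of both uses, assuming a hypothetical `Order` object and an in-memory `OrderStore`. The id is generated with the standard library's `uuid.uuid4()` when the object is constructed, before anything is persisted or sent to another service.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict


def new_id() -> str:
    return str(uuid.uuid4())


@dataclass
class Order:
    customer_id: str
    # Identity is assigned at construction, not by the datastore on insert.
    order_id: str = field(default_factory=new_id)


class OrderStore:
    """In-memory stand-in for a datastore keyed by order id."""

    def __init__(self) -> None:
        self._orders: Dict[str, Order] = {}

    def save(self, order: Order) -> bool:
        """Store the order once; return False if this id was already stored."""
        if order.order_id in self._orders:
            return False
        self._orders[order.order_id] = order
        return True

    def was_stored(self, order_id: str) -> bool:
        return order_id in self._orders
```

Because the caller holds the id before the first write, it can ask `was_stored` after a timeout or crash instead of guessing whether the write went through.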

Ensure Each Step in a Saga is Idempotent

Using the identities generated, whether they are simple ids for objects or an amalgamated key, each step of the saga should be capable of determining whether or not the step has already been completed. This is not only true for the originating saga, but also for the sagas within the microservices.

Compensation in a Saga is Idempotent

Each compensation capability also needs to be idempotent: it should be something that can be called multiple times without error and that only operates once.
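Here is one possible shape for an idempotent step and its idempotent compensation, sketched in Python around a hypothetical `InventoryReservations` service keyed by order id. Recording a "released" marker for an order that was never reserved is a design choice of this sketch, so that a reservation arriving after its own compensation is ignored.

```python
from typing import Dict


class InventoryReservations:
    """Hypothetical inventory service where order id is the idempotency key."""

    def __init__(self, stock: int) -> None:
        self.stock = stock
        self._state: Dict[str, str] = {}      # order_id -> "reserved"/"released"
        self._quantity: Dict[str, int] = {}   # order_id -> reserved quantity

    def reserve(self, order_id: str, quantity: int) -> None:
        """Saga step: reserve stock once per order id."""
        if order_id in self._state:
            return  # already reserved or already released: nothing to do
        if quantity > self.stock:
            raise ValueError("insufficient stock")
        self.stock -= quantity
        self._quantity[order_id] = quantity
        self._state[order_id] = "reserved"

    def release(self, order_id: str) -> None:
        """Compensation for reserve(); safe to call any number of times."""
        if self._state.get(order_id) != "reserved":
            # Never reserved or already released. Leave a marker so a late
            # reserve() for this order id is ignored as well.
            self._state.setdefault(order_id, "released")
            return
        self.stock += self._quantity[order_id]
        self._state[order_id] = "released"
```

Calling `reserve` twice for the same order reserves stock once, and calling `release` twice returns it once.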

Every External Microservice Call (Decomposed Domains or Asynchronous Messaging) Should Have Trackable Pre and Post Business Transaction State

Transactions may require multiple asynchronous steps to complete, and those steps may or may not be part of a single long-running saga. Storing information about the pre and post state of each asynchronous step allows the asynchronous message to be rehydrated, either in the saga or in a separate process.

For instance, we should be able to inspect the state of orders to determine whether payment authorization is complete or has failed, so that we can take appropriate action regardless of what caused the original saga to fail. A simple query for orders that are stuck in “pre authorization” or stuck in “authorization failed” is enough.
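That query might look something like the following Python sketch. The `OrderState` record, the `updated_at` timestamp, and the age threshold are assumptions added for illustration; only the two state names come from the text above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List

# State names taken from the example above.
STUCK_STATES = {"pre authorization", "authorization failed"}


@dataclass
class OrderState:
    order_id: str
    state: str
    updated_at: datetime  # hypothetical: when the state last changed


def find_stuck_orders(
    orders: Iterable[OrderState], now: datetime, max_age: timedelta
) -> List[str]:
    """Return ids of orders that have sat in a stuck state longer than max_age."""
    return [
        o.order_id
        for o in orders
        if o.state in STUCK_STATES and now - o.updated_at > max_age
    ]
```

The age threshold keeps the query from flagging orders that are simply still in flight.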

Asynchronous and Domain Microservices Should Have Temporal Restart/Compensation

Using the same pre and post state information we should include scheduled monitors to restart or compensate for incomplete state transactions in our microservices architecture. By creating a fall back mechanism for asynchronous

Ambassadors Should Create Their Own Specialized Compensation

Ambassadors create a unique challenge in compensation. In my experience, the best design is an event-sourced system that relies on a combination of event sourcing, a confirmation call from the caller, and a scheduled queue message to automatically unwind incomplete business transactions:

The ambassador prepares and stores the request with a state as an initial event.
The ambassador creates a delayed queue message, to be processed N minutes later, which will handle reversing the entire process for the event. In the case of a payment in a payment gateway, this may be the reversal of an authorization.
The ambassador executes the request, then obtains and stores the result.
The ambassador notifies the caller of the result of the request.
The caller of the ambassador completes its work and, with another call, notifies the ambassador that it has correctly updated its own state with the information from the ambassador.
The ambassador updates its event store with the information that the request has been confirmed.

Separately, the ambassador implements a queue listener that knows how to compensate for the error and uses the following flow:

It uses the ambassador event store to determine the state of the request. Using our payment example:

a. Is the payment authorized?
    i.  Did the original caller fail to acknowledge that it recorded the capture?
        1.  Reverse the authorization
        2.  Update the event state as reversed
        3.  On successful completion, mark the message complete
b. Otherwise, update the authorization status as failed if not already failed
    i.  Mark the message complete

The goal here is to be able to correctly track the state of the ambassador’s third-party interaction as either complete and tracked in the calling system, or complete but abandoned in the ambassador’s state. We start from an assumption of failure that can always be reversed, backed by a compensation system that does not depend on the saga’s compensation being successful.
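The flow above can be sketched roughly as follows in Python. Everything here is a simplified assumption: the `PaymentAmbassador` class, the in-memory event store and delayed queue, and the two callables standing in for the payment gateway. It is meant to show the shape of the design, not a production implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class PaymentAmbassador:
    authorize_call: Callable[[str], bool]   # stand-in for gateway authorize
    reverse_call: Callable[[str], None]     # stand-in for gateway reversal
    events: Dict[str, List[str]] = field(default_factory=dict)
    # (delay in minutes, order id) pairs standing in for a delayed queue.
    delayed_queue: List[Tuple[int, str]] = field(default_factory=list)

    def authorize(self, order_id: str, unwind_after_minutes: int = 15) -> bool:
        # Store the initial event and schedule the unwind message BEFORE
        # calling the gateway, so failure is the default assumption.
        self.events[order_id] = ["prepared"]
        self.delayed_queue.append((unwind_after_minutes, order_id))
        ok = self.authorize_call(order_id)
        self.events[order_id].append("authorized" if ok else "failed")
        return ok

    def confirm(self, order_id: str) -> None:
        # The caller reports that it recorded the result in its own state.
        self.events[order_id].append("confirmed")

    def handle_unwind(self, order_id: str) -> str:
        """Queue listener for the delayed message; safe to run repeatedly."""
        history = self.events[order_id]
        if "confirmed" in history or "reversed" in history:
            return "nothing to do"
        if "authorized" in history:
            # Authorized but never confirmed by the caller: reverse it.
            self.reverse_call(order_id)
            history.append("reversed")
            return "reversed"
        if "failed" not in history:
            history.append("failed")
        return "marked failed"
```

Because the unwind message is scheduled before the gateway is called, an authorization that the caller never confirms is reversed even if the saga and its own compensation never run. This sketch does not address the case where the gateway call times out after the authorization actually succeeded on the third-party side.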

Sidecar Compensation

Sidecar compensation is incredibly difficult to deal with. Unless there is no alternative, avoid sidecars in favor of middleware or packaged software assets. In my opinion, they should only be used as a bridge to a better overall architecture, one in which middleware or libraries handle functions such as proxying, observability, or security.
