Designing a Microservices Architecture for Failure

Microservices let you achieve graceful service degradation, because components can be set up to fail separately.

We need to keep in mind that provider services can be temporarily unavailable because of broken releases, configuration changes, and other changes, since they are controlled by someone else and components move independently of each other.

Service Degradation

In a microservices architecture, services depend on each other. This is why you should minimize failures and limit their negative effect. To deal with issues from changes, you can implement change management strategies and automatic rollouts.

Health Checks and Load Balancing

To avoid issues, your load balancer should skip unhealthy instances when routing, because they cannot serve your customers’ or sub-systems’ needs.
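
As a rough illustration of skipping unhealthy instances, here is a minimal sketch of a round-robin balancer. The names `Instance`, `RoundRobinBalancer`, and the `probe` callback are hypothetical and not taken from any particular load balancer; in a real setup the probe would typically be an HTTP health endpoint call.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Instance:
    name: str
    healthy: bool = True


class RoundRobinBalancer:
    """Round-robin balancer that skips instances marked unhealthy."""

    def __init__(self, instances: List[Instance]) -> None:
        self.instances = instances
        self._next = 0  # rotation cursor

    def refresh(self, probe: Callable[[Instance], bool]) -> None:
        # Assumption: the probe stands in for a real health check call.
        for inst in self.instances:
            inst.healthy = probe(inst)

    def pick(self) -> Optional[Instance]:
        # Try each instance at most once, starting from the cursor.
        for _ in range(len(self.instances)):
            inst = self.instances[self._next]
            self._next = (self._next + 1) % len(self.instances)
            if inst.healthy:
                return inst
        return None  # no healthy instance is available
```

Returning `None` when nothing is healthy leaves the caller free to decide between failing fast and serving a fallback.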

Self-healing

In most cases, self-healing is implemented by an external system that watches the health of the instances and restarts them when they stay in a broken state for a longer period.

Implementing an advanced self-healing solution that is prepared for a delicate situation - like a lost database connection - can be tricky. In this case, you need to add extra logic to your application to handle edge cases and let the external system know that the instance does not need to be restarted immediately.
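
One possible way to express "do not restart me yet" is to report more than two health states. The sketch below is only an assumption about how this could look: `Health`, `report_health`, and `Supervisor` are hypothetical names, and the supervisor restarts an instance only after several consecutive `BROKEN` reports.

```python
from enum import Enum


class Health(Enum):
    OK = "ok"
    DEGRADED = "degraded"  # e.g. lost database connection; a restart would not help
    BROKEN = "broken"      # the instance itself is in a bad state


def report_health(process_ok: bool, db_connected: bool) -> Health:
    """Application-side logic: separate local faults from dependency faults."""
    if not process_ok:
        return Health.BROKEN
    if not db_connected:
        return Health.DEGRADED
    return Health.OK


class Supervisor:
    """External watcher that restarts only after repeated BROKEN reports."""

    def __init__(self, restart_after: int = 3) -> None:
        self.restart_after = restart_after
        self._broken_streak = 0

    def observe(self, health: Health) -> bool:
        """Return True when the instance should be restarted now."""
        if health is not Health.BROKEN:
            self._broken_streak = 0
            return False
        self._broken_streak += 1
        if self._broken_streak >= self.restart_after:
            self._broken_streak = 0  # start counting again after the restart
            return True
        return False
```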

Failover Caching

Failover caches usually use two different expiration times: a shorter one that tells how long you can use the cache in a normal situation, and a longer one that tells how long you can use the cached data during a failure.
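
The two expiration times can be sketched as follows. This is a minimal in-memory illustration, not a production cache; `FailoverCache`, `fresh_ttl`, `failover_ttl`, and the `fetch` callback are names assumed for the example.

```python
import time
from typing import Any, Callable, Dict, Tuple


class FailoverCache:
    """Cache with a short 'fresh' TTL and a longer TTL used only on failure."""

    def __init__(self, fresh_ttl: float, failover_ttl: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if failover_ttl < fresh_ttl:
            raise ValueError("failover_ttl must be >= fresh_ttl")
        self.fresh_ttl = fresh_ttl
        self.failover_ttl = failover_ttl
        self._clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}  # key -> (stored_at, value)

    def get(self, key: str, fetch: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._store.get(key)
        # Normal situation: serve from cache while the entry is fresh.
        if entry is not None and now - entry[0] <= self.fresh_ttl:
            return entry[1]
        try:
            value = fetch()
        except Exception:
            # Failure: fall back to stale data if it is not too old.
            if entry is not None and now - entry[0] <= self.failover_ttl:
                return entry[1]
            raise
        self._store[key] = (now, value)
        return value
```

The clock is injectable so the expiry behaviour can be exercised without waiting in real time.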

Retry Logic

In a distributed system, a retry in a microservices system can trigger multiple other requests or retries and start a cascading effect.

Using a unique idempotency-key for each of your transactions can help to handle retries.
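
To make the idea concrete, here is a hedged sketch of a retry loop that reuses one idempotency key across all attempts, with exponential backoff and a capped number of attempts to limit cascading retries. `PaymentService` is a hypothetical provider invented for the example; real providers differ in how they accept and store such keys.

```python
import time
import uuid
from typing import Callable, Dict


class PaymentService:
    """Hypothetical provider that deduplicates work by idempotency key."""

    def __init__(self, lose_responses: int = 0) -> None:
        self._results: Dict[str, str] = {}
        self._lose_responses = lose_responses  # simulate this many lost responses
        self.charges = 0

    def charge(self, idempotency_key: str, amount: int) -> str:
        if idempotency_key not in self._results:
            self.charges += 1  # the side effect happens only once per key
            self._results[idempotency_key] = f"charged {amount}"
        if self._lose_responses > 0:
            self._lose_responses -= 1
            raise TimeoutError("response lost")  # work done, caller never hears back
        return self._results[idempotency_key]


def call_with_retry(operation: Callable[[str], str], max_attempts: int = 3,
                    base_delay: float = 0.1,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    key = str(uuid.uuid4())  # one key shared by every attempt of this transaction
    for attempt in range(max_attempts):
        try:
            return operation(key)
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # give up instead of retrying forever
            sleep(base_delay * 2 ** attempt)  # exponential backoff
    raise AssertionError("unreachable")
```

Usage could look like `call_with_retry(lambda key: service.charge(key, 100))`; even if two responses are lost, the customer is charged once.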

Rate Limiters and Load Shedders

A fleet usage load shedder can ensure that there are always enough resources available to serve critical transactions.
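
A fleet usage load shedder can be approximated by reserving part of the capacity for critical requests. The sketch below is an assumption about one simple way to do this; `FleetUsageLoadShedder`, `capacity`, and `reserved_for_critical` are hypothetical names, and capacity is counted in concurrent requests.

```python
class FleetUsageLoadShedder:
    """Rejects non-critical requests once they would eat into reserved capacity."""

    def __init__(self, capacity: int, reserved_for_critical: int) -> None:
        if not 0 <= reserved_for_critical <= capacity:
            raise ValueError("reserved_for_critical must be within capacity")
        self.capacity = capacity
        self.non_critical_limit = capacity - reserved_for_critical
        self.in_flight = 0
        self.non_critical_in_flight = 0

    def try_acquire(self, critical: bool) -> bool:
        """Return True if the request may proceed, False if it is shed."""
        if self.in_flight >= self.capacity:
            return False  # fleet is completely full
        if not critical and self.non_critical_in_flight >= self.non_critical_limit:
            return False  # keep the remaining slots for critical traffic
        self.in_flight += 1
        if not critical:
            self.non_critical_in_flight += 1
        return True

    def release(self, critical: bool) -> None:
        """Call when a previously accepted request finishes."""
        self.in_flight -= 1
        if not critical:
            self.non_critical_in_flight -= 1
```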

Fail Fast and Independently

bulkhead pattern

Relying on timeouts to achieve the fail-fast paradigm in microservices is an anti-pattern, and you should avoid it. Instead of timeouts, you can apply the circuit-breaker pattern, which depends on the success/failure statistics of operations.

Bulkheads

By applying the bulkheads pattern, we can protect limited resources from being exhausted.
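
A bulkhead can be sketched as a bounded pool of slots per dependency, so that one slow dependency cannot consume every worker. The following is a minimal illustration using a semaphore; `Bulkhead` and `BulkheadFullError` are names assumed for the example.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class BulkheadFullError(Exception):
    """Raised when all slots of the bulkhead are in use."""


class Bulkhead:
    """Limits how many calls to one dependency may run concurrently."""

    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, fn: Callable[[], T]) -> T:
        # Reject immediately instead of queueing, so callers fail fast.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slot in bulkhead")
        try:
            return fn()
        finally:
            self._slots.release()
```

Using one `Bulkhead` per downstream dependency keeps an overloaded dependency from exhausting resources shared with the others.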

Circuit Breakers

A circuit breaker opens when a particular type of error occurs multiple times in a short period.
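
The behaviour described above can be sketched roughly as follows. This is a simplified, single-threaded illustration under assumed parameters (`max_failures`, `window`, `reset_timeout`); it treats every exception as a failure, and after `reset_timeout` it lets a trial call through to decide whether to close again.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, window: float = 10.0,
                 reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.window = window                # seconds in which failures are counted
        self.reset_timeout = reset_timeout  # seconds to stay open before a trial
        self._clock = clock
        self._failures: Deque[float] = deque()
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        now = self._clock()
        if self._opened_at is not None and now - self._opened_at < self.reset_timeout:
            raise CircuitOpenError("circuit is open")  # fail fast, skip the call
        try:
            result = fn()
        except Exception:
            self._record_failure(now)
            raise
        if self._opened_at is not None:
            # The trial call succeeded: close the circuit again.
            self._opened_at = None
            self._failures.clear()
        return result

    def _record_failure(self, now: float) -> None:
        if self._opened_at is not None:
            self._opened_at = now  # the trial call failed: stay open
            return
        self._failures.append(now)
        # Drop failures that fell out of the counting window.
        while self._failures and now - self._failures[0] > self.window:
            self._failures.popleft()
        if len(self._failures) >= self.max_failures:
            self._opened_at = now
```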

Testing for Failures

Use ChaosMonkey to test how the system behaves under failure.

Key Takeaways

Dynamic environments and distributed systems - like microservices - lead to a higher chance of failures.
Services should fail separately and degrade gracefully to improve user experience.
70% of outages are caused by changes; reverting code is not a bad thing.
Fail fast and independently. Teams have no control over their service dependencies.
Architectural patterns and techniques like caching, bulkheads, circuit breakers and rate-limiters help to build reliable microservices.