Fault Tolerance with Resilience4j
Security with Keycloak, OAuth2, and OpenID Connect
Fault Tolerance in Microservices
In a distributed system, issues can happen at any time. Instead of showing errors to the client, we should return a result, either from cache or default values, based on the app logic. This is exactly where fault tolerance patterns become important.
For that reason, we will be using Resilience4j, a library that is widely used in distributed architectures to handle fault tolerance.
It provides several resilience design patterns:
- Circuit Breaker
- Retry
- Rate Limiter
- Time Limiter
- Bulkhead
- Cache
More on this here:
https://resilience4j.readme.io/docs/getting-started
Circuit Breaker Design Pattern
In this pattern, when service A calls service B and everything is working correctly, the circuit remains closed, and the communication happens normally without interruption.
However, when a problem occurs, the circuit breaker intervenes to manage this communication. It acts as a proxy between service A and service B, controlling the calls and applying the rules defined by the design pattern to handle the failure properly.
Rate Limiter Pattern
When we expose an API, we may want to limit how often each client can call it. We can identify clients with an API key and cap the number of requests each one may send per hour, per day, per week, etc. This helps us respect the capacity of our system.
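To make the idea concrete, here is a minimal fixed-window sketch in plain Java. It is not the Resilience4j RateLimiter API; the class name, the `tryAcquire` method, and the explicit `nowMillis` parameter (passed in to keep the example deterministic) are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical fixed-window limiter: at most `limit` calls per API key per window.
class FixedWindowRateLimiter {
    private final int limit;
    private final long windowMillis;
    // apiKey -> {window start time, calls counted in that window}
    private final Map<String, long[]> windows = new HashMap<>();

    FixedWindowRateLimiter(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    // Returns true if the call is allowed, false if the key has used up its quota.
    synchronized boolean tryAcquire(String apiKey, long nowMillis) {
        long[] window = windows.computeIfAbsent(apiKey, k -> new long[] {nowMillis, 0});
        if (nowMillis - window[0] >= windowMillis) {
            // The previous window is over: start a new one.
            window[0] = nowMillis;
            window[1] = 0;
        }
        if (window[1] >= limit) {
            return false;
        }
        window[1]++;
        return true;
    }

    public static void main(String[] args) {
        FixedWindowRateLimiter limiter = new FixedWindowRateLimiter(2, 1000);
        long now = System.currentTimeMillis();
        for (int i = 1; i <= 3; i++) {
            System.out.println("call " + i + " allowed: " + limiter.tryAcquire("demo-key", now));
        }
    }
}
```

A rejected call would typically be answered with an error such as HTTP 429, but that choice depends on the app.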
Time Limiter Design Pattern
A remote service can take a long time to respond, so we configure a timeout. Once we send the request, we wait for a defined duration, and if the time limit is surpassed, we stop waiting and use an alternative option (fallback).
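The same idea can be sketched with the JDK alone, using `CompletableFuture.completeOnTimeout`. This is only an illustration of the pattern, not the Resilience4j TimeLimiter; the helper name `callWithTimeout` and the two stand-in services are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

class TimeLimiterSketch {
    // Runs the call asynchronously; if it has not finished after timeoutMillis,
    // stops waiting and returns the fallback value instead.
    // Note: the abandoned call keeps running in the background.
    static <T> T callWithTimeout(Supplier<T> remoteCall, long timeoutMillis, T fallback) {
        return CompletableFuture.supplyAsync(remoteCall)
                .completeOnTimeout(fallback, timeoutMillis, TimeUnit.MILLISECONDS)
                .join();
    }

    // Stand-in for a slow remote service (hypothetical).
    static String slowService() {
        try {
            Thread.sleep(500);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "real data";
    }

    // Stand-in for a service that answers immediately (hypothetical).
    static String fastService() {
        return "real data";
    }

    public static void main(String[] args) {
        System.out.println(callWithTimeout(TimeLimiterSketch::slowService, 50, "default data"));
        System.out.println(callWithTimeout(TimeLimiterSketch::fastService, 2000, "default data"));
    }
}
```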
Retry Design Pattern
It is a mechanism where, when a service calls another service and there is an issue, it waits for a defined duration and then tries again. This process can be configured with a number of attempts and a waiting time before we finally consider it a real failure.
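As a rough sketch of that mechanism in plain Java (not the Resilience4j Retry API; `retry`, `maxAttempts`, and `waitMillis` are names chosen for this example):

```java
import java.util.function.Supplier;

class RetrySketch {
    // Tries the call up to maxAttempts times, waiting waitMillis between attempts.
    // If every attempt fails, the last exception is rethrown: a "real" failure.
    static <T> T retry(Supplier<T> call, int maxAttempts, long waitMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                lastFailure = e;
                if (attempt < maxAttempts) {
                    pause(waitMillis);
                }
            }
        }
        throw lastFailure;
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting to retry", e);
        }
    }

    // Demo with a hypothetical flaky call that fails twice, then succeeds.
    static String demo() {
        int[] calls = {0};
        String result = retry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("service unavailable");
            }
            return "ok";
        }, 5, 10);
        return result + " after " + calls[0] + " attempts";
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```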
Bulkhead Pattern
This pattern is used to control concurrent requests. If we don't want the number of simultaneous calls to surpass a certain limit, we configure a maximum number of concurrent requests to protect the system from overload.
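One common way to picture this is a semaphore that hands out a fixed number of permits. The sketch below uses only the JDK and is not the Resilience4j Bulkhead API; the class and method names are assumptions for illustration.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

class BulkheadSketch {
    private final Semaphore permits;

    BulkheadSketch(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    // Runs the task only if a permit is free; otherwise returns the rejection result right away.
    <T> T call(Supplier<T> task, Supplier<T> rejected) {
        if (!permits.tryAcquire()) {
            return rejected.get();
        }
        try {
            return task.get();
        } finally {
            permits.release();
        }
    }

    // Demo: with a single permit, a call made while another one is still running is rejected.
    static String demo() {
        BulkheadSketch bulkhead = new BulkheadSketch(1);
        return bulkhead.call(
                () -> "outer+" + bulkhead.call(() -> "inner", () -> "rejected"),
                () -> "rejected");
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```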
Cache Pattern
Resilience4j also offers a cache. When the cache is applied to a method, the result of the call is saved there. With it, we can easily return cached data, and it can also work together with the circuit breaker.
Let's start with the circuit breaker:
The idea is that when service A calls service B without a circuit breaker and there is an issue, an exception is returned. If we do nothing else, we have to manage that exception within the app, for example by showing an error to the user.
The solution is to use a circuit breaker. It is a proxy: when we want to call service B, the call goes through the circuit breaker. The circuit breaker tries the call first, and when we get a response without error, it passes it on normally. But if there is an issue, an exception happens, and in that case the circuit breaker manages the state.
It starts in the closed state (no issue). When the failures reach a threshold, it goes to the open state. While it is open, whenever we ask it for a result from the service, it returns a fallback result (for example from a cache) instead. During that time, as long as the wait duration has not elapsed, it does not call the service.
Once the wait duration has elapsed, it goes to the half-open state. This is where it lets a limited number of calls through to see whether the service still has issues. If the failures among those trial calls reach the threshold, it goes back to open and waits again. Otherwise, if the calls work and the failures stay below the threshold, it goes back to the closed state.
So overall it manages 3 states: closed, open, half-open.
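The three states can be sketched as a small state machine in plain Java. This is a simplified illustration, not how Resilience4j is implemented: it counts consecutive failures instead of a failure rate, and the class name, constructor parameters, and explicit `nowMillis` argument are assumptions made to keep the example short and deterministic.

```java
import java.util.function.Supplier;

class CircuitBreakerSketch {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold; // consecutive failures that open the circuit
    private final long waitMillis;      // how long to stay OPEN before trying again
    private final int trialCalls;       // successful calls needed in HALF_OPEN to close
    private State state = State.CLOSED;
    private int failures;
    private int trialSuccesses;
    private long openedAt;

    CircuitBreakerSketch(int failureThreshold, long waitMillis, int trialCalls) {
        this.failureThreshold = failureThreshold;
        this.waitMillis = waitMillis;
        this.trialCalls = trialCalls;
    }

    State state() {
        return state;
    }

    synchronized <T> T call(Supplier<T> remoteCall, Supplier<T> fallback, long nowMillis) {
        if (state == State.OPEN) {
            if (nowMillis - openedAt < waitMillis) {
                return fallback.get();   // still open: the service is not called at all
            }
            state = State.HALF_OPEN;     // wait is over: let trial calls through
            trialSuccesses = 0;
        }
        try {
            T result = remoteCall.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure(nowMillis);
            return fallback.get();
        }
    }

    private void onSuccess() {
        if (state == State.HALF_OPEN) {
            if (++trialSuccesses >= trialCalls) {
                state = State.CLOSED;
                failures = 0;
            }
        } else {
            failures = 0;
        }
    }

    private void onFailure(long nowMillis) {
        // Any failure during the trial phase, or too many failures while closed, opens the circuit.
        if (state == State.HALF_OPEN || ++failures >= failureThreshold) {
            state = State.OPEN;
            openedAt = nowMillis;
            failures = 0;
        }
    }

    // Demo: two failures open the circuit, then two successful trial calls close it again.
    static String demo() {
        CircuitBreakerSketch cb = new CircuitBreakerSketch(2, 1000, 2);
        Supplier<String> down = () -> { throw new IllegalStateException("service B is down"); };
        Supplier<String> up = () -> "live data";
        Supplier<String> fallback = () -> "cached data";
        StringBuilder trace = new StringBuilder();
        cb.call(down, fallback, 0);    trace.append(cb.state()).append(' ');
        cb.call(down, fallback, 10);   trace.append(cb.state()).append(' ');
        cb.call(up, fallback, 500);    trace.append(cb.state()).append(' ');
        cb.call(up, fallback, 2000);   trace.append(cb.state()).append(' ');
        cb.call(up, fallback, 2010);   trace.append(cb.state());
        return trace.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```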
In your project, you will have to add a dependency:
- spring-cloud-starter-circuitbreaker-resilience4j
You will also need to add the annotation that configures the circuit breaker.
In our billing service, if we want a bill, the service needs to call the customer service via OpenFeign, which sends the request to the customer service and retrieves the customer data.
To manage this communication through a circuit breaker, we annotate the method that calls the remote service with @CircuitBreaker, which expects the following config: name and fallbackMethod. This method calls the service, and when there is a failure, it does not return the exception to the user; it calls the fallback method instead.
The fallback method accepts the path variable and the exception.
Let's see this with an example:
The exception is a mandatory parameter:
When the circuit breaker calls the method, it catches the exception. When it then calls the fallback method, it passes the original parameters and the exception, so that we know what kind of error happened and can read any extra data from the exception.
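To illustrate only the mechanism (the same parameter plus the caught exception being handed to the fallback), here is a plain Java sketch without Spring or Resilience4j. The helper `callWithFallback`, the method names, and the default customer text are hypothetical; in the real project the annotation does this wiring for us.

```java
import java.util.function.BiFunction;
import java.util.function.Function;

class FallbackSketch {
    // Calls the remote method; on failure, hands the same argument plus the exception to the fallback.
    static <A, R> R callWithFallback(A argument,
                                     Function<A, R> remoteCall,
                                     BiFunction<A, Exception, R> fallback) {
        try {
            return remoteCall.apply(argument);
        } catch (Exception e) {
            return fallback.apply(argument, e);
        }
    }

    // Stand-in for the call to the customer service (hypothetical; it always fails here).
    static String findCustomerById(Long id) {
        throw new IllegalStateException("customer-service unreachable");
    }

    // Fallback: same parameter as the real method, plus the exception that was caught.
    static String getDefaultCustomer(Long id, Exception cause) {
        return "Customer{id=" + id + ", name='Not available', reason='" + cause.getMessage() + "'}";
    }

    static String demo() {
        return callWithFallback(1L, FallbackSketch::findCustomerById, FallbackSketch::getDefaultCustomer);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```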
Dummy example:
The essential thing is: with retry, if there is an error, the call is retried, and when the maximum number of retries is exceeded, it calls this local method (the fallback), which returns a result in place of the remote call.
Configuration
You will need Actuator, which we saw in part 1. You will also need to enable the circuit breaker health indicator.
If you want to register the circuit breaker in Actuator's health endpoint, and you want /health to return the circuit breaker details, then show-details should be set to always.
For the circuit breaker configuration, there are default values, but we can customize them. To do that, we configure it like this (where customerService is the name of the circuit breaker):
resilience4j.circuitbreaker.instances.customerService.register-health-indicator=true
This means: we want to register the health state of this circuit breaker in Actuator, so that when you call /health you will see the circuit breaker state.
We also have other params like the event consumer buffer size: it sets the size of the in-memory buffer that keeps the most recent circuit breaker events so they can be consulted later. The decision to move to open, closed, or half-open is based on the sliding window described below.
When would it go from closed to open? We use a threshold:
resilience4j.circuitbreaker.instances.customerService.failure-rate-threshold=20
When 20% or more of the recorded calls fail, we move to the open state.
The minimum number of calls:
resilience4j.circuitbreaker.instances.customerService.minimum-number-of-calls=10
At least 10 calls must be recorded before the failure rate is calculated, so the state cannot change before that.
resilience4j.circuitbreaker.instances.customerService.automatic-transition-from-open-to-half-open-enabled=true
This means that once the wait duration has elapsed, the circuit breaker moves from open to half-open automatically, without waiting for a new call.
resilience4j.circuitbreaker.instances.customerService.wait-duration-in-open-state=5s
It waits in the open state for 5 seconds before going to half-open.
resilience4j.circuitbreaker.instances.customerService.permitted-number-of-calls-in-half-open-state=5
When we are in the half-open state, this is how many calls we allow.
resilience4j.circuitbreaker.instances.customerService.sliding-window-size=10
This says that we have a window of 10 requests.
resilience4j.circuitbreaker.instances.customerService.sliding-window-type=count_based
This means that state decisions are based on the number of requests (count-based). The other type is time-based.
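To see how the sliding window size, the minimum number of calls, and the failure rate threshold interact, here is a simplified count-based window in plain Java. It is a sketch of the idea rather than Resilience4j's implementation; the class and method names are assumptions. It returns -1 while too few calls have been recorded, which is also what the health endpoint reports as "failureRate": "-1.0%".

```java
class CountBasedWindowSketch {
    private final boolean[] failed;           // ring buffer holding the last N outcomes
    private final int minimumNumberOfCalls;
    private final float failureRateThreshold; // in percent
    private int recorded;                     // calls currently in the window
    private int next;                         // next slot to overwrite
    private int failures;                     // failed calls currently in the window

    CountBasedWindowSketch(int slidingWindowSize, int minimumNumberOfCalls, float failureRateThreshold) {
        this.failed = new boolean[slidingWindowSize];
        this.minimumNumberOfCalls = minimumNumberOfCalls;
        this.failureRateThreshold = failureRateThreshold;
    }

    void record(boolean callFailed) {
        if (recorded == failed.length) {
            // Window is full: the oldest outcome drops out.
            if (failed[next]) {
                failures--;
            }
        } else {
            recorded++;
        }
        failed[next] = callFailed;
        if (callFailed) {
            failures++;
        }
        next = (next + 1) % failed.length;
    }

    // Failure rate in percent, or -1 while fewer than minimumNumberOfCalls were recorded.
    float failureRate() {
        if (recorded < minimumNumberOfCalls) {
            return -1f;
        }
        return 100f * failures / recorded;
    }

    boolean shouldOpen() {
        float rate = failureRate();
        return rate >= 0 && rate >= failureRateThreshold;
    }

    public static void main(String[] args) {
        // Window of 10 calls, at least 5 calls before deciding, threshold of 20%.
        CountBasedWindowSketch window = new CountBasedWindowSketch(10, 5, 20f);
        for (int i = 0; i < 4; i++) {
            window.record(false);
        }
        System.out.println("after 4 calls: " + window.failureRate());
        window.record(true);
        System.out.println("after 5 calls: " + window.failureRate() + ", open? " + window.shouldOpen());
    }
}
```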
And for the retry configuration:
resilience4j.retry.instances.retrySearchCustomers.max-attempts=15
The maximum number of attempts, counting the first call.
resilience4j.retry.instances.retrySearchCustomers.wait-duration=5s
How long we wait before trying again.
Let's get hands-on:
Reminder: what we want as a client is to send a request to the gateway to get the first bill. The gateway retrieves the address of the billing service from the discovery service. Once it has it, it calls the billing service to get the bill. The billing service then calls the customer service to get the data for that client, and the inventory service to get the product data, before generating the response and sending it back to the client.
If the customer service isn't working, how would we get the customer's data? This is where the circuit breaker comes in handy.
When the service fails and there is an error, we don't want to just display an error. We want to retrieve the data some other way, for example from a cache, or return default values depending on the app logic.
Start the microservices in this order:
- discovery service (eureka service)
- config service
- customer service
- inventory service
- billing service
- spring cloud gateway
Go to your billing service pom.xml and add the following dependencies:
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-circuitbreaker-resilience4j</artifactId>
</dependency>
Once you reload the Maven project, you will get access to the circuit breaker annotation. In your Feign code, target the customer REST client and let's add the configuration like this:
Now onto configuring the circuit breaker. This should be done at the config repo level, but for demonstration purposes we will use the billing service application.properties.
management.health.circuitbreakers.enabled=true
management.endpoint.health.show-details=always
resilience4j.circuitbreaker.instances.customerServiceCB.register-health-indicator=true
resilience4j.circuitbreaker.instances.customerServiceCB.event-consumer-buffer-size=15
resilience4j.circuitbreaker.instances.customerServiceCB.failure-rate-threshold=20
resilience4j.circuitbreaker.instances.customerServiceCB.minimum-number-of-calls=5
resilience4j.circuitbreaker.instances.customerServiceCB.automatic-transition-from-open-to-half-open-enabled=true
resilience4j.circuitbreaker.instances.customerServiceCB.wait-duration-in-open-state=5s
resilience4j.circuitbreaker.instances.customerServiceCB.permitted-number-of-calls-in-half-open-state=5
resilience4j.circuitbreaker.instances.customerServiceCB.sliding-window-size=10
resilience4j.circuitbreaker.instances.customerServiceCB.sliding-window-type=count_based
To check the state of your services and of the circuit breaker, open: http://localhost:8888/billing-service/actuator/health
As we can see, the status of the circuit breaker is still CLOSED because the service is working.
"circuitBreakers": {
  "status": "UP",
  "details": {
    "customerServiceCB": {
      "status": "UP",
      "details": {
        "failureRate": "-1.0%",
        "failureRateThreshold": "50.0%",
        "slowCallRate": "-1.0%",
        "slowCallRateThreshold": "100.0%",
        "bufferedCalls": 1,
        "slowCalls": 0,
        "slowFailedCalls": 0,
        "failedCalls": 0,
        "notPermittedCalls": 0,
        "state": "CLOSED"
      }
    }
  }
}
This time, we will stop the customer service to explicitly create a fault and be able to test our circuit breaker.
Open: http://localhost:8888/billing-service/actuator/health
Now open: http://localhost:8888/billing-service/api/bills and keep refreshing until you reach the threshold. The circuit breaker opens, and after the wait duration it moves on to half-open. Recheck the Actuator health: you will see that the state of customerServiceCB has changed to half-open ("state": "HALF_OPEN").
"circuitBreakers": {
  "status": "UNKNOWN",
  "details": {
    "customerServiceCB": {
      "status": "CIRCUIT_HALF_OPEN",
      "details": {
        "failureRate": "-1.0%",
        "failureRateThreshold": "50.0%",
        "slowCallRate": "-1.0%",
        "slowCallRateThreshold": "100.0%",
        "bufferedCalls": 3,
        "slowCalls": 0,
        "slowFailedCalls": 0,
        "failedCalls": 2,
        "notPermittedCalls": 0,
        "state": "HALF_OPEN"
      }
    }
  }
}
That means we're now returning the default data. We could also retrieve data from a cache by adding the @Cacheable annotation, which starts caching your data when you consult it from your service, so that when the service is down you can retrieve the cached result. But in our case, since that is a different design choice, we will keep using our default values.
Now let's start the customer service and then restart the billing service. After that, open the Actuator /health again; it should have gone back to the CLOSED state:
"circuitBreakers": {
  "status": "UP",
  "details": {
    "customerServiceCB": {
      "status": "UP",
      "details": {
        "failureRate": "-1.0%",
        "failureRateThreshold": "50.0%",
        "slowCallRate": "-1.0%",
        "slowCallRateThreshold": "100.0%",
        "bufferedCalls": 19,
        "slowCalls": 0,
        "slowFailedCalls": 0,
        "failedCalls": 0,
        "notPermittedCalls": 0,
        "state": "CLOSED"
      }
    }
  }
}
More on Resilience4j: https://resilience4j.readme.io/docs/circuitbreaker
Security:
When you have a microservices architecture, you need to follow best practices for securing distributed services. Generally, we need an authentication and authorization system, such as OAuth2 and OpenID Connect, which commonly rely on JWTs. Keycloak is one of the most widely used tools built on OAuth2 and OpenID Connect.
A user from the UI goes through a gateway. We call an authentication service, retrieve the username and password, and the service verifies the data. It then generates a JWT, which is returned to the web or mobile application. Every time we send a request to a service, we include the JWT, which represents the user session. The JWT is sent to the microservice, which verifies the token signature to retrieve the user session. Based on that, we determine whether the user has the right to access the requested data or not.
To create an authentication system, we have two approaches: stateful and stateless authentication. In the first solution, the session data is saved on the server. In the second solution, the session data is stored inside a token, which is delivered to the client. The session is contained within that token and is sent with every request.
Stateful:
The user sends his login data, and the service looks up his identity in the database and retrieves his data and role. Then we verify that the password is correct, and we save the session in memory, including his username and role, to know what actions he is allowed to perform.
Once the session is created, it has a session ID, usually a unique UUID. Since the session is stored on the server, we send the session ID to the client in the HTTP response, and it is saved as a cookie. So in a stateful solution, the session is stored on the server, and the user stores the session ID on the client side.
Whenever the user sends a request from his machine, the browser automatically sends the cookies, including the session ID. The server then looks at the list of sessions, checks if the session is still open, and based on that identifies the user and his role. To determine authorization, if the user has the right to perform the action, we respond with status code 200; otherwise, we return 403.
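The stateful flow can be sketched with an in-memory map keyed by session ID. This is a plain Java illustration, not Spring Security; the class and method names are hypothetical, and returning 401 for an unknown or closed session is an assumption added here (the text above only mentions 200 and 403).

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

class SessionStoreSketch {
    // sessionId -> {username, role}; this map is the server-side state.
    private final Map<String, String[]> sessions = new ConcurrentHashMap<>();

    // Called after the password has been verified; the returned ID goes to the client as a cookie.
    String login(String username, String role) {
        String sessionId = UUID.randomUUID().toString();
        sessions.put(sessionId, new String[] {username, role});
        return sessionId;
    }

    void logout(String sessionId) {
        sessions.remove(sessionId);
    }

    // Looks the session up from the cookie value and checks the role.
    int authorize(String sessionIdCookie, String requiredRole) {
        String[] session = sessionIdCookie == null ? null : sessions.get(sessionIdCookie);
        if (session == null) {
            return 401; // assumption: no open session means "not authenticated"
        }
        return session[1].equals(requiredRole) ? 200 : 403;
    }

    public static void main(String[] args) {
        SessionStoreSketch store = new SessionStoreSketch();
        String cookie = store.login("alice", "ADMIN");
        System.out.println(store.authorize(cookie, "ADMIN"));
        System.out.println(store.authorize(cookie, "MANAGER"));
    }
}
```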
Stateful authentication is very practical and secure for server-rendered HTML applications. However, when the backend and frontend are separated, it is less convenient. This is where we use stateless authentication.
Stateless:
The user gives a username and password, we go to the database to retrieve the user's data, and if the password is correct, we generate a token. It is a string of characters in which we store the user session. One of the most popular token formats is JWT (JSON Web Token). It is in JSON format and contains information such as the expiration date, role, and username. These are called claims.
We then generate a hash, called a digital signature, which guarantees that this token cannot be modified. If someone tries to change the token, we will detect it using this signature.
Once the token is generated, we deliver it to the client in the response. The client application is responsible for storing the token, which contains the user session, for example in session storage.
Now I am authenticated, and the server does not have to save anything. That is why it is called stateless, meaning no server-side memory. You ask me to authenticate, I authenticate, you give me the session (in the form of a token), and I store it. The server does not remember anything and does not store the session at its level.
Each time the client sends a request, it must include an Authorization header containing the token. When the server receives the token, it verifies the signature. It must know the key (the shared secret or the public key, depending on the algorithm) to verify the signature. If the signature is correct, we retrieve the user session from the token. Based on the role inside the token, we determine whether the user has the right to perform the operation. If not, the response is 403.
In application security, we distinguish between stateful and stateless authentication. Stateless authentication is widely used in distributed systems.
The JWT:
JWT is very popular and widely used. It is a standard that defines a compact and self-contained token: whoever receives it can read the data it carries without consulting a third party. It has 3 parts: header, payload, signature.
- The header contains the algorithm used to calculate the signature. For example, RSA uses a private and a public key: the private key is used to sign, and the public key is used to verify.
- The payload is a JSON object that contains a group of claims.
Standard (registered) claims:
- subject (sub): who the token is about, typically the username or user ID
- issuer (iss): address of the app that generated the token
- audience (aud): the recipients the token is intended for
- issued at (iat): when the token was generated
- expiration (exp): when the token will expire
- not before (nbf): when the token becomes valid
Private claims: you can add whatever other data you want, depending on the app needs. So the user session here is the payload.
- The signature is calculated through the algorithm: we take the header + payload and give them to the algorithm with the private key to generate the signature. The signature depends on the header and payload, so if anything changes, the signature won't match.
So a JWT is: header.payload.signature. When the user sends the JWT to the microservice, the microservice has to verify the signature.
example:
eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiYWRtaW4iOnRydWUsImlhdCI6MTUxNjIzOTAyMn0.KMUFsIDTnFmyG3nMiGM6H9FNFUROf3wh7SmqJp-QV30
We decode the header eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9 to know the algorithm. Here it is HS256, a symmetric algorithm, so the verifier uses the shared secret with the algorithm on header + payload to recalculate the signature, and compares the result with the signature received in the JWT to know whether the data has been tampered with. With an asymmetric algorithm such as RSA, the verifier instead checks the received signature using the public key.
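Here is a small JDK-only sketch of those steps for HS256: decoding a part, signing header + payload, and verifying by recomputing the signature. It is meant for understanding, not as a replacement for a JWT library; the class name and the secret `demo-secret` are assumptions (the secret used for the example token above is not given in this article), and checks such as expiration are left out.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

class JwtSketch {
    // Header and payload are Base64URL-encoded JSON.
    static String decodePart(String part) {
        return new String(Base64.getUrlDecoder().decode(part), StandardCharsets.UTF_8);
    }

    // HS256: HMAC-SHA256 over "header.payload", Base64URL-encoded without padding.
    static String sign(String headerAndPayload, byte[] secret) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret, "HmacSHA256"));
            byte[] signature = mac.doFinal(headerAndPayload.getBytes(StandardCharsets.UTF_8));
            return Base64.getUrlEncoder().withoutPadding().encodeToString(signature);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Recomputes the signature and compares it with the one received (constant-time comparison).
    static boolean verify(String jwt, byte[] secret) {
        String[] parts = jwt.split("\\.");
        if (parts.length != 3) {
            return false;
        }
        String expected = sign(parts[0] + "." + parts[1], secret);
        return MessageDigest.isEqual(
                expected.getBytes(StandardCharsets.UTF_8),
                parts[2].getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(decodePart("eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9"));
        byte[] secret = "demo-secret".getBytes(StandardCharsets.UTF_8); // hypothetical secret
        String headerAndPayload = "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIn0";
        String jwt = headerAndPayload + "." + sign(headerAndPayload, secret);
        System.out.println("valid: " + verify(jwt, secret));
    }
}
```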
for more details: https://www.jwt.io/introduction#why-use-json-web-tokens
Vulnerability types: CSRF
This type of vulnerability forces a user who is authenticated in an app to perform an operation without realizing it. How?
Let's say our client wants to access his bills. He signs in using his credentials, the app looks up his username and password in the database and compares them with what he entered. If they match, it creates a session. The session is now open and has a session ID. This session ID is sent in the HTTP response and stored in cookies. The issue here is the cookies.
Let's say you're authenticated on your bank's site, and a third party wants to take advantage of your cookies without ever seeing them. They send you a URL disguised as a coupon email, for example. If you click on it, the page sends a request to the bank server to transfer money from your account to someone else's account.
The server verifies whether the session is open. Since the cookies are automatically sent with the request, the server finds the session ID, checks that you are authenticated and have the right to perform the operation, and then executes the request, even though you did not intentionally make it.
More details: https://developer.mozilla.org/en-US/docs/Web/Security/Attacks/CSRF
So how do we mitigate this in a stateful solution?
Whenever the client requests a page from the server, the server should generate a CSRF token. It stores this token in the session and also sends it inside the form as a hidden field. This hidden field contains the CSRF token and ensures that the form comes from the same server.
Let's say you submit a form. How do we know it's not a CSRF attack? When the request is received, the server reads the hidden field, retrieves the CSRF token from the form, and compares it with the one stored in the session. If they match, we know the request comes from the same legitimate page. That's why, when you work with Spring Boot, it automatically adds this CSRF field.
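The token check itself is small. Below is a JDK-only sketch of it, separate from what Spring does internally; the class name, the map keyed by session ID, and the 32-byte token length are assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.SecureRandom;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

class CsrfSketch {
    private static final SecureRandom RANDOM = new SecureRandom();
    // sessionId -> CSRF token stored on the server side
    private final Map<String, String> tokenBySession = new HashMap<>();

    // Called when the page is rendered: the token goes into the session and into a hidden form field.
    String issueToken(String sessionId) {
        byte[] bytes = new byte[32];
        RANDOM.nextBytes(bytes);
        String token = Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
        tokenBySession.put(sessionId, token);
        return token;
    }

    // Called when the form is submitted: the hidden field must match the token kept in the session.
    boolean isValid(String sessionId, String submittedToken) {
        String expected = tokenBySession.get(sessionId);
        if (expected == null || submittedToken == null) {
            return false;
        }
        return MessageDigest.isEqual(
                expected.getBytes(StandardCharsets.UTF_8),
                submittedToken.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        CsrfSketch csrf = new CsrfSketch();
        String token = csrf.issueToken("session-1");
        System.out.println("legitimate form: " + csrf.isValid("session-1", token));
        System.out.println("forged request: " + csrf.isValid("session-1", "guessed-token"));
    }
}
```

A forged request still carries the session cookie, but the attacker's page has no way to read the hidden field, so the comparison fails.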
But when we use a stateless solution where the token is sent in the Authorization header, we don't use this mechanism because we don't rely on cookies. This vulnerability mainly happens because of cookies in stateful authentication.
A session also has a timeout. But as long as your session is still open, if a malicious app on your machine sends a request to the server using your cookies, it will be using your active session. The server will not detect it as malicious because it only checks the session ID.
Reminder about cross-origin resource sharing (CORS)
When the browser loads a page from a frontend server and that page wants to send a request (for example using the POST method) to another domain, the browser does not allow it directly because the origin is different.
Before sending the actual request, the browser sends a preflight request using the HTTP method OPTIONS to ask the server which types of communication are allowed. This happens for requests that are not considered simple, for example when they carry a JSON body or an Authorization header.
The server responds with headers such as:
- Access-Control-Allow-Origin: if this header is *, it means requests from any domain are allowed. Otherwise, if the domain is not allowed, the browser will show a CORS error.
- Access-Control-Allow-Headers: this lists which headers are allowed in the request.
So the server must explicitly allow the origin and headers. If you set *, all kinds of headers and domains are allowed.
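On the server side, answering a preflight request boils down to deciding which headers to send back. Here is a plain Java sketch of that decision, independent of any framework; the class name, the allowed methods, and the allowed headers are assumptions for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class CorsPreflightSketch {
    // Builds the response headers for an OPTIONS preflight request.
    // An empty map means the origin is not allowed, and the browser will report a CORS error.
    static Map<String, String> preflightHeaders(String origin, Set<String> allowedOrigins) {
        Map<String, String> headers = new LinkedHashMap<>();
        if (allowedOrigins.contains("*")) {
            headers.put("Access-Control-Allow-Origin", "*");
        } else if (allowedOrigins.contains(origin)) {
            headers.put("Access-Control-Allow-Origin", origin);
        } else {
            return headers;
        }
        headers.put("Access-Control-Allow-Methods", "GET, POST, OPTIONS");
        headers.put("Access-Control-Allow-Headers", "Authorization, Content-Type");
        return headers;
    }

    public static void main(String[] args) {
        Set<String> allowed = Set.of("http://localhost:4200"); // hypothetical frontend origin
        System.out.println(preflightHeaders("http://localhost:4200", allowed));
        System.out.println(preflightHeaders("http://evil.example", allowed));
    }
}
```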
OAuth2:
OAuth2 (Open Authorization) is a protocol made to delegate authorization to a third party, so you could use Google, for example, to get authenticated. If you choose Google, it opens a new window to let you sign in, and afterwards you're redirected back to the app.
You have the client app that is trying to access the backend, which is a protected resource requiring authentication. Instead of the backend handling authentication directly, it redirects you to the authorization server and sends some parameters to identify which client made the request. You need to create a client ID, and the callback URL contains the URL of your server so that after authentication the authorization server can redirect back to it.
What does the authorization server do? It shows a login form that you fill in with your credentials. Once submitted, the authorization server authenticates you. We use OpenID Connect for this, which means authentication can be done by a third party that manages user data.
After successful authentication, the authorization server generates an authorization code, which expires quickly because it is meant to be used right away. This code is sent back to the app through the callback URL. The app receives the authorization code and exchanges it with the authorization server, which checks that the code is valid and not expired. If it is valid, the app retrieves the token (with OpenID Connect providers such as Keycloak, this is usually a JWT), which contains the user session. Then the session is opened and access to the resource is granted.
So authentication is handled through the authorization server, and once the user is authenticated, they are redirected back to the app and given access to the resource. That is the basic principle.
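The redirect to the authorization server is just a URL carrying the standard OAuth2 query parameters (response_type, client_id, redirect_uri, scope, state). A minimal sketch of building it in Java follows; the endpoint, realm name, client ID, and callback URL are hypothetical values chosen for the example.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class AuthorizationRequestSketch {
    private static String enc(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    // Builds the URL the browser is redirected to at the start of the authorization code flow.
    // `state` is a random value the app checks again when the callback arrives.
    static String authorizationUrl(String authEndpoint, String clientId, String redirectUri, String state) {
        return authEndpoint
                + "?response_type=code"
                + "&client_id=" + enc(clientId)
                + "&redirect_uri=" + enc(redirectUri)
                + "&scope=" + enc("openid")
                + "&state=" + enc(state);
    }

    public static void main(String[] args) {
        System.out.println(authorizationUrl(
                "http://localhost:8080/realms/demo-realm/protocol/openid-connect/auth", // hypothetical realm
                "billing-app",                                                          // hypothetical client ID
                "http://localhost:8081/callback",                                       // hypothetical callback URL
                "xyz123"));
    }
}
```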
OpenID Connect
OpenID Connect adds an ID token, which describes the authenticated user, on top of the two main tokens inherited from OAuth2: the access token and the refresh token. What are they?
To understand this, what is the issue with JWT? When we generate a JWT, we must add an expiration date. Let's say we give it a validity of one year. This means the user will have access for one year. If we want to revoke that access, we would have to wait until the token expires, which is not practical.
That's why two tokens are used:
- the access token, which has a short lifetime and is used to access protected resources.
- the refresh token, which has a longer lifetime and is used to generate new access tokens.
When accessing a resource, the client uses the access token, and the application reads the user role and permissions from it. When the access token expires, the refresh token is used to request a new access token. However, once the refresh token expires, the user must authenticate again.
So what is the advantage? Even though the refresh token has a longer lifetime, each time a new access token is requested, the system can re-check the user's roles and permissions. This allows the system to verify whether the user is still authorized to access the environment before issuing a new access token.
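The re-check on refresh can be sketched like this in plain Java. It only models the decision, not real tokens; the class name, the illustrative lifetimes of 5 and 30 minutes, and the in-memory role map are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

class TokenRefreshSketch {
    static final long ACCESS_TOKEN_TTL_SECONDS = 300;    // short-lived (illustrative)
    static final long REFRESH_TOKEN_TTL_SECONDS = 1800;  // longer-lived (illustrative)

    // Current roles on the server side; removing a user here revokes his access.
    private final Map<String, String> currentRoleByUser = new HashMap<>();

    void grant(String username, String role) {
        currentRoleByUser.put(username, role);
    }

    void revoke(String username) {
        currentRoleByUser.remove(username);
    }

    // Returns the expiry time (in seconds) of a fresh access token, or -1 when the refresh is refused.
    long refresh(String username, long refreshTokenIssuedAt, long now) {
        if (now - refreshTokenIssuedAt >= REFRESH_TOKEN_TTL_SECONDS) {
            return -1; // refresh token expired: the user must authenticate again
        }
        if (!currentRoleByUser.containsKey(username)) {
            return -1; // access was revoked since the last token was issued
        }
        return now + ACCESS_TOKEN_TTL_SECONDS;
    }

    public static void main(String[] args) {
        TokenRefreshSketch tokens = new TokenRefreshSketch();
        tokens.grant("alice", "ADMIN");
        System.out.println(tokens.refresh("alice", 0, 600));
        tokens.revoke("alice");
        System.out.println(tokens.refresh("alice", 0, 600));
    }
}
```

With this split, a revoked user keeps access for at most the lifetime of the current access token.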
Keycloak:
Keycloak is an open source tool written in Java that lets us do 3 things:
- identity management
- authentication using OpenID Connect
- delegation of authorization using the OAuth2 protocol
Once you start it, it uses an H2 database by default. Later we can replace it with a different database like PostgreSQL, but for development purposes we can keep H2.
You have a frontend and a backend app to protect. For the backend, we use Spring Security, which lets us secure the backend. For the frontend, we use Keycloak adapters, which are libraries that make securing the app easier.
When you try to access the frontend without being authenticated, it redirects you to Keycloak, which shows an authentication window. After you provide your credentials, it performs the authentication and generates an authorization code. You are then redirected back to your frontend, which communicates with the backend. The backend requires a token, retrieves the JWT, and opens the session.
When the browser requests a resource from the backend, the frontend sends the JWT to the backend. Once the backend receives the JWT, it verifies the signature, which requires the public key. We retrieve this public key from Keycloak. After that, everything follows the same principle as explained before.
So we need to install Keycloak.
Make sure to always install the latest version, because updates usually include security fixes.
How to install Keycloak:
https://www.keycloak.org/getting-started/getting-started-zip
How to start it:
https://www.keycloak.org/getting-started/getting-started-zip#_start_keycloak
After starting it, open:
http://localhost:8080/
to access the administration console:
Register and sign in.
The first step is to navigate to Manage realms and create a realm through the UI. We need it to store our application clients so we can assign roles and control authorization. If you have an LDAP directory, you can configure it instead of managing identities directly inside realms.
Once it is created, create a new client:
Keep the default options (this is just a demo client), then save and click on create a role:
Then create a new user:
Create a password:
Then click on role mapping, choose assign role, select client role, pick the admin role we previously created, then click assign:
Now go to your realm settings and click on the OpenID Connect endpoint configuration:
Or simply open:
http://localhost:8080/realms/{your realm name}/.well-known/openid-configuration
You will get a JSON response. Retrieve the token endpoint (token_endpoint) and call it from your API client (I am using HTTPie).
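For readers who prefer code to an API client, here is a sketch of that call built with the JDK's HttpClient types. It assumes the password grant is used with a client that allows it; the realm name, client ID, username, and password are hypothetical, and the request is only built here, not sent.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class TokenRequestSketch {
    // Encodes the fields as application/x-www-form-urlencoded.
    static String formBody(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    // Body for the password grant (assumption: this is the grant used at this step).
    static String passwordGrantBody(String clientId, String username, String password) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("grant_type", "password");
        fields.put("client_id", clientId);
        fields.put("username", username);
        fields.put("password", password);
        return formBody(fields);
    }

    // POST request to the token_endpoint taken from the openid-configuration document.
    static HttpRequest tokenRequest(String tokenEndpoint, String body) {
        return HttpRequest.newBuilder(URI.create(tokenEndpoint))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) {
        String body = passwordGrantBody("demo-client", "user1", "s3cret!"); // hypothetical values
        HttpRequest request = tokenRequest(
                "http://localhost:8080/realms/demo-realm/protocol/openid-connect/token", body);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(body);
    }
}
```

Sending it would be a matter of passing the request to `HttpClient.newHttpClient().send(...)` with a string body handler.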
It will generate both:
- the access token, which expires in 5 minutes ("expires_in": 300)
- the refresh token, which expires in 30 minutes ("refresh_expires_in": 1800)
The refresh token is only used to renew access tokens.
If you want to authenticate using the refresh token, your request body will look like this:
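A minimal sketch of such a body, using the standard OAuth2 refresh_token grant. The client ID and the token value are placeholders, and the helper name is an assumption.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class RefreshRequestSketch {
    private static String enc(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    // Form body sent to the token endpoint to trade a refresh token for a new access token.
    static String refreshGrantBody(String clientId, String refreshToken) {
        return "grant_type=refresh_token"
                + "&client_id=" + enc(clientId)
                + "&refresh_token=" + enc(refreshToken);
    }

    public static void main(String[] args) {
        // Placeholder values: use your own client ID and the refresh token you received.
        System.out.println(refreshGrantBody("demo-client", "REFRESH_TOKEN_VALUE"));
    }
}
```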

Now that we saw how to authenticate with a refresh token, let's look at how to authenticate using a client secret, which we retrieve from the client's credentials section:
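One way to use the client secret is the standard OAuth2 client_credentials grant, sketched below. This is an assumption about which grant is meant here: the secret can also simply be added as a client_secret field to the other grants when the client is confidential. The client ID and secret are placeholders.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class ClientCredentialsSketch {
    private static String enc(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    // Form body for the client_credentials grant: the client authenticates as itself, without a user.
    static String clientCredentialsBody(String clientId, String clientSecret) {
        return "grant_type=client_credentials"
                + "&client_id=" + enc(clientId)
                + "&client_secret=" + enc(clientSecret);
    }

    public static void main(String[] args) {
        // Placeholder values: the real secret comes from the client's credentials section in Keycloak.
        System.out.println(clientCredentialsBody("demo-client", "CLIENT_SECRET_VALUE"));
    }
}
```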
















