Microservice health check in Kubernetes
TL;DR
Every service should provide a standard endpoint for health checking and monitoring. The endpoint should conform to the requirements elaborated in the Requirements section.
Background
What is a health check
A health check reports the health status of a service: whether the service is able to handle requests, or whether it is in a bad state and should be restarted.
Why health checks are needed
High availability
There are many cases in which a service is started or restarted:
- instance/pod restart
- service/deployment scale-up
- rolling update
Under these circumstances, a request forwarded to a service that is still in the middle of starting or restarting will probably fail. We therefore need to make sure a service is healthy and ready to accept requests before adding it to the load balancer (Kubernetes Service); this reduces service downtime and helps achieve high availability.
Service stability
A service that has been running for a long time may fall into a bad state in which it can no longer handle requests properly. In this case the service must be prohibited from receiving requests until it recovers, either via a restart or manual intervention. This keeps the service as a whole stable.
Monitoring
A big part of the DevOps responsibility is to monitor and maintain the health of running services. If a service goes down, appropriate actions should be taken to bring it back to life. The health check tells the DevOps team whether the service is malfunctioning.
Clients of health checks
- Load balancer (Kubernetes service)
- Monitoring service (Prometheus probe)
- Kubelet (readiness/liveness probes on pods)
Downsides of health check
Because health checks run periodically rather than in real time, there can be a time gap before an unhealthy state becomes known to the clients. To mitigate this, a reasonable checking period should be set.
Requirements
What should be checked
Because the definition of "healthy" varies from service to service, depending on the service's application logic, there can be several levels of health:
- the service is up
- the service is up and the infrastructure services it uses are healthy
- the service is up, the infrastructure services it uses are healthy, and its dependent microservices are healthy
- the service is up, the infrastructure services it uses are healthy, its dependent microservices are healthy, and smoke tests pass
Each service may define its own criteria; however, the result of these checks must be definite, i.e. the service is either healthy or unhealthy, with no intermediate state.
How to expose health check to clients
- The service should implement the health check in a RESTful API manner.
- The endpoint is unified as “/health”
How the health check responds to clients
Status code
- 200 OK for healthy
- 503 Service Unavailable for unhealthy
Response body
The response body may be empty; however, attaching additional information about what was checked and the result of each check is preferred.
Security/Access control
The health check should be private and limited to internal access. However, if it is open to public access:
- For unauthenticated access, the service should provide only basic health info, returning an UP/DOWN status
- For authenticated access, the service may provide more detailed health info
Implementation
Examples
Service OK
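The original example block did not survive extraction; below is an illustrative healthy response (status 200 OK), with per-dependency details as recommended above. Field names such as `checks` and `database` are assumptions, not part of the spec:

```json
{
  "status": "UP",
  "checks": {
    "database": "UP",
    "messageQueue": "UP"
  }
}
```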
Service Unavailable
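The original example block is missing; an illustrative unhealthy response (status 503 Service Unavailable), using the same assumed field names, would report which check failed:

```json
{
  "status": "DOWN",
  "checks": {
    "database": "DOWN",
    "messageQueue": "UP"
  }
}
```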
Authenticated access
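The original example block is missing; a sketch of a more detailed response for authenticated callers, per the access-control rules above. All field names and values here are illustrative assumptions:

```json
{
  "status": "UP",
  "checks": {
    "database": {
      "status": "UP",
      "responseTimeMs": 12
    },
    "messageQueue": {
      "status": "UP",
      "responseTimeMs": 3
    },
    "dependentService": {
      "status": "UP",
      "endpoint": "http://inventory-service/health"
    }
  }
}
```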
Libraries
Java
Go
N/A
Client Integration
Kubernetes integration
Please refer to https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/
Readiness and liveness probes can be used in parallel for the same container. Using both can ensure that traffic does not reach a container that is not ready for it, and that containers are restarted when they fail.
Readiness Probe
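The original snippet is missing; a minimal readiness probe sketch for the container spec, assuming the service listens on port 8080 (the port and timing values are illustrative):

```yaml
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
```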
Liveness Probe
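The original snippet is missing; a corresponding liveness probe sketch against the same assumed port. The longer delay and `failureThreshold` give the service time to start before a failed check triggers a restart:

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
  failureThreshold: 3
```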
Prometheus integration
Prometheus polls the health API constantly and stores the results in its time series database. If the health check metrics match a predefined alert rule, an alert is triggered.
Scrape config
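The original config block is missing; a sketch of a scrape job that discovers annotated Kubernetes Services and probes their `/health` endpoint through the blackbox exporter. The job name and the exporter address `blackbox-exporter:9115` are assumptions:

```yaml
scrape_configs:
  - job_name: 'kubernetes-healthchecks'
    metrics_path: /probe
    params:
      module: [http_2xx]
    kubernetes_sd_configs:
      - role: service
    relabel_configs:
      # Keep only services annotated with prometheus.io/healthcheck: "true"
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_healthcheck]
        action: keep
        regex: true
      # Probe the service's /health endpoint instead of scraping it directly
      - source_labels: [__address__]
        regex: (.*)
        target_label: __param_target
        replacement: http://$1/health
      - source_labels: [__param_target]
        target_label: instance
      # Send the probe request through the blackbox exporter
      - target_label: __address__
        replacement: blackbox-exporter:9115
```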
Service annotation
Add the prometheus.io/healthcheck annotation to the Kubernetes Service so that it can be discovered by the health check scrape job.
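The original manifest is missing; a sketch of an annotated Service (the service name, selector, and port are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service          # illustrative name
  annotations:
    prometheus.io/healthcheck: "true"
spec:
  selector:
    app: my-service
  ports:
    - port: 8080
      targetPort: 8080
```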
Blackbox exporter config
Configure an http_2xx module to probe the health API.
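The original config is missing; a minimal `http_2xx` module sketch for the blackbox exporter, which treats any 2xx response as success (matching the 200/503 convention above). The timeout value is an assumption:

```yaml
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      method: GET
      valid_status_codes: []  # empty defaults to 2xx
```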