As you lean into your Kubernetes (k8s) cluster’s self-healing capabilities, one of the first concepts you’ll encounter is that of readiness, liveness, and startup probes. These can be a little confusing because there is a bit of overlap between them. There’s also a lot of risk in getting your configuration wrong, so it’s an area you’ll want to put significant thought towards.
What are probes in this context?
k8s probes are effectively abstractions of health checks. Each type has a different function, but they all provide your cluster with feedback on how it should handle a specific container when it is in various unhealthy/ready states.
Before you slap a bunch of probes on your deployment configuration, it’s important to understand the behavior of each probe type and whether you actually need it or not. Probes can affect one another, so layering them on may have unexpected effects.
You’ll already encounter strange problems in the simplest of dynamic environments, don’t add to your worries by copy/pasting in a bunch of probe configurations think you’re making everything better. Note: Finger pointed at self.
The probes all use identical configuration options (but should not be configured identically!), which does make things easier. They can be based on TCP, HTTP, or EXEC responses, making them very flexible and likely preventing you from having to write a lot of custom health check nonsense.
livenessProbe: tcpSocket: port: 8080 initialDelaySeconds: 15 periodSeconds: 20
Most configuration examples you come across won’t contain startup probes, because 1) they are newish (k8s 1.16), and 2) you likely don’t need one unless your application is old and janky or particularly slow to startup – which may be because it is old and janky.
Startup probes signal your cluster that the container in question has fully started and is ready to receive traffic.
Startup probes keep the other probe types from starting and potentially killing your slow-to-start app container. It’s important to note that they only run at container start and will either kick over responsibility to another probe (if successful) or kill your container (if they timeout).
They also allow more granular control, since they specifically target the pod/container startup, allowing you to configure the other probes to handle different container states.
That being said, readiness probes handle startup detection well and can generally be used in lieu of a startup probe.
You can usually get away with running only readiness probes. This probe type detects when a container is ready to receive traffic, which may include some condition past “Hi, I’ve started.” depending on your app.
Although the name might imply that the probe justs check for the initial ready state and be done, these probes actually run continuously based on the period you set. If they ever meet the configured
failureThreshold, your cluster will stop directing traffic to the container until it is healthy again and meets the
You’ll want to be thoughtful with the timing you use for these as mistiming can lead to pods/containers cycling in and out of ready state.
Also, very important, the health check that your readiness probe targets should not rely on any external services to report health (i.e. a database connection). If your cluster or the external service encounters a network blip that it should recover from gracefully (and quickly), the readiness probe might put things in a state where your entire service is unable to serve traffic (or other bad things) because containers reported as temporarily unhealthy. This post has solid guidance on what to think through when configuring your readiness probes.
Liveness probes detect stalled containers (deadlocks) and trigger a restart of the container. Ideally, instead of configuring a liveness probe, you should just fix whatever is causing deadlocks, cuz deadlocks are bad.
Liveness and readiness probes can compete with one another so be careful in your implementation if you have to use one. Again, don’t use these unless you absolutely have to. Because your containers can get stuck in a restart loop, liveness probes have high potential to cause a cascading failure across your service (and possibly cluster).
Letting a container die and exit 1 is a much better solution. I won’t go as far as to say that you should never use a liveness probe, but you really need to be confident in the need and configuration due to the risk of what might happen if your liveness probe doesn’t behave as expected (or behaves as expected and drives some other unexpected behavior downstream).
Start with readiness probes and only use the other probe types if explicitly needed and you know what you’re doing.