As you know, kubelet is a primary node component in Kubernetes that performs a number of critical tasks. In particular, kubelet is responsible for:
Another important kubelet task, and the subject of this article, is evicting Pods when a node runs out of resources. The kubelet plays a crucial role in preserving node stability when compute resources such as memory or disk space run low. Kubernetes administrators should understand best practices for configuring out-of-resource handling so that node resources are used flexibly while the system stays fault tolerant and critical system processes stay stable.
As we have mentioned, kubelet can evict workloads from a node to free up resources for other Pods and/or system tasks like the container runtime or the kubelet itself. However, how does the kubelet decide that the resources are low?
The kubelet decides when to reclaim resources based on eviction signals and eviction thresholds. An eviction signal is the current state of a system resource, such as the amount of memory or storage still available. An eviction threshold, in turn, is the minimum amount of that resource the kubelet should maintain.
In other words, each eviction signal is associated with a certain eviction threshold that tells the kubelet when to start reclaiming resources. At this time, the following eviction signals are supported.
nodefs.available: nodefs is the filesystem the kubelet uses for volumes, daemon logs, etc. By default, the kubelet starts reclaiming node resources if nodefs.available < 10%.
nodefs.inodesFree: the number of free inodes on nodefs. By default, the kubelet starts evicting workloads if nodefs.inodesFree < 5%.
imagefs.inodesFree: the number of free inodes on imagefs, the optional filesystem that container runtimes use for images and writable layers. It has no default eviction threshold.
The above-described eviction thresholds are quite sensible defaults. However, users can configure their custom eviction thresholds by setting appropriate flags on the kubelet binary. These user-defined thresholds can change the default kubelet eviction behavior.
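To make the relationship between a signal and its threshold concrete, here is a minimal Python sketch of the comparison. It is only an illustration, not the kubelet's actual code: the helper names parse_quantity and threshold_met are our own, and only binary suffixes (Ki, Mi, Gi, Ti) are handled.

```python
def parse_quantity(text: str) -> float:
    """Convert a quantity such as '500Mi' or '1.5Gi' to bytes (binary suffixes only)."""
    units = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
    for suffix, factor in units.items():
        if text.endswith(suffix):
            return float(text[: -len(suffix)]) * factor
    return float(text)  # plain number: bytes, or an inode count


def threshold_met(available: float, capacity: float, threshold: str) -> bool:
    """Return True when `available` has dropped below `threshold`.

    `threshold` is either a percentage of capacity ('10%') or a quantity ('1Gi').
    """
    if threshold.endswith("%"):
        limit = capacity * float(threshold[:-1]) / 100
    else:
        limit = parse_quantity(threshold)
    return available < limit


# A 100Gi nodefs with 8Gi free is below the default 10% threshold.
print(threshold_met(parse_quantity("8Gi"), parse_quantity("100Gi"), "10%"))
```

The same function covers both threshold styles: a percentage is resolved against the resource's capacity, while a quantity is compared directly.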
At this time, Kubernetes supports hard and soft eviction thresholds.
If a hard eviction threshold is reached, the kubelet starts reclaiming resources immediately, without any grace period. In contrast, soft eviction thresholds include a user-defined grace period that should expire before the kubelet starts reclaiming any resources.
You can define a hard eviction threshold with the --eviction-hard flag on the kubelet binary. For example, kubelet --eviction-hard='memory.available<1Gi' tells the kubelet to start reclaiming resources as soon as the node’s memory.available drops below 1Gi. (The quotes keep the shell from treating < as an input redirection.)
If you want to allow a grace period before eviction, combine the --eviction-soft flag with the --eviction-soft-grace-period flag. For example, kubelet --eviction-soft='memory.available<2Gi' --eviction-soft-grace-period=memory.available=1m30s makes the kubelet wait until memory.available has stayed below 2Gi for 90 seconds before it starts evicting.
Users can also cap the termination grace period granted to Pods that are evicted because of a soft eviction threshold by setting the --eviction-max-pod-grace-period flag, in seconds.
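The difference between hard and soft thresholds is mostly a matter of timing, which a few lines of Python can illustrate. This is a simplified model under our own assumptions (the class name SoftThreshold and the reset-on-recovery behavior as written here are illustrative), not the kubelet's implementation. A hard threshold would correspond to a grace period of zero.

```python
from typing import Optional


class SoftThreshold:
    """Tracks how long a soft eviction threshold has been continuously crossed."""

    def __init__(self, grace_period_s: float):
        self.grace_period_s = grace_period_s
        # Time at which the threshold was first observed as crossed.
        self.first_seen: Optional[float] = None

    def should_evict(self, crossed: bool, now: float) -> bool:
        if not crossed:
            # The signal recovered, so the grace period starts over.
            self.first_seen = None
            return False
        if self.first_seen is None:
            self.first_seen = now
        return now - self.first_seen >= self.grace_period_s


soft = SoftThreshold(grace_period_s=90)  # 1m30s
print(soft.should_evict(True, now=0))    # crossed, grace period just started
print(soft.should_evict(True, now=90))   # crossed for the full 90 seconds
```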
The kubelet reclaims resources at the expense of the end-user Pods as a last resort. It first tries to reclaim such resources as unused container images or dead Pods.
The kubelet reclaims node resources differently if a node has a dedicated imagefs filesystem along with the nodefs filesystem. In this case, if the nodefs reaches the eviction threshold, the kubelet deletes all dead Pods and their containers. Correspondingly, if the imagefs reaches the eviction threshold, the kubelet removes all unused container images.
If there is no imagefs used, the kubelet first deletes all dead Pods and their containers and then removes all unused images. For more information about this process, see this article from the Kubernetes documentation.
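The node-level reclaim behavior described above can be summarized as a small decision function. The sketch below only restates the prose in Python; the function name and the step descriptions are ours and do not correspond to any kubelet API.

```python
from typing import List


def node_level_reclaim_steps(signal: str, has_imagefs: bool) -> List[str]:
    """Return the node-level cleanup steps tried before any end-user Pod is evicted."""
    dead_pods = "delete dead Pods and their containers"
    images = "remove unused container images"

    if not signal.startswith(("nodefs", "imagefs")):
        return []  # this cleanup only applies to disk-related signals
    if not has_imagefs:
        return [dead_pods, images]  # single filesystem: try both, in this order
    if signal.startswith("nodefs"):
        return [dead_pods]
    return [images]  # imagefs signal on a node with a dedicated imagefs


print(node_level_reclaim_steps("nodefs.available", has_imagefs=False))
```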
If reclaiming container images, dead Pods, and other resources does not relieve the resource starvation, the kubelet starts evicting end-user Pods as a last resort. The kubelet decides which end-user Pods to evict based on the Pod’s QoS class, Pod Priority, and a number of other parameters discussed below. Before describing this process, let’s recall the basic QoS classes in Kubernetes.
As you may already know from our previous tutorials, in Kubernetes, Pods can be Guaranteed, Burstable, or BestEffort.
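As a quick refresher, the sketch below derives a Pod's QoS class from its containers' CPU and memory requests and limits, following the rules in the Kubernetes documentation. It is a simplification under stated assumptions: containers are plain dictionaries of our own design, and quantities are compared as strings, so "1Gi" and "1024Mi" would not be treated as equal.

```python
from typing import Dict, List

# Hypothetical container shape: {"requests": {...}, "limits": {...}}
Container = Dict[str, Dict[str, str]]


def qos_class(containers: List[Container]) -> str:
    """Classify a Pod as Guaranteed, Burstable, or BestEffort."""
    any_set = False
    all_guaranteed = True
    for container in containers:
        limits = container.get("limits", {})
        # When only a limit is given, Kubernetes defaults the request to the limit.
        requests = {**limits, **container.get("requests", {})}
        for resource in ("cpu", "memory"):
            if resource in requests:
                any_set = True
            if resource not in limits or requests[resource] != limits[resource]:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"


print(qos_class([{"requests": {"memory": "1Gi"}}]))
```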
This QoS model is implicitly used by the kubelet in its Pod ranking scheme. In general, the kubelet ranks candidates for eviction using the following rules:
Given these rules, the kubelet evicts end-user Pods in the following order:
If the kubelet reclaims only a small amount of resources, the system can hit eviction thresholds repeatedly. This is undesirable because it can lead to poor scheduling decisions and frequent Pod evictions. To avoid this scenario, users can set a per-resource minimum reclaim level using the --eviction-minimum-reclaim flag on the kubelet binary.
For example, take a look at the kubelet configuration below:
This --eviction-minimum-reclaim setting ensures that at least 3Gi of nodefs storage and at least 202Gi of imagefs storage are available after the reclaim. Thus, the configuration above leaves the system enough headroom to avoid hitting eviction thresholds very frequently.
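The arithmetic behind those figures is simply the eviction threshold plus the minimum reclaim for each signal. The sketch below works it through in Python. Note that the threshold and minimum-reclaim values are hypothetical, chosen only so that the sums match the 3Gi and 202Gi mentioned above.

```python
GI = 2**30


def reclaim_target(threshold: int, minimum_reclaim: int) -> int:
    """Amount that must be available before the kubelet stops reclaiming."""
    return threshold + minimum_reclaim


# Hypothetical values, picked only to reproduce the 3Gi and 202Gi figures.
thresholds = {"nodefs.available": 2 * GI, "imagefs.available": 200 * GI}
minimum_reclaim = {"nodefs.available": 1 * GI, "imagefs.available": 2 * GI}

targets = {
    signal: reclaim_target(thresholds[signal], minimum_reclaim[signal])
    for signal in thresholds
}
for signal, target in targets.items():
    print(f"{signal}: reclaim until {target // GI}Gi is available")
```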
Another potential issue you can encounter with the poor out-of-resource handling configuration is the oscillation of node conditions. When the kubelet receives an eviction signal, the latter is mapped to a corresponding node condition. For example, when the memory.available eviction threshold is hit, the kubelet assigns a MemoryPressure node condition to the node. This condition is associated with the corresponding taint that prevents new Pods from being scheduled on a node with the MemoryPressure node condition. You can find more information about node conditions in our earlier article.
However, if a node hovers just above and below a soft eviction threshold without ever exceeding the grace period, the corresponding node condition can oscillate between true and false. This makes eviction unpredictable and leads to poor scheduling decisions. To avoid this situation, set the --eviction-pressure-transition-period flag on the kubelet, which defines how long the kubelet must wait before transitioning a node out of a pressure condition.
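The damping effect of the transition period can be modeled in a few lines. The sketch below is an assumption-laden illustration rather than kubelet code: the class name PressureCondition is ours, and the 300-second period is just an example value.

```python
from typing import Optional


class PressureCondition:
    """Keeps a node condition true until the threshold has stayed clear long enough."""

    def __init__(self, transition_period_s: float):
        self.transition_period_s = transition_period_s
        # Last time the eviction threshold was observed as crossed.
        self.last_crossed: Optional[float] = None

    def update(self, crossed: bool, now: float) -> bool:
        """Record an observation and return whether the condition is reported as true."""
        if crossed:
            self.last_crossed = now
            return True
        if self.last_crossed is None:
            return False
        # The condition only clears once the threshold has not been seen
        # for the whole transition period.
        return now - self.last_crossed < self.transition_period_s


memory_pressure = PressureCondition(transition_period_s=300)
print(memory_pressure.update(True, now=10))    # threshold crossed
print(memory_pressure.update(False, now=100))  # recovered, but still reported
```

Without the transition period, the second observation would flip the condition back to false immediately, which is exactly the oscillation described above.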
Now we’ll illustrate how to configure out-of-resource handling for your K8s cluster. Let’s imagine a simple scenario where only node RAM is considered. Assume that our node’s memory capacity is 10Gi of RAM. We would like to reserve 10% of total memory for system daemons like kernel, kubelet, Docker, etc. We also want to evict Pods at 95% of memory utilization.
The kubelet is launched with the default eviction thresholds and without system-reserved set. We need to explicitly set several flags on the kubelet to enable the behavior we want.
To achieve our goal, we need to set the following flags on the kubelet:
As you can see, system-reserved is set to 1.5Gi although, intuitively, it should be 10% of 10Gi, or 1Gi. However, system-reserved must also include the memory covered by the eviction threshold: evicting at 95% utilization leaves 5% of 10Gi, or 0.5Gi, so the total is 1Gi + 0.5Gi = 1.5Gi.
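If you want to redo this calculation for a node of a different size, the following Python sketch derives both values from the node's capacity. The function name and its parameters are our own, and it prints quantities in Mi (512Mi and 1536Mi, which equal 0.5Gi and 1.5Gi) to keep the arithmetic in whole numbers. The --eviction-hard and --system-reserved flags themselves are real kubelet flags.

```python
from typing import List

MI = 2**20
GI = 2**30


def out_of_resource_flags(
    capacity_bytes: int, reserved_percent: int, evict_at_percent: int
) -> List[str]:
    """Derive kubelet memory flags from capacity and the desired percentages."""
    # Memory that must remain available: 100% minus the eviction utilization.
    eviction_threshold = capacity_bytes * (100 - evict_at_percent) // 100
    # system-reserved has to cover the daemons plus the eviction threshold.
    system_reserved = capacity_bytes * reserved_percent // 100 + eviction_threshold
    return [
        f"--eviction-hard=memory.available<{eviction_threshold // MI}Mi",
        f"--system-reserved=memory={system_reserved // MI}Mi",
    ]


# The scenario from this article: 10Gi node, 10% reserved, evict at 95%.
for flag in out_of_resource_flags(10 * GI, reserved_percent=10, evict_at_percent=95):
    print(flag)
```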
Depending on how you provision the K8s cluster, the kubelet flags can be set differently. For example, if you plan to provision your K8s cluster with Kops, run kops edit cluster $NAME to open the cluster configuration in an editor. If the editor is vi, press “i” to enter Insert mode and edit the file. The kubelet flags for the above out-of-resource handling policy should look as follows:
That’s it! In this tutorial, we discussed some useful Kubernetes administration practices for customizing the kubelet’s out-of-resource management. The platform allows administrators to set custom eviction thresholds and eviction grace periods to decide which conditions are considered dangerous for node stability. With that freedom, however, comes much responsibility. Kubernetes ships with sensible out-of-resource management defaults, so be careful not to set eviction thresholds too high or make eviction grace periods too long.