
Making Sense of Taints and Tolerations in Kubernetes

In the earlier tutorial, you learned how to assign Pods to nodes in Kubernetes using the nodeSelector and affinity features. As we tried to demonstrate, affinity is a great feature for such use cases as creating dedicated nodes, distributing Pods evenly across the cluster, or co-locating Pods on the same machine. In this tutorial, we’ll discuss Kubernetes taints and tolerations, another feature for advanced Pod scheduling in Kubernetes.

Taints are used to repel Pods from specific nodes. This is quite similar to the node anti-affinity discussed in the previous blog post. However, taints and tolerations take a slightly different approach. Instead of applying a label to a node, we apply a taint that tells the scheduler to repel Pods from this node unless they tolerate the taint. Only Pods that have a matching toleration can be scheduled onto a node with that taint.

However, why would you need this feature if similar behavior can be achieved with a custom implementation of node anti-affinity using logical operators like NotIn and DoesNotExist? As we’ll try to show in this article, taints together with tolerations allow for more fine-grained control over Pod eviction and anti-affinity than custom node anti-affinity with logical operators. You’ll also learn how the automatic tainting of nodes with certain node conditions, in combination with tolerations, can be leveraged to control Pod behavior on nodes experiencing such problems as network unavailability, low disk space, low memory, etc. Let’s get started!

Use Cases for Taints and Tolerations

Before studying how taints and tolerations work, you probably would like to know how they can improve your K8s cluster administration. In general, taints and tolerations support the following use cases:

  • Dedicated nodes. Users can use a combination of node affinity and taints/tolerations to create dedicated nodes. For example, you can limit the set of nodes onto which certain Pods can be scheduled by using labels and node affinity, apply taints to those nodes, and then add corresponding tolerations to the Pods so that they are scheduled only on those particular nodes. We’ll show how to implement this use case in detail at the end of the article.
  • Nodes with special hardware. If you have nodes with special hardware (e.g., GPUs), you want to repel Pods that do not need this hardware and attract Pods that do need it. This can be done by tainting the nodes that have the specialized hardware (e.g., kubectl taint nodes nodename special=true:NoSchedule) and adding a corresponding toleration to Pods that must use this special hardware.
  • Taint-based Evictions. New Kubernetes versions allow configuring per-Pod eviction behavior on nodes that experience problems. Taint-based evictions will be discussed in detail below.

Working with Taints and Tolerations

The process of adding taints and tolerations to nodes and Pods is similar to how node affinity works.

First, we add a taint to a node that should repel certain Pods. For example, if your node’s name is host1 , you can add a taint using the following command:
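kubectl taint nodes host1 special=true:NoSchedule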

The taint has the format <taintKey>=<taintValue>:<taintEffect>. Thus, the taint we just created has the key “special”, the value “true”, and the taint effect NoSchedule. A taint’s key and value can be any arbitrary string, and the taint effect must be one of the supported taint effects: NoSchedule, NoExecute, or PreferNoSchedule.

Taint effects define how nodes with a taint react to Pods that don’t tolerate it. For example, the NoSchedule taint effect means that unless a Pod has a matching toleration, it won’t be scheduled onto host1. Other supported effects include PreferNoSchedule and NoExecute. The former is the “soft” version of NoSchedule: if PreferNoSchedule is applied, the system will try not to place a Pod that does not tolerate the taint on the node, but this is not guaranteed. Finally, if the NoExecute effect is applied, the node controller will immediately evict all Pods without a matching toleration from the node.

You can verify that the taint was applied by running kubectl describe nodes <your-node-name>  and checking the Taints section of the response:
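Taints:             special=true:NoSchedule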

Awesome! As you see, our new taint special=true:NoSchedule  was successfully added.

If you don’t need a taint anymore, you can remove it like this:
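kubectl taint nodes host1 special:NoSchedule-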

In this command, you specify a taint key with a corresponding effect to remove.

Adding Tolerations

As you already know, taints and tolerations work together. Without a toleration, no Pod can be scheduled onto a node with a taint. That’s not what we’re trying to achieve! Let’s now create a Pod with a toleration for the taint we created above. Tolerations are specified in the PodSpec:
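A minimal manifest with such a toleration might look like this (the Pod name and container image are just placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: pod-with-toleration     # placeholder name
spec:
  containers:
  - name: app
    image: nginx                # placeholder image
  tolerations:
  - key: "special"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"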

As you see, the Pod’s toleration has the key “special”, the value “true”, and the effect “NoSchedule”, which exactly matches the taint we applied earlier. Therefore, this Pod can be scheduled onto host1. However, this does not mean that the Pod will be scheduled onto that exact node, because we did not use node affinity or nodeSelector.

The second Pod below can also be scheduled onto the tainted node, even though it uses the operator “Exists” and does not define a value for the key.
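For example (again, the name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: pod-with-exists-toleration   # placeholder name
spec:
  containers:
  - name: app
    image: nginx                     # placeholder image
  tolerations:
  - key: "special"
    operator: "Exists"
    effect: "NoSchedule"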

This demonstrates the following rule: if the operator is Exists, the toleration matches the taint when the keys and effects are the same (no value should be specified). However, if the operator is Equal, the toleration’s and taint’s values should also be equal.

The image below illustrates two tolerations using the operator “Equal”. Because we use this operator, the toleration’s and taint’s keys, values, and effects should all be equal. Therefore, only the first Pod, which has the same key, value, and effect as the taint, can be scheduled onto host1.

[Image: Taints and Tolerations]
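In other words, given the taint special=true:NoSchedule, the first toleration below matches it exactly, while the second one does not (its value of “false” is made up for contrast):

# Pod 1 – key, value, and effect all match the taint, so this Pod can be scheduled onto host1
tolerations:
- key: "special"
  operator: "Equal"
  value: "true"
  effect: "NoSchedule"

# Pod 2 – the value differs, so the taint is not tolerated and this Pod is repelled
tolerations:
- key: "special"
  operator: "Equal"
  value: "false"
  effect: "NoSchedule"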

Some Special Cases

Let’s take a look at another toleration example:
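tolerations:
- operator: "Exists"    # empty key, value, and effect – matches every taint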

As you see, the key, value, and effect fields of this Pod’s toleration are empty; we just used operator: Exists. This toleration matches all keys, values, and effects. In other words, it tolerates any taint, so the Pod can be scheduled onto any tainted node.

Let’s look at yet another example:
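tolerations:
- key: "special"
  operator: "Exists"    # no effect specified – matches any effect for the key "special"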

The toleration with the empty effect  in this Pod will match all effects with the key “special”.

Working with Multiple Taints and Tolerations

Kubernetes users can set multiple taints on a node. The process of matching tolerations against these taints then works like a filter: the system ignores the taints for which matching tolerations exist, and the remaining un-ignored taints have the indicated effect on the Pod.

Let’s illustrate how this works by applying several taints to our node:
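The exact keys and values don’t matter; for illustration, something like this (key1, value1, key2, and value2 are arbitrary examples):

kubectl taint nodes host1 key1=value1:NoSchedule
kubectl taint nodes host1 key1=value1:NoExecute
kubectl taint nodes host1 key2=value2:NoSchedule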

Next, let’s specify two tolerations in the Pod:
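Continuing with the arbitrary keys from the commands above:

tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"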

This Pod tolerates the first and the second taints but does not tolerate the third taint with the effect NoSchedule. Even though the Pod has two matching tolerations, it won’t be scheduled onto host1. The Pod will, however, continue running on this node if it was scheduled there before the taints were added, because the only taint it does not tolerate has the NoSchedule effect, which does not evict running Pods, while the NoExecute taint is tolerated.

NoExecute Effect

A taint with the NoExecute effect results in the eviction of all Pods without a matching toleration from the node. When using a toleration for the NoExecute effect, you can also specify an optional tolerationSeconds field. Its value defines how long a Pod that tolerates the taint can stay bound to the node after the taint is added. Let’s look at the manifest below:
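A sketch of the relevant part of such a manifest (the key and value are, again, arbitrary examples):

tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"
  tolerationSeconds: 3600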

If this Pod is running and a matching taint is added to the node, it will stay bound to the node for 3600 seconds and then be evicted. If the taint is removed before that time, the Pod won’t be evicted.

In general, the following rules apply for the NoExecute  effect:

  • Pods with no tolerations for the taint(s) are evicted immediately.
  • Pods with the toleration for the taint but that do not specify tolerationSeconds  in their toleration stay bound to the node forever.
  • Pods that tolerate the taint with a specified tolerationSeconds  remain bound for the specified amount of time.

TaintNodesByCondition Feature and Taint-Based Evictions

Kubernetes 1.6 introduced alpha support for representing node conditions. The node controller adds the conditions field that describes the status of all Running nodes. This feature is very useful for monitoring node health and configuring Pod behavior when a certain condition is met.

This feature is important for our discussion of taints and tolerations because, in recent versions of Kubernetes, nodes are automatically tainted by the node controller based on their conditions, and users can interact with these conditions to change the default Pod scheduling behavior. This feature is known as TaintNodesByCondition.

From Kubernetes 1.8 up to Kubernetes 1.11, the TaintNodesByCondition feature is available as alpha. This means that it is disabled by default in these versions. However, administrators can use feature gates to enable TaintNodesByCondition in these releases. Feature gates are a set of key=value pairs that allow enabling or disabling alpha or experimental features.

To enable TaintNodesByCondition in Kubernetes 1.8–1.11, you can use the --feature-gates command-line flag on each component (e.g., the kubelet) to turn a feature on or off. Ideally, you should set the feature gate --feature-gates=TaintNodesByCondition=true on the API server, scheduler, and controller manager.

Since Kubernetes 1.12, TaintNodesByCondition has been promoted to beta. It is now enabled by default, so you would need to use the --feature-gates flag to disable it if needed.

If TaintNodesByCondition  is enabled, the node controller will automatically add a taint depending on the node condition. For example, if the node is out of disk for some reason, the node lifecycle controller will add the node.kubernetes.io/out-of-disk  taint to the node to prevent it from attracting new Pods. The same checks are made against a node’s network availability, memory, and other critical scheduling parameters.

Correspondingly, the user can choose to ignore some of the node’s problems (represented as node conditions) by adding appropriate Pod tolerations. Note that TaintNodesByCondition only taints nodes with the NoSchedule effect. The NoExecute effect is controlled by TaintBasedEvictions, which is a beta feature enabled by default since version 1.13 (we’ll discuss it later).

Currently, the following conditions and taints are supported:

  • OutOfDisk: True if there is insufficient free space on the node for adding new Pods. The taint node.kubernetes.io/out-of-disk is added if the condition’s value is True.
  • Ready: False if the node is not healthy and is not accepting Pods, and Unknown if the node controller has not heard from the node in the last node-monitor-grace-period (default is 40 seconds). The taint node.kubernetes.io/not-ready is added if this condition is False, and the taint node.kubernetes.io/unreachable is added if the condition is Unknown.
  • MemoryPressure: True if the node memory is low; otherwise False. The taint node.kubernetes.io/memory-pressure is added if the value is True.
  • PIDPressure: True if there are too many processes on the node; otherwise False.
  • DiskPressure: True if the disk capacity is low; otherwise False. The taint node.kubernetes.io/disk-pressure is added if the condition is True.
  • NetworkUnavailable: True if the network for the node is not correctly configured; otherwise False. The taint node.kubernetes.io/network-unavailable is added if the condition is True.

There are some other taints too. For example, node.kubernetes.io/unschedulable  is added if the node is unschedulable and node.cloudprovider.kubernetes.io/uninitialized  is added to identify that the node is unusable until a controller from the cloud-controller-manager  initializes this node.

You can check the node conditions by running:
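kubectl describe nodes <your-node-name>

The output includes a Conditions section that looks roughly like this (columns trimmed; the exact layout varies by Kubernetes version):

Conditions:
  Type             Status
  ----             ------
  MemoryPressure   False
  DiskPressure     False
  PIDPressure      False
  Ready            True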

This response suggests that the Node is healthy.

As we’ve mentioned earlier, users can choose to ignore certain node problems by adding appropriate Pod tolerations. If your cluster has TaintNodesByCondition  enabled, you can then use tolerations to allow Pods to be scheduled onto nodes with these taints. For example:
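# tolerations section of the PodSpec
tolerations:
- key: "node.kubernetes.io/network-unavailable"
  operator: "Exists"
  effect: "NoSchedule"
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoSchedule"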

This Pod will be scheduled onto nodes even if they have the taints node.kubernetes.io/network-unavailable and node.kubernetes.io/not-ready applied by the node lifecycle controller.

Taint-based Evictions

Another useful feature for advanced Pod scheduling is taint-based evictions, introduced in Kubernetes 1.6 as alpha. Similarly to TaintNodesByCondition, TaintBasedEvictions allows taints to be added to nodes automatically based on node conditions. If this feature is enabled and the NoExecute effect is used, users can change the normal logic of Pod eviction based on the Ready NodeCondition.

In particular, users can choose how long a Pod should stay bound to a node that experiences one of the node conditions mentioned above. For example, you can specify how long the Pod should stay on an unreachable node with the automatically applied node.kubernetes.io/unreachable taint before it’s evicted.

In Kubernetes 1.6–1.12, users can set --feature-gates=TaintBasedEvictions=true to enable TaintBasedEvictions. Since 1.13, TaintBasedEvictions is beta and enabled by default.

Using taint-based evictions is quite simple. Let’s say you have an application that has accumulated a lot of local state. Ideally, you would want it to stay bound to an unreachable node, hoping that the node controller will hear from the node again soon. In this case, you can use the following Pod toleration:
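tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 6000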

This toleration keeps the Pod bound to the unreachable node for 6000 seconds before it is evicted.

Note: if taint-based evictions are enabled, Kubernetes automatically adds a default toleration for node.kubernetes.io/not-ready with tolerationSeconds=300. A user-defined toleration for this taint overrides the default behavior.

Implementing Dedicated Nodes using Taints and Node Affinity

At the beginning of the article, we mentioned that taints and tolerations are very convenient for creating dedicated nodes. Let’s implement this use case with the example below.

Step 1: Add Taints

The first step is to add a taint to one or more nodes. The Pods that are allowed to use these nodes will tolerate the taint. We attach the same taint dedicated=devs:NoSchedule  to a couple of nodes to mark them as dedicated nodes for the Devs team.
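Assuming, for example, that the two nodes are named node1 and node2 (substitute your own node names):

kubectl taint nodes node1 dedicated=devs:NoSchedule
kubectl taint nodes node2 dedicated=devs:NoSchedule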

Step 2: Add a Label to the Nodes

Next, we need to attach a label to the same set of nodes (we’ll use the same key/value pair for the node label as we used for the taint). Thus, Pods that tolerate the taint must also use the node affinity or nodeSelector  matching the label to be able to run on those nodes. Let’s go ahead and attach the label to both nodes.
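Using the same example node names:

kubectl label nodes node1 dedicated=devs
kubectl label nodes node2 dedicated=devs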

Step 3: Add Tolerations and Affinity

Finally, we need to create a Pod with the matching toleration for the above-created taint and the appropriate node affinity’s nodeSelectorTerms  that matches the node’s label we created in the second step. The manifest will look something like this:
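apiVersion: v1
kind: Pod
metadata:
  name: dev-pod                # placeholder name
spec:
  containers:
  - name: app
    image: nginx               # placeholder image
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "devs"
    effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: dedicated
            operator: In
            values:
            - devs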

As you see, we used the “hard” requiredDuringSchedulingIgnoredDuringExecution node affinity to ensure that the Pod is scheduled only onto nodes with the dedicated=devs label. We also specified one toleration with the key “dedicated”, the value “devs”, and the effect “NoSchedule”. This toleration matches the taint we attached to the two nodes above.

Since this Pod also has an affinity label selector that matches the node label, it will be scheduled only onto the two nodes we tainted and labeled. This makes those nodes dedicated to such Pods. The Devs team can then create Pods with this toleration and affinity to ensure that they use only the two nodes allocated to them by the IT department.

Conclusion

That’s it! Now your study of advanced Pod scheduling in Kubernetes is complete. In the course of these two tutorials, you’ve learned about such advanced scheduling concepts as nodeSelector, affinity, and taints and tolerations. Combining these features allows you to implement interesting use cases like the dedicated nodes we discussed above. Also, knowledge of recently added Kubernetes features such as taint-based evictions and TaintNodesByCondition gives you a powerful tool to control how Pods are evicted from and scheduled onto nodes based on their health conditions.