In the earlier tutorial, you learned how to assign Pods to nodes in Kubernetes using the nodeSelector and affinity features. As we tried to demonstrate, affinity works well for such use cases as creating dedicated nodes, distributing Pods evenly across the cluster, or co-locating Pods on the same machine. In this tutorial, we'll discuss Kubernetes taints and tolerations, another feature for advanced Pod scheduling in Kubernetes.
Taints are used to repel Pods from specific nodes. This is quite similar to the node anti-affinity discussed in the previous blog, but taints and tolerations take a slightly different approach. Instead of applying a label to a node, we apply a taint that tells the scheduler to keep Pods away from that node unless they tolerate the taint. Only Pods that have a toleration for the taint can be scheduled onto a node with that taint.
However, why would you need this feature if similar behavior can be achieved with a custom implementation of node anti-affinity using logical operators like NotIn and DoesNotExist? As we'll try to show in this article, taints together with tolerations allow for more fine-grained control over Pod eviction and anti-affinity than custom node anti-affinity with logical operators. You'll also learn how automatic tainting of nodes based on node conditions, combined with tolerations, can be leveraged to control Pod behavior on nodes experiencing such problems as network unavailability, low disk space, or low memory. Let's get started!
Before even studying how taints and tolerations work, you probably would like to know how they can improve your K8s cluster administration. In general, taints and tolerations support such use cases as creating dedicated nodes and controlling the eviction of Pods from nodes experiencing problems such as memory pressure or network unavailability (taint-based evictions).
The process of adding taints to nodes and tolerations to Pods is similar to how node affinity works.
First, we add a taint to a node that should repel certain Pods. For example, if your node's name is host1, you can add a taint using the following command:
kubectl taint nodes host1 special=true:NoSchedule
node "host1" tainted
The taint has the format <taintKey>=<taintValue>:<taintEffect>. Thus, the taint we just created has the key "special", the value "true", and the taint effect NoSchedule. A taint's key and value can be arbitrary strings, and the taint effect must be one of the supported effects: NoSchedule, PreferNoSchedule, or NoExecute.
Taint effects define how a tainted node reacts to Pods that don't tolerate the taint. For example, the NoSchedule effect means that unless a Pod has a matching toleration, it won't be scheduled onto host1. PreferNoSchedule is the "soft" version of NoSchedule: the system will try not to place a Pod that does not tolerate the taint on the node, but this is not guaranteed. Finally, if the NoExecute effect is applied, the node controller immediately evicts all Pods without a matching toleration from the node.
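For illustration only (the rest of this tutorial assumes host1 carries just the special=true:NoSchedule taint), the other two effects would be applied with the same command syntax:

# illustrative commands; don't run these if you are following along with the examples below
kubectl taint nodes host1 special=true:PreferNoSchedule
kubectl taint nodes host1 special=true:NoExecute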
You can verify that the taint was applied by running kubectl describe nodes <your-node-name> and checking the Taints section of the response:
kubectl describe nodes host1
Name:               host1
Roles:              master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    disktype=ssd
                    kubernetes.io/hostname=host1
                    node-role.kubernetes.io/master=
Annotations:        node.alpha.kubernetes.io/ttl=0
                    volumes.kubernetes.io/controller-managed-attach-detach=true
CreationTimestamp:  Wed, 09 Jan 2019 12:49:20 +0200
Taints:             special=true:NoSchedule
Awesome! As you see, our new taint special=true:NoSchedule was successfully added.
If you don’t need a taint anymore, you can remove it like this:
kubectl taint nodes host1 special:NoSchedule-
node "host1" untainted
In this command, you specify the taint key and the corresponding effect to remove, followed by a minus sign.
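If you want to drop every taint with a given key regardless of its effect, you can also omit the effect. A small illustrative variant, assuming the "special" key from above:

# removes all taints with the key "special" from host1
kubectl taint nodes host1 special-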
As you already know, taints and tolerations work together. Without a toleration, no Pod can be scheduled onto a node with a taint. That's not what we're trying to achieve! Let's now create a Pod with a toleration for the taint we created above. Tolerations are specified in the PodSpec:
apiVersion: v1
kind: Pod
metadata:
  name: pod-1
  labels:
    security: s1
spec:
  containers:
  - name: bear
    image: supergiantkir/animals:bear
  tolerations:
  - key: "special"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
As you see, the Pod's toleration has the key "special", the value "true", and the effect "NoSchedule", which exactly matches the taint we applied earlier. Therefore, this Pod can be scheduled onto host1. However, this does not mean that the Pod will be scheduled onto that exact node, because we did not use node affinity or nodeSelector.
The second Pod below can also be scheduled onto the tainted node, although it uses the operator "Exists" and does not define a value for the key.
apiVersion: v1
kind: Pod
metadata:
  name: pod-2
  labels:
    security: s1
spec:
  containers:
  - name: bear
    image: supergiantkir/animals:bear
  tolerations:
  - key: "special"
    operator: "Exists"
    effect: "NoSchedule"
This demonstrates the following rule: if the operator is Exists, the toleration matches the taint when the keys and effects are the same (no value should be specified). However, if the operator is Equal, the toleration's and taint's values must also be equal.
To recap the operator "Equal": the toleration's and taint's keys, values, and effects must all be equal. Therefore, only a Pod whose toleration has the same key, value, and effect as the taint can be scheduled onto host1.
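For example, the following hypothetical toleration would not match the special=true:NoSchedule taint, because the value differs:

tolerations:
- key: "special"
  operator: "Equal"
  value: "false"      # does not equal the taint's value "true", so the taint still applies
  effect: "NoSchedule"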
Let’s take a look at another toleration example:
apiVersion: v1
kind: Pod
metadata:
  name: pod-2
  labels:
    security: s1
spec:
  containers:
  - name: bear
    image: supergiantkir/animals:bear
  tolerations:
  - operator: "Exists"
As you see, the key, value, and effect fields of this Pod's toleration are empty; we only used operator: Exists. Such a toleration matches all keys, values, and effects. In other words, the Pod tolerates any taint and can be scheduled onto any tainted node.
Let’s look at yet another example:
apiVersion: v1
kind: Pod
metadata:
  name: pod-2
  labels:
    security: s1
spec:
  containers:
  - name: bear
    image: supergiantkir/animals:bear
  tolerations:
  - operator: "Exists"
    key: "special"
The toleration with the empty effect in this Pod will match all effects of taints with the key "special".
Kubernetes users can set multiple taints on a node. Matching tolerations against these taints then works like a filter: the system ignores the taints for which matching tolerations exist, and the remaining un-ignored taints have their indicated effect on the Pod.
Let’s illustrate how this works by applying several taints to our node:
kubectl taint nodes host1 key1=value1:NoSchedule
kubectl taint nodes host1 key1=value1:NoExecute
kubectl taint nodes host1 key2=value2:NoSchedule
Next, let’s specify two tolerations in the Pod:
apiVersion: v1
kind: Pod
metadata:
  name: pod-2
  labels:
    security: s1
spec:
  containers:
  - name: bear
    image: supergiantkir/animals:bear
  tolerations:
  - key: "key1"
    operator: "Equal"
    value: "value1"
    effect: "NoSchedule"
  - key: "key1"
    operator: "Equal"
    value: "value1"
    effect: "NoExecute"
This Pod tolerates the first and the second taint but does not tolerate the third taint with the NoSchedule effect. Even though the Pod has two matching tolerations, it won't be scheduled onto host1. The Pod will, however, continue running on the node if it was scheduled there before the taints were added, because it tolerates the NoExecute taint.
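If you did want this Pod to become schedulable onto host1 again, one option (a sketch, not part of the original example) is to add a third toleration for the remaining taint to the Pod's tolerations list:

  # additional entry under spec.tolerations, matching the un-tolerated key2=value2:NoSchedule taint
  - key: "key2"
    operator: "Equal"
    value: "value2"
    effect: "NoSchedule"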
A taint with the NoExecute effect results in the eviction of all Pods without a matching toleration from the node. When using a toleration for the NoExecute effect, you can also specify an optional tolerationSeconds field. Its value defines how long a Pod that tolerates the taint can keep running on the node after the taint is added, before it is evicted. Let's look at the manifest below:
apiVersion: v1
kind: Pod
metadata:
  name: pod-2
  labels:
    security: s1
spec:
  containers:
  - name: bear
    image: supergiantkir/animals:bear
  tolerations:
  - key: "key1"
    operator: "Equal"
    value: "value1"
    effect: "NoExecute"
    tolerationSeconds: 3600
If this Pod is running and a matching taint is added to the node, the Pod stays bound to the node for 3600 seconds and is then evicted. If the taint is removed before that time, the Pod won't be evicted.
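You can observe this behavior yourself by adding the NoExecute taint and watching the Pod (these illustrative commands assume the Pod above is already running on host1; it should be evicted roughly an hour after the taint is added, unless you remove the taint first):

kubectl taint nodes host1 key1=value1:NoExecute
kubectl get pods -o wide --watch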
In general, the following rules apply for the NoExecute effect: Pods that do not tolerate the taint are evicted immediately; Pods that tolerate the taint but do not specify tolerationSeconds stay bound to the node forever; and Pods that tolerate the taint and specify tolerationSeconds stay bound for the specified amount of time and are then evicted.
Kubernetes 1.6 introduced alpha support for representing node problems as node conditions. The node controller populates the conditions field, which describes the status of all running nodes. This is very useful for monitoring node health and for configuring Pod behavior when a certain condition is met.
This feature is important for our discussion of taints and tolerations because, in recent versions of Kubernetes, nodes are automatically tainted by the node controller based on their conditions, and users can react to these conditions to change the default Pod scheduling behavior. This feature is known as TaintNodesByCondition.
From Kubernetes 1.8 to Kubernetes 1.11, the TaintNodesByCondition feature was available as alpha, which means it was disabled by default in these versions. However, administrators could use feature gates to enable TaintNodesByCondition in these releases. Feature gates are a set of key=value pairs that allow enabling or disabling alpha or experimental features.
To enable TaintNodesByCondition in Kubernetes 1.8-1.11, you can use the --feature-gates command-line flag on each component (e.g., the kubelet) to turn the feature on or off. Ideally, you should set --feature-gates=TaintNodesByCondition=true on the API server, scheduler, and controller manager.
Since Kubernetes 1.12, TaintNodesByCondition has been promoted to beta and is enabled by default, so you only need the --feature-gates flag if you want to disable it.
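For example, on clusters where you manage the control plane flags directly, the gate might be passed like this (how flags are set depends on how your cluster was deployed, e.g. systemd units or static Pod manifests, so treat this as a sketch; the ellipses stand for each component's other flags):

kube-apiserver --feature-gates=TaintNodesByCondition=true ...
kube-scheduler --feature-gates=TaintNodesByCondition=true ...
kube-controller-manager --feature-gates=TaintNodesByCondition=true ...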
If TaintNodesByCondition is enabled, the node controller automatically adds a taint corresponding to the node condition. For example, if the node runs out of disk space, the node lifecycle controller adds the node.kubernetes.io/out-of-disk taint to the node to prevent it from attracting new Pods. The same checks are made against a node's network availability, memory, and other parameters critical for scheduling.
Correspondingly, a user can choose to ignore some of a node's problems (represented as node conditions) by adding the appropriate Pod tolerations. Note that TaintNodesByCondition only taints nodes with the NoSchedule effect. The NoExecute effect is controlled by TaintBasedEvictions, a beta feature enabled by default since version 1.13 (we'll discuss it later).
Currently, the node controller creates the following taints based on node conditions: node.kubernetes.io/not-ready when Ready is False, node.kubernetes.io/unreachable when Ready is Unknown, node.kubernetes.io/out-of-disk when OutOfDisk is True, node.kubernetes.io/memory-pressure when MemoryPressure is True, node.kubernetes.io/disk-pressure when DiskPressure is True, node.kubernetes.io/network-unavailable when NetworkUnavailable is True, and node.kubernetes.io/pid-pressure when PIDPressure is True.
There are some other taints, too. For example, node.kubernetes.io/unschedulable is added if the node is marked unschedulable, and node.cloudprovider.kubernetes.io/uninitialized is added to indicate that the node is unusable until a controller from the cloud-controller-manager initializes it.
You can check the node conditions by running:
kubectl get nodes -o jsonpath='{range .items[*]}{@.metadata.name}{"\n"}{range @.status.conditions[*]}{@.type}={@.status}{"\n"}{end}{end}'
host1
OutOfDisk=False
MemoryPressure=False
DiskPressure=False
Ready=True
PIDPressure=False
This response suggests that the Node is healthy.
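To see which taints, if any, the node controller has applied as a result of these conditions, you can also inspect the node's spec directly (an illustrative command):

# prints the taints currently set on host1, including automatically added ones
kubectl get node host1 -o jsonpath='{.spec.taints}'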
As we’ve mentioned earlier, users can choose to ignore certain node problems by adding appropriate Pod tolerations. If your cluster has TaintNodesByCondition enabled, you can then use tolerations to allow Pods to be scheduled onto nodes with these taints. For example:
kind: Pod
apiVersion: v1
metadata:
  name: test
spec:
  tolerations:
  - key: node.kubernetes.io/network-unavailable
    operator: Exists
    effect: NoSchedule
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoSchedule
  containers:
  - name: nginx
    image: nginx
This Pod can be scheduled onto nodes even if they carry the node.kubernetes.io/network-unavailable and node.kubernetes.io/not-ready taints applied by the node lifecycle controller.
Another useful feature for advanced Pod scheduling is taint-based evictions, introduced in Kubernetes 1.6 as alpha. Similarly to TaintNodesByCondition, TaintBasedEvictions automatically adds taints to nodes based on node conditions. If this feature is enabled and the NoExecute effect is used, users can change the normal logic of Pod eviction based on the Ready node condition.
In particular, users can choose how long a Pod should stay bound to a node that experiences one of the node conditions mentioned above. For example, you can specify how long a Pod should stay on an unreachable node carrying the automatically applied node.kubernetes.io/unreachable taint before it's evicted.
In Kubernetes 1.6-1.12, users can set --feature-gates=TaintBasedEvictions=true to enable TaintBasedEvictions. Since 1.13, TaintBasedEvictions is beta and enabled by default.
Using taint-based evictions is quite simple. Let's say you have an application that has accumulated a lot of local state. Ideally, you would want it to stay bound to an unreachable node in the hope that the node controller will hear from the node again soon. In this case, you can use the following Pod toleration:
apiVersion: v1
kind: Pod
metadata:
  name: pod-2
  labels:
    security: s1
spec:
  containers:
  - name: bear
    image: supergiantkir/animals:bear
  tolerations:
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 6000
This toleration keeps the Pod bound to the unreachable node for 6000 seconds before it is evicted.
Note: if taint-based evictions are enabled, Kubernetes automatically adds default tolerations for node.kubernetes.io/not-ready and node.kubernetes.io/unreachable with tolerationSeconds=300, unless the Pod already defines a toleration for these keys. A user-defined toleration therefore overrides this default behavior.
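In a Pod's spec, these automatically injected tolerations look roughly like this (a sketch of what the DefaultTolerationSeconds admission plugin adds):

tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300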
At the beginning of the article, we mentioned that taints and tolerations are very convenient for creating dedicated nodes. Let’s implement this use case with the example below.
The first step is to add a taint to one or more nodes. The Pods that are allowed to use these nodes will tolerate the taint. We attach the same taint dedicated=devs:NoSchedule to a couple of nodes to mark them as dedicated nodes for the Devs team.
kubectl taint nodes host1 dedicated=devs:NoSchedule
node "host1" tainted
kubectl taint nodes host2 dedicated=devs:NoSchedule
node "host2" tainted
Next, we need to attach a label to the same set of nodes (we'll use the same key/value pair for the node label as we used for the taint). Pods that tolerate the taint must then also use node affinity or a nodeSelector matching this label to be able to run on those nodes. Let's go ahead and attach the label to both nodes.
kubectl label nodes host1 dedicated=devs
node "host1" labeled
kubectl label nodes host2 dedicated=devs
node "host2" labeled
Finally, we need to create a Pod with a matching toleration for the taint created above and node affinity nodeSelectorTerms that match the node label we attached in the second step. The manifest will look something like this:
apiVersion: v1
kind: Pod
metadata:
  name: pod-test
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: dedicated
            operator: In
            values:
            - devs
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "devs"
    effect: "NoSchedule"
  containers:
  - name: just-container
    image: supergiantkir/animals:bear
As you see, we used the "hard" requiredDuringSchedulingIgnoredDuringExecution node affinity to ensure that the Pod is scheduled only onto nodes with the dedicated=devs label. We also specified a toleration with the key "dedicated", the value "devs", and the effect "NoSchedule". This toleration matches the taint on the two nodes we tainted above.
Since this Pod also has an affinity label selector that matches the node label, it will be scheduled only onto those two nodes, which effectively makes them dedicated to such Pods. The Devs team can then create Pods with this toleration and node affinity to ensure they run only on the two nodes allocated to them by the IT department.
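You can double-check which node the Pod actually landed on with a quick, illustrative command:

# the NODE column should show host1 or host2
kubectl get pod pod-test -o wide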
That's it! Your study of advanced Pod scheduling in Kubernetes is now complete. In the course of these two tutorials, you've learned such advanced scheduling concepts as nodeSelector, affinity, and taints and tolerations. Combining these features allows you to implement interesting use cases such as the dedicated nodes discussed above. Also, knowledge of recently added Kubernetes features such as taint-based evictions and TaintNodesByCondition gives you a powerful tool to control how Pods are evicted from and scheduled onto nodes based on node health conditions.