
Making Sense of Kubernetes Jobs

A Kubernetes Job is a special controller that creates one or more pods and manages them in the process of doing some finite work. Jobs track pods’ successful completions and reschedule pods if they fail or are terminated due to a node hardware failure or a node reboot. Kubernetes also comes with native support for parallel Jobs, which let you distribute a workload between multiple worker pods or perform the same task multiple times until the completions count is reached. The ability to reschedule failed pods and the built-in parallelism make Kubernetes Jobs a great solution for parallel and batch processing and for managing work queues in your applications.

In this article, we’re going to discuss the architecture of and use cases for Kubernetes Jobs and walk you through simple examples demonstrating how to create and run your custom Jobs. Let’s get started!

Why Do Kubernetes Jobs Matter?

Let’s assume you have the task of calculating all prime numbers between 1 and 110 using a bash script and Kubernetes. The algorithm for calculating prime numbers is not that difficult, and we could easily create a Pod with a bash command implementing it. However, using a bare pod for this kind of operation might run into several problems.

First, the node on which your pod is running may suddenly shut down due to a hardware failure or connection issues. Consequently, the pod running on this node will also cease to exist.

Secondly, if we were to calculate all prime numbers ranging from 1 to 10,000, for example, doing this in a single bash instance would be very slow. The alternative would be to split this range into several batches and assign those to multiple pods. To take a real-world example, we could create a work queue in some key-value store like Redis and make our worker pods process items in this queue until it is empty. Using bare pods to accomplish that would be no big deal if we just needed 3-4 pods, but it would become much harder if the work queue were large enough (e.g., thousands of emails, files, and messages to process). And even if this could be done with manually created pods, the first problem would still not be solved.

So what is the solution? Enter Kubernetes Jobs! They elegantly solve the above-mentioned problems. On the one hand, Jobs allow rescheduling pods to another node if the one they were running on fails. On the other hand, Kubernetes Jobs support pod parallelism with multiple pods performing connected tasks in parallel. In what follows, we will walk you through a simple tutorial that will teach you how to leverage these features of Kubernetes jobs.

Tutorial

To complete the examples in this tutorial, you’ll need the following prerequisites:

  • a running Kubernetes cluster. See the Supergiant GitHub wiki for more information about deploying a Kubernetes cluster with Supergiant. As an alternative, you can install a single-node Kubernetes cluster on a local system using Minikube.
  • the kubectl command line tool installed and configured to communicate with the cluster. See how to install kubectl here.

In the example below, we create a Job to calculate prime numbers between 0 and 110. Let’s define the Job spec first:
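Here is a minimal sketch of such a spec (the trial-division bash script below is just one possible implementation; any script that prints the primes will do):

apiVersion: batch/v1
kind: Job
metadata:
  name: primes
spec:
  backoffLimit: 4
  template:
    spec:
      containers:
      - name: primes
        image: ubuntu
        command: ["bash", "-c"]
        args:
        - |
          # Print all prime numbers up to 110 using trial division
          for ((n=2; n<=110; n++)); do
            is_prime=1
            for ((d=2; d*d<=n; d++)); do
              if ((n % d == 0)); then
                is_prime=0
                break
              fi
            done
            if ((is_prime)); then echo "$n"; fi
          done
      restartPolicy: Never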

As you see, the Job uses the batch/v1 apiVersion, which is the first major difference from bare pods and Deployments. However, Jobs use the same PodTemplateSpec as Deployments and other controllers. In our case, we defined a pod running the ubuntu container from the public Docker Hub repository. Inside the container, we use the bash command provided by the image with a script that calculates prime numbers.

Also, we set the spec.template.spec.restartPolicy parameter to Never to prevent the pod from restarting once the operation is completed. Finally, the field .spec.backoffLimit specifies the number of retries before the Job is considered failed. This is useful when you want a Job to fail after some number of retries due to, for example, a logical error in the configuration. The default value of .spec.backoffLimit is 6.

Let’s save this spec in job-prime.yaml and create the Job by running the following command:
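kubectl create -f job-prime.yaml
# job.batch/primes created   (exact output format varies by kubectl version)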

Next, let’s check the status of the running job:
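Assuming the Job is named primes as in the spec sketched above, the output is abridged and illustrative; exact fields and values will differ:

kubectl describe job primes

Name:           primes
Namespace:      default
Parallelism:    1
Completions:    1
Pods Statuses:  0 Running / 1 Succeeded / 0 Failed
...
Events:
  Type    Reason            Age   From            Message
  ----    ------            ----  ----            -------
  Normal  SuccessfulCreate  1m    job-controller  Created pod: primes-bwdt7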

Pay attention to several important fields in this description. In particular, the key Parallelism has a value of 1 (the default), indicating that only one pod was started to do this job. In turn, the key Completions tells us that the Job made one successful completion of the task (i.e., the prime numbers calculation). Since the pod successfully completed the task, the Job was completed as well. Let’s verify this by running:
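On recent kubectl versions, a completed Job shows 1/1 in the COMPLETIONS column (the columns and values below are illustrative and depend on your kubectl version):

kubectl get jobs
# NAME     COMPLETIONS   DURATION   AGE
# primes   1/1           63s        2m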

You can also easily check the prime numbers calculated by the bash script. At the bottom of the Job description, find the name of the pod created by the Job (in our case, it is a pod named primes-bwdt7; pod names are formatted as [JOB_NAME]-[HASH_VALUE]). Let’s check the logs of this pod:
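Substitute your own pod name; the output is the list of primes (shortened here):

kubectl logs primes-bwdt7
# 2
# 3
# 5
# 7
# 11
# ...
# 109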

That’s it! The pod created by our Job has successfully calculated all prime numbers between 0 and 110. The example above represents a non-parallel Job. In this type of Job, just one pod is started unless it fails, and the Job is completed as soon as the pod terminates successfully.

However, the Job controller also supports parallel Jobs which can create several pods working on the same task. There are two types of parallel jobs in Kubernetes: jobs with a fixed completions count and parallel jobs with a work queue. Let’s discuss both of them.

Jobs with a Fixed Completions Count

Jobs with a fixed completions count create one or more pods sequentially, and each pod has to complete the work before the next one is started. This type of Job needs to specify a non-zero positive value for .spec.completions, which refers to the number of successful completions required. A Job is considered complete when there is one successful pod for each value in the range 1 to .spec.completions (in other words, each pod started should complete the task). Jobs of this type may or may not specify the .spec.parallelism value; if the field is not specified, it defaults to 1, so pods are started one at a time. Let’s test how this type of Job works using the same spec as above with some slight modifications:
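A sketch of the modified spec; the Job name primes-2 is our choice for this example, and the script is the same as before:

apiVersion: batch/v1
kind: Job
metadata:
  name: primes-2
spec:
  completions: 3
  backoffLimit: 4
  template:
    metadata:
      labels:
        app: primes
    spec:
      containers:
      - name: primes
        image: ubuntu
        command: ["bash", "-c"]
        args:
        - |
          # Same illustrative prime-printing script as above
          for ((n=2; n<=110; n++)); do
            is_prime=1
            for ((d=2; d*d<=n; d++)); do
              if ((n % d == 0)); then
                is_prime=0
                break
              fi
            done
            if ((is_prime)); then echo "$n"; fi
          done
      restartPolicy: Never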

The only major change we made is adding the .spec.completions field set to 3, which asks Kubernetes to start 3 pods to perform the same task. Also, we set the app: primes label for our pods to access them in kubectl later.

Now, let’s open two terminal windows.

In the first terminal, we are going to watch the pods created:
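The app=primes label lets us select only the pods belonging to our Job:

kubectl get pods -l app=primes --watch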

Save this spec in job-prime-2.yaml and create the Job by running the following command in the second terminal:
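kubectl create -f job-prime-2.yaml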

Next, let’s watch what’s happening in the first terminal window:
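You should see something roughly like this (pod names, hashes, and timings are illustrative):

NAME             READY   STATUS              RESTARTS   AGE
primes-2-7f5qd   0/1     ContainerCreating   0          0s
primes-2-7f5qd   1/1     Running             0          3s
primes-2-7f5qd   0/1     Completed           0          7s
primes-2-9xkwt   0/1     ContainerCreating   0          7s
primes-2-9xkwt   1/1     Running             0          10s
primes-2-9xkwt   0/1     Completed           0          14s
primes-2-kv2b8   0/1     ContainerCreating   0          14s
primes-2-kv2b8   1/1     Running             0          17s
primes-2-kv2b8   0/1     Completed           0          21s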

As you see, the Job controller started three pods sequentially, waiting for the current pod to complete the operation before starting the next one.

For more details, let’s check the status of the Job again:
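Again assuming our chosen Job name (abridged, illustrative output):

kubectl describe job primes-2

Name:           primes-2
Namespace:      default
Parallelism:    1
Completions:    3
Pods Statuses:  0 Running / 3 Succeeded / 0 Failed
...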

As you see, all three pods successfully completed the task and exited. We can verify that the Job was completed as well by running:
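kubectl get jobs
# NAME       COMPLETIONS   DURATION   AGE
# primes     1/1           63s        10m
# primes-2   3/3           21s        1m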

If you look into the logs of each pod, you’ll see that each of them completed the prime numbers calculation successfully (to check logs, take the pod name from the Job description):
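kubectl logs primes-2-7f5qd   # repeat with each of your three pod names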

That’s it! Jobs with a fixed completions count are very useful when you want to perform the same task multiple times. However, what about a scenario where we want one task to be completed in parallel by several pods? Enter parallel Jobs with a work queue!

Parallel Jobs with a Work Queue

Parallel Jobs with a work queue can create several pods that coordinate among themselves or with an external service to determine which part of the job each of them should work on. If your application has a work queue backed by some remote data storage, for example, this type of Job can create several parallel worker pods that independently access the work queue and process it. Parallel Jobs with a work queue come with the following features and requirements:

  • for this type of Job, you should leave .spec.completions unset.
  • each worker pod created by the Job is capable of assessing whether or not all its peers are done and, thus, that the entire Job is done (e.g., each pod can check if the work queue is empty and exit if so).
  • when any pod terminates with success, no new pods are created.
  • once at least one pod has exited with success and all pods are terminated, the Job completes with success as well.
  • once any pod has exited with success, the other pods should stop doing any work and should also start exiting.

Let’s add parallelism to the previous Job spec to see how this type of Job works:
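A sketch of that spec; as before, the Job name primes-3 is our choice for this example:

apiVersion: batch/v1
kind: Job
metadata:
  name: primes-3
spec:
  parallelism: 3
  backoffLimit: 4
  template:
    metadata:
      labels:
        app: primes
    spec:
      containers:
      - name: primes
        image: ubuntu
        command: ["bash", "-c"]
        args:
        - |
          # Same illustrative prime-printing script as above
          for ((n=2; n<=110; n++)); do
            is_prime=1
            for ((d=2; d*d<=n; d++)); do
              if ((n % d == 0)); then
                is_prime=0
                break
              fi
            done
            if ((is_prime)); then echo "$n"; fi
          done
      restartPolicy: Never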

The only difference from the previous spec is that we omitted the .spec.completions field and added the .spec.parallelism field with its value set to 3.

Now, let’s open two terminal windows as in the previous example. In the first terminal, watch the pods:
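kubectl get pods -l app=primes --watch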

Let’s save the spec in job-prime-3.yaml and create the Job in the second terminal:
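kubectl create -f job-prime-3.yaml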

Next, let’s see what’s happening in the first terminal window:
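This time all three pods start at once (again, pod names and timings are illustrative):

NAME             READY   STATUS              RESTARTS   AGE
primes-3-bkqw4   0/1     ContainerCreating   0          1s
primes-3-d5vvp   0/1     ContainerCreating   0          1s
primes-3-mdhxb   0/1     ContainerCreating   0          1s
primes-3-bkqw4   1/1     Running             0          3s
primes-3-d5vvp   1/1     Running             0          3s
primes-3-mdhxb   1/1     Running             0          4s
primes-3-bkqw4   0/1     Completed           0          8s
primes-3-d5vvp   0/1     Completed           0          8s
primes-3-mdhxb   0/1     Completed           0          9s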

As you see, the Job controller created three pods simultaneously. Each pod calculated the prime numbers in parallel, and once all of them completed the task, the Job was successfully completed as well.

Let’s see more details in the Job description:
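Note that with .spec.completions unset, the description shows it as <unset> (abridged, illustrative output):

kubectl describe job primes-3

Name:           primes-3
Namespace:      default
Parallelism:    3
Completions:    <unset>
Pods Statuses:  0 Running / 3 Succeeded / 0 Failed
...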

As you see, all three pods succeeded in performing the task. You can also check the pods’ logs to see the calculation results:
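kubectl logs primes-3-bkqw4   # repeat with each of your three pod names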

From the logs above, it may look like our Job with a work queue acted the same way as the Job with a fixed completions count. Indeed, all three pods created by the Job calculated prime numbers in the range of 1-110. The difference, however, is that in this example all three pods did their work in parallel. If we had created a work queue for our worker pods and some script to process items in that queue, we could make the pods access different batches of numbers (or messages, emails, etc.) in parallel until no items were left in the queue. In this example, we don’t have a work queue or a script to process it, which is why all three pods created by the Job performed the same task to completion. Still, this example is enough to illustrate the main feature of this type of Job: parallelism.

[Image: K8s Jobs]

In a real-world scenario, we could imagine a Redis list with some work items (e.g., messages or emails) in it and three parallel worker pods created by the Job (see the image above). Each pod could have a script that requests a new message from the list, processes it, and checks whether there are more work items left. If no more work items exist in the list, the pod accessing it would exit with success, telling the controller that the work was successfully done. This notification would cause the other pods to exit as well and the entire Job to complete. Given this functionality, parallel Jobs with a work queue are extremely powerful for processing large volumes of data with multiple workers doing their tasks in parallel.
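As a purely hypothetical sketch, a worker script driving such a pod could look like the following; the Redis host name (redis) and the list name (work-queue) are assumptions made for illustration:

#!/bin/bash
# Hypothetical worker loop: pop items from a Redis list until it is empty.
# Assumes a Redis service reachable at host "redis" and a list named "work-queue".
while true; do
  item=$(redis-cli -h redis rpop work-queue)
  if [ -z "$item" ]; then
    # The queue is empty: exit with success so the Job can complete
    exit 0
  fi
  echo "Processing: $item"
  # ... actual processing of the work item goes here ...
done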

Cleaning Up

As our tutorial is over, let’s clean up all resources:

Delete the Jobs:
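kubectl delete -f job-prime.yaml -f job-prime-2.yaml -f job-prime-3.yaml
# job.batch "primes" deleted
# job.batch "primes-2" deleted
# job.batch "primes-3" deleted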

Deleting a Job will cause all associated pods to be deleted as well.

Also, delete all files with the Job specs if you don’t need them anymore.

Conclusion

As you have learned, Kubernetes Jobs are extremely powerful for parallel computation and batch processing of diverse workloads. However, one should remember that the Job object does not support closely communicating parallel processes of the kind commonly found in scientific computing. A Job’s basic use case is parallel or sequential processing of independent but related work items such as messages, emails, numbers, or files. Whenever you need batch processing functionality in your Kubernetes apps, Jobs will help you implement it, but you’ll need to design your own work queue and a script to process it. In the next tutorial, we’ll walk you through several Job design patterns that will help you address a number of real-world scenarios for batch processing.