
Understanding Kubernetes Kube-Proxy

Kubernetes is a complex system with multiple components interacting with each other in intricate ways. As you may already know, Kubernetes consists of master and node components.

Master components such as kube-scheduler, kube-controller-manager, etcd, and kube-apiserver are part of the Kubernetes Control Plane that runs on the K8s master node(s). The Control Plane is responsible for managing the cluster lifecycle, providing access to the K8s API, persisting cluster data (etcd), and maintaining the desired cluster state.

In turn, node components such as the kubelet, the container runtime (e.g., Docker), and kube-proxy run on the nodes and are responsible for managing containerized workloads (kubelet) and for managing Services and enabling communication between Pods (kube-proxy).

Kube-proxy is one of the most important node components that participates in managing Pod-to-Service and External-to-Service networking. Kubernetes has great documentation about Services that mentions kube-proxy and its modes. However, we would like to discuss this component in depth using practical examples. This will help you understand how Kubernetes Services work under the hood and how kube-proxy manages them by interacting with the networking frameworks inside the Linux kernel. Let’s get started!

So What Are a Proxy and Kube-proxy?

A proxy server is any server/host that acts as an intermediary between clients requesting resources and the servers providing those resources. There are three basic types of proxy servers: (a) tunneling proxies; (b) forward proxies; and (c) reverse proxies.

A tunneling proxy passes unmodified requests from clients to servers on some network. It works as a gateway that enables packets from one network to access servers on another network.

A forward proxy is an Internet-facing proxy that mediates client connections to web resources/servers on the Internet. It manages outgoing connections and can service a wide range of resource types.

Finally, a reverse proxy is an internal-facing proxy. It may be thought of as a frontend that controls access to servers on a private network. A reverse proxy takes incoming requests and redirects them to some internal server without the client knowing which server it is actually accessing. This is often done to protect a private network against direct exposure to external users. Reverse proxies can also perform load balancing, authentication, caching, and/or decryption.

Kube-proxy

Kube-proxy is the closest to the reverse proxy model in its concept and design (at least in the userspace mode, as we'll see later). As a reverse proxy, kube-proxy is responsible for watching client requests to some IP:port and forwarding/proxying them to the corresponding service/application on the private network. However, unlike a typical reverse proxy, kube-proxy proxies requests to Kubernetes Services and their backend Pods rather than to hosts. There are some other important differences that we will discuss.

So, as we just noted, the kube-proxy proxies client requests to backend Pods managed by a Service. Its main task is to translate Virtual IPs of Services into IPs of backend Pods controlled by Services. This way, the clients accessing the Service do not need to know which Pods are available for that Service.
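For instance, you can see the backend Pod IPs that kube-proxy keeps in sync for a given Service by inspecting the Service's Endpoints object (the Service name below is just a placeholder):

# List the Pod IP:port pairs currently backing the Service;
# kube-proxy watches this object and updates its rules accordingly.
$ kubectl get endpoints <service-name>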

Kube-proxy can also work as a load balancer for the Service’s Pods. It can do simple TCP, UDP, and SCTP stream forwarding or round-robin TCP, UDP, and SCTP forwarding across a set of backends.

How Does Kube-proxy Handle NAT?

Network Address Translation (NAT) helps forward packets between different networks. More specifically, it allows packets originating from one network to find destinations on another network. In Kubernetes, we need some sort of NAT to translate Virtual IPs/Cluster IPs of Services into IPs of backend Pods.

However, by default, kube-proxy does not know how to implement this kind of network packet forwarding on its own. Moreover, it needs to account for the fact that Service endpoints, i.e., Pods, are constantly changing. Thus, kube-proxy needs to know the state of the Service network at any point in time to ensure that packets arrive at the right Pods. We will discuss how kube-proxy solves these two challenges in what follows.

Translating Service VIPs into Real IPs

When a new Service of the type “ClusterIP” is created, the system assigns a virtual IP to it. This IP is virtual because there is no network interface or MAC address associated with it. Thus, the network as a whole does not know how to route packets going to this VIP.

How then does kube-proxy know how to route traffic from this virtual IP to the correct Pod? On the Linux systems where Kubernetes runs, kube-proxy closely interacts with the Linux kernel packet-filtering framework called netfilter and its userspace configuration tool, iptables, to configure packet routing rules for this VIP. Let’s see how these tools work and how kube-proxy interacts with them.

Netfilter and iptables

Netfilter is a set of Linux kernel hooks that allow various kernel modules to register callback functions that intercept network packets and change their destination/routing. A registered callback function can be thought of as a set of rules tested against every packet passing through the network stack. So netfilter’s role is to provide an interface for software working with network rules to match packets against these rules. When a packet matching a rule is found, netfilter takes the specified action (e.g., redirects the packet). In general, netfilter and other components of the Linux networking framework enable packet filtering, network address and port translation (NAPT), and other packet mangling.

To set network routing rules in netfilter, kube-proxy uses the userspace program called iptables. This program can inspect, forward, modify, redirect, and/or drop IP packets. Iptables consists of five tables: raw, filter, nat, mangle, and security, which process packets at different stages of their journey through the network stack. In turn, each table has a set of chains – ordered lists of rules. For example, the filter table consists of the INPUT, OUTPUT, and FORWARD chains. When an incoming packet destined for the local system reaches the filter table, it is first processed by the INPUT chain.

Each chain includes individual rules that consist of condition(s) and corresponding action(s) to take when a condition is met. Here is an example of setting an iptables rule that blocks connections from a specific IP address, 15.15.15.51, in the INPUT chain of the filter table. A typical command for this looks as follows:
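# Append (-A) a rule to the INPUT chain of the filter table: drop (-j DROP)
# any packet whose source address (-s) is 15.15.15.51.
$ iptables -A INPUT -s 15.15.15.51 -j DROP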

Here, INPUT is the chain of the filter table where packets from the specified source IP (the target) are matched, and the corresponding action (dropping the packet) is taken.

Note: This is a very simplified picture of how iptables work though. If you want to learn more about iptables, check this excellent article from the Arch Linux wiki.

So, we have established that kube-proxy configures the netfilter Linux kernel framework via its userspace interface – iptables.

However, configuring routing rules is not enough.

IP addresses churn frequently in a containerized environment like Kubernetes. Therefore, kube-proxy has to watch the Kubernetes API for changes such as the creation or update of Services and the addition or removal of backend Pod IPs, and adjust the iptables rules accordingly so that traffic to the virtual IP always reaches a correct Pod. The details of how VIPs are translated into real Pod IPs differ depending on the kube-proxy mode selected. Let’s discuss these modes now.

Kube-proxy Modes

Kube-proxy can work in three different modes:

  • userspace
  • iptables
  • and IPVS.

Why do we need all these modes? Well, they differ in how kube-proxy interacts with the Linux userspace and kernelspace and in what roles these spaces play in packet routing and in load balancing traffic to a Service’s backends. To make the discussion clear, you should understand the difference between userspace and kernelspace.

Userspace vs. Kernelspace

In Linux, system memory can be divided into two distinct areas: kernel space and user space.

The core of the operating system, known as the kernel, executes its commands and provides OS services in kernelspace. All user software and processes installed by users run in userspace. When they need CPU time for computations, disk access for I/O operations, or want to fork a process, they issue system calls to the kernel asking for its services.

In general, kernelspace modules and processes are much faster than userspace processes because they interact with the system’s hardware directly. Because userspace programs have to go through the kernel to access system services, they are comparatively slower.

Figure: Userspace vs. kernelspace (source: Red Hat).

Now that you understand the implications of userspace vs. kernelspace, we can discuss all kube-proxy modes.

Userspace Proxy Mode

In the userspace mode, most networking tasks, including setting packet rules and load balancing, are performed directly by kube-proxy operating in userspace. In this mode, kube-proxy comes closest to the role of a reverse proxy: listening for traffic, routing it, and load balancing between traffic destinations. Also, in the userspace mode, kube-proxy must frequently switch context between userspace and kernelspace as it interacts with iptables and performs load balancing.

Proxying traffic between the VIPs and backend Pods in the userspace mode is done in four steps:

  • kube-proxy watches for the creation/deletion of Services and their Endpoints (backend Pods).
  • When a new Service of type ClusterIP is created, kube-proxy opens a random port on the node. The aim is to proxy any connection to this port to one of the Service’s backend Endpoints. The choice of the backend Pod is based on the SessionAffinity setting of the Service.
  • kube-proxy installs iptables rules that intercept traffic to the Service’s VIP and Service Port and redirect that traffic to the host port opened in the step above.
  • When the redirected traffic gets to the node’s port, kube-proxy works as a load balancer distributing traffic across the backend Pods. The choice of the backend Pod is round robin by default.

As you see, in this mode kube-proxy works as a userspace proxy that opens a proxy port, listens on it, and redirects packets from the port to the backend Pods.

This approach involves much context-switching, however. Kube-proxy has to switch to kernelspace when traffic to the VIP is redirected to the proxy port and then back to userspace to load balance across the set of backend Pods. This is because it does not install iptables rules for load balancing between Service endpoints/backends; load balancing is done directly by kube-proxy in userspace. As a result of this frequent context-switching, the userspace mode is not as fast and scalable as the other two modes we are about to describe.

Figure: kube-proxy userspace mode.

Example #1: Userspace Mode

Let’s illustrate how the userspace mode works using an example in the image above. Here, kube-proxy opens a random port (10400) on the node’s eth0  interface after the Service with the ClusterIP 10.104.141.67  is created.

Then, kube-proxy creates netfilter rules that reroute packets sent to the Service VIP to the proxy port. After the packets get to this port, kube-proxy selects one of the backend Pods (e.g., Pod 1) and forwards traffic to it. As you can imagine, a number of intermediary steps are involved in this process.
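For illustration, the redirection rule installed in this mode is conceptually similar to the following sketch (simplified: the actual chain names kube-proxy uses differ, and port 80 is assumed as the Service port):

# Redirect traffic destined for the Service VIP (assumed port 80)
# to the local proxy port 10400 that kube-proxy listens on.
$ iptables -t nat -A PREROUTING -d 10.104.141.67/32 -p tcp --dport 80 \
    -j REDIRECT --to-ports 10400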

Iptables Mode

Iptables has been the default kube-proxy mode since Kubernetes v1.2. It allows for faster packet routing between Services and backend Pods than the userspace mode.

In the iptables mode, kube-proxy no longer works as a reverse proxy load balancing the traffic between backend Pods. This task is delegated to iptables/netfilter. Iptables is tightly integrated with netfilter, so there is no need to frequently switch context between the userspace and the kernelspace. Also, load balancing between backend Pods is done directly via the iptables rules.

This is how the entire process looks (see the image below):

  • As in the userspace mode, kube-proxy watches for the creation/deletion of Services and their Endpoints objects.
  • However, instead of opening a random port on the host when a new Service is created/updated, kube-proxy immediately installs iptables rules that capture traffic to the Service’s ClusterIP and Port and redirect it to the Service’s set of backend Pods.
  • Also, kube-proxy installs iptables rules for each Endpoint object. These rules are used by iptables to select a backend Pod. By default, the choice of backend Pod is random.

Thus, in the iptables mode, kube-proxy fully delegates the task of redirecting traffic and load balancing between the backend Pods to netfilter/iptables. All these tasks happen in kernelspace, which makes the process much faster than in the userspace mode.

Figure: kube-proxy iptables mode.

However, kube-proxy retains its role of keeping netfilter rules in sync. It constantly watches for Service and Endpoints updates and changes iptables rules accordingly.

Iptables mode is great, but it has one tangible limitation. Remember that in the userspace mode kube-proxy directly load balances between Pods? It can select another Pod if the one it’s trying to access does not respond. Iptables rules, however, don’t have the mechanism to automatically retry another Pod if the one it initially selects does not respond. Therefore, this mode depends on having working readiness probes.

Example #2: Check iptables rules created by kube-proxy for a Service

In this example, we demonstrate how to access the iptables rules created by kube-proxy for an HTTPD Service. This example was tested on Kubernetes 1.13.0 running on Minikube 0.33.1; the commands below are representative of that setup, so adjust the names to your environment.

First, let’s create a HTTPD Deployment:

Next, expose it via a Service:
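# Expose the Deployment as a ClusterIP Service on port 80.
$ kubectl expose deployment httpd-deployment --type=ClusterIP --port=80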

We need to know the ClusterIP of the Service to identify it later. It is 10.104.141.67, as the output below suggests:
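# Show the Service and its assigned ClusterIP (10.104.141.67 in this walkthrough).
$ kubectl get service httpd-deployment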

Iptables rules are installed by the kube-proxy Pod, so we’ll need to get its name first:
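# Find the kube-proxy Pod in the kube-system namespace and note its full name.
$ kubectl get pods -n kube-system | grep kube-proxy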

Finally, get a shell to the running kube-proxy Pod:
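# Open an interactive shell inside the kube-proxy Pod
# (replace <kube-proxy-pod-name> with the name found above).
$ kubectl exec -it <kube-proxy-pod-name> -n kube-system -- /bin/sh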

We can now access iptables from inside the kube-proxy Pod. For example, you can list all rules in the nat table like this:
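# List all rules in the nat table.
$ iptables -t nat -L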

Or, you can list the rules in the custom KUBE-SERVICES chain of the nat table, which is designed to store Service rules:
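# List the rules in the KUBE-SERVICES chain of the nat table.
$ iptables -t nat -L KUBE-SERVICES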

This chain includes a list of rules for your K8s Services:
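KUBE-SVC-XXXXXXXXXXXXXXXX  tcp  --  anywhere  10.104.141.67  /* default/httpd-deployment: cluster IP */  tcp dpt:http

(The entry above is illustrative only; the KUBE-SVC-... chain name is a hash generated by kube-proxy and will differ in your cluster, and rules for other Services will appear alongside it.)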

As this rule suggests, traffic to our Service with the ClusterIP 10.104.141.67 (TCP dpt:http) is forwarded to default/httpd-deployment (the Service’s backend Pods). This forwarding is performed directly by iptables using random Pod selection.

IPVS Mode

The IPVS mode graduated to General Availability in K8s v1.11. It is considered the most efficient and modern way to route cluster traffic to backend Pods.

In the IPVS mode, load balancing is performed using IPVS (IP Virtual Server). Built on top of netfilter, IPVS implements transport-layer load balancing as part of the Linux kernel and is a component of the Linux Virtual Server (LVS) project.

IPVS can direct requests for TCP- and UDP-based services to the real servers and make the services on these servers appear as virtual services on a single IP address. Sound familiar? This is exactly what K8s Services need.

Thus, you may think of IPVS as a Linux kernel load balancer that plays a role similar to the one kube-proxy itself plays in the userspace mode.

Figure: kube-proxy IPVS mode.

IPVS mode has a number of benefits over the iptables mode.

As we’ve already mentioned, iptables was originally designed mainly for firewalls, which normally require only a modest number of rules and conditions to work properly. However, in a Kubernetes environment we may have thousands of Services, all of which require specific networking rules. Iptables does not scale very well in such an environment.

For example, let’s say we have a 5,000-node cluster (a size already supported in K8s v1.6). If we have 2,000 Services and each Service has 10 Pods, we’ll need roughly 20,000 iptables records on each node in the cluster. Moreover, iptables needs to constantly update these rules because backend Pod IPs churn frequently in a containerized environment. This would keep the Linux kernel busy and overloaded all the time in large production clusters.

The IPVS mode solves this scalability issue for large clusters. IPVS was specifically designed for load balancing, and it uses hash tables to store network rules more efficiently than iptables. This allows for almost unlimited scale and fast network throughput because all processing occurs in kernelspace.

The kube-proxy process in this mode looks as follows (see the image above):

  • kube-proxy watches K8s Services and Pod Endpoints. If a new Service is created, kube-proxy calls the netlink interface to create IPVS rules.
  • Also, it periodically syncs IPVS rules with the K8s Services and Endpoints to make sure that the desired state is maintained.
  • When a Service is accessed, the IPVS load balancer redirects traffic to the backend Pods (you can inspect these IPVS rules on a node, as sketched below).
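Assuming the ipvsadm utility is available on the node, a quick way to inspect the virtual servers (Service ClusterIPs) and their real servers (backend Pod IPs) is:

# List IPVS virtual servers and their real servers, using numeric addresses (-n).
$ ipvsadm -Ln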

Load balancing between backend Pods is done by the round-robin algorithm by default. However, the mode supports other load balancing algorithms such as:

  • lc: least connection
  • dh: destination hashing
  • sh: source hashing
  • sed: shortest expected delay
  • nq: never queue

You can select the algorithm by using the --ipvs-scheduler flag on kube-proxy.
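For example, assuming kube-proxy is configured via command-line flags rather than a configuration file, enabling IPVS mode with the least connection scheduler could look like this:

# Run kube-proxy in IPVS mode with the least connection (lc) scheduler.
$ kube-proxy --proxy-mode=ipvs --ipvs-scheduler=lc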

Note: The ipvs mode requires IPVS kernel modules to be installed on the node before running kube-proxy. If the modules are not installed, kube-proxy will fall back to the iptables proxy mode.

Conclusion

That’s it! We hope this article shed some light on how kube-proxy works under the hood. The userspace mode is the legacy kube-proxy mode and is no longer used by default. When running Kubernetes in production, the main choice is between the iptables and IPVS modes. Iptables is good for most small and medium clusters, but it may cause scalability issues when working with thousands of Services and rules. The IPVS mode is thus a better option for large production clusters, but it requires additional configuration that is beyond the scope of this article.
