Sunday, November 17, 2024

notes on installing Kubernetes on Debian

Not instructions per se, but a guide to the instructions plus thoughts on choices that arise in installing Kubernetes onto a Debian machine.

I began writing this posting as specific instructions on how to install a specific version of Kubernetes onto a specific version of Debian. In the process, I realized that there isn't a one-size-fits-all approach to installing Kubernetes. So these are the notes that remain from what I learned that are still generally applicable. Most of their value is that they linearize the thought process of installing Kubernetes from scratch.

The Kubernetes website has good, thorough instructions on how to install Kubernetes on Debian, and its instructions on “Installing kubeadm” include pointers to installing CRI-O as well; those pages should be considered the authority on the subject. (They will be referenced again, below.)

If you're not familiar with the Kubernetes cluster architecture, you should consult that page as needed.

not the quick solution

If you're looking to quickly get Kubernetes running on a small scale, there are faster ways. This post is aimed at the production-grade approach, the kind that a business (e.g., one of my employers or clients, and maybe one of yours) would use. So it's aimed at developing an understanding of how business-grade systems work. If you want the quick solution, here are some that were recommended to me: k3s, RKE2, microk8s, Charmed Kubernetes, OpenShift.

caveats

These are notes that I made while going through the process for the first time. I haven't tested them a second time, except inasmuch as I had to repeat parts for the second node. So there may be things that I forgot to document.

These notes are based on Debian 12, which mostly means that systemd is used.

nodes

The first thing to consider is which nodes you'll install the Kubernetes cluster on. Kubernetes separates its design into a control plane and a set of worker nodes (what might be called a “data plane”).

  • The control plane runs on nodes, and may run all on one node, or replicated across multiple nodes for high availability. There must be at least one control plane node in order for a cluster to exist.
  • There must be at least one worker node in order for the cluster to service workloads.
  • The control plane and workload processes can be mixed together on a node. This requires removing taints on such nodes that would normally prohibit workloads from running there. (See InitConfiguration.nodeRegistration.taints for the setting.) The case of having a single-node cluster with both control plane and workload combined may be useful in development or testing.
  • Extra control plane or worker nodes can be added for redundancy. A common choice in production is to have an odd number (at least three) of control plane nodes; this can help the cluster decide which part is authoritative if the network is partitioned. Distributing the nodes across hardware, racks, power supplies, or data centers can improve the reliability.
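The preference for an odd number of control plane nodes follows from majority-quorum arithmetic. The sketch below assumes that the control plane's data store (etcd, in a kubeadm cluster) needs a strict majority of its members in order to stay writable; the function names are mine, for illustration only.

```python
def quorum(members: int) -> int:
    """Smallest strict majority of a group with the given member count."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority still remains."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in range(1, 6):
        print(f"{n} control plane node(s): quorum {quorum(n)}, "
              f"tolerates {tolerated_failures(n)} failure(s)")
```

Going from three nodes to four adds no failure tolerance under this model, which is why even counts are rarely chosen.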

There are several ways to place the workload with respect to the control plane. I chose to put the control plane on nodes that are separate from the worker nodes, in order to protect the control plane services from any workload that may manage to break out of its restraints (Linux namespaces and cgroups). I’m starting with one node of each type.

The Kubernetes tools (kubeadm in particular) allow one to add and remove nodes later, so the initial node layout doesn't matter much.

Xen domUs

If your machines are Xen domUs, you'll want to set vcpus equal to maxvcpus in the domU config, because vcpus is what determines the number of vCPUs that appear in /proc/cpuinfo, and thus determines the number of vCPUs that Kubernetes believes to be present. If you over-allocate the vCPUs among the Xen domains, perhaps in order to ensure that they're not underutilized, you can use scheduler weights to adjust each domain's relative share of the CPUs.

For example, with a 24-core, 32-thread Intel CPU, 32 Xen vCPUs would be available and could be allocated thus:

  • dom0: dom0_max_vcpus=1-4
  • control-plane domU: vcpus = 2, maxvcpus = 2, cpu_weight = 192
  • worker domU: vcpus = 30, maxvcpus = 30, cpu_weight = 128
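To see how far an allocation like this over-commits the hardware, the arithmetic can be sketched as below. The numbers are the ones from the example; counting dom0 at its upper bound of 4 vCPUs is my assumption, and the helper names are illustrative.

```python
HARDWARE_THREADS = 32

# vCPUs per domain, taken from the example allocation.
# dom0 is counted at its upper bound (dom0_max_vcpus=1-4): an assumption.
vcpus = {"dom0": 4, "control-plane": 2, "worker": 30}

# Scheduler weights of the two domUs from the example.
weights = {"control-plane": 192, "worker": 128}


def overcommit_ratio(allocated: dict, threads: int) -> float:
    """Total allocated vCPUs divided by the hardware threads available."""
    return sum(allocated.values()) / threads


def weight_ratio(w: dict, a: str, b: str) -> float:
    """Relative scheduler weight of domain a compared to domain b."""
    return w[a] / w[b]


if __name__ == "__main__":
    print(f"allocated vCPUs: {sum(vcpus.values())} of {HARDWARE_THREADS}")
    print(f"overcommit ratio: {overcommit_ratio(vcpus, HARDWARE_THREADS):.3f}")
    print("control-plane : worker weight =",
          weight_ratio(weights, "control-plane", "worker"))
```

How the weights translate into CPU time under contention depends on the Xen scheduler in use, so treat the ratio as a relative preference rather than a guarantee.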

deployment method

On dedicated hardware, there are several ways to deploy the control plane:

  • A “traditional” deployment uses systemd to run (and keep running) the control plane services. This is a manual configuration. Using kubeadm is preferred because it configures the cluster in a more consistent way; see the next point.
  • The “static pods” deployment, used by kubeadm, lets kubelet manage the life of the control plane services as static pods.
  • The “self-hosted” deployment, in which the cluster itself manages the running of the control plane services. This seems a bit fragile, in that a problem in one control plane component could cause the whole control plane to fail.

So the way that I prefer for my on-premises hardware is to use kubeadm.

Debian repository limitations

Note that the Debian repository has kubernetes-client, containerd, crun and runc packages, but we're not using these since, as is usual with stable Debian releases, the packages are out-of-date by a year on average, and updates to this code are frequent and often security-related. Also there are restrictions on which version of kubectl can be used with a given cluster: it should differ from kube-apiserver by no more than one minor version. Further, the repository doesn't contain the other Kubernetes packages. So simply installing from Debian isn't an option.
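The one-minor-version rule can be checked mechanically. A minimal sketch, assuming version strings of the form v1.31.2 (with or without the leading "v"); the function names are illustrative and not part of any Kubernetes tool.

```python
import re


def minor_version(version: str) -> tuple:
    """Extract (major, minor) from a version string such as 'v1.31.2'."""
    match = re.match(r"v?(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def within_one_minor(first: str, second: str) -> bool:
    """True if the two versions differ by at most one minor version."""
    first_major, first_minor = minor_version(first)
    second_major, second_minor = minor_version(second)
    return first_major == second_major and abs(first_minor - second_minor) <= 1


if __name__ == "__main__":
    print(within_one_minor("v1.31.2", "v1.32.0"))
    print(within_one_minor("v1.29.0", "v1.31.1"))
```

A package from the Debian stable repository that is a year old would typically fail this check against a current cluster.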

package installation

These deb packages need to be installed on every node. The cri-o package's services run containers on the node, and the kubelet service connects the node to the control plane. Some of the control plane is implemented by the running of containers (that perform actions specific to the control plane needs) so all the infrastructure is needed on every node. The kubeadm package is only needed for managing the node's membership in the cluster. The kubectl package may not be strictly necessary on every node, but it helps to have it if proxying isn't working.

The process involves setting up apt to fetch from the CRI-O and Kubernetes Debian-style repos. These official Debian packages will be needed to do that:

apt install curl gpg

CRI-O

Kubernetes first needs a container runtime that supports CRI. CRI-O is the more modern choice of the popular implementations. You'll first need to check that some Linux prerequisites are in place, according to the instructions on the “Container Runtimes” page. The CRI-O project has installation instructions, which include Debian-specific instructions.

Following those instructions, and before running systemctl start crio, it's necessary to remove or move aside CRI-O's *-crio-*.conflist CNI configuration files from /etc/cni/net.d/. We will use Calico's CNI configuration instead, which the Calico operator will install. Then continue with running systemctl start crio. Stop after that; the rest of the instructions are addressed in the next step.

Kubernetes will automatically detect that CRI-O is running, from the presence of its UNIX socket. CRI-O needs an OCI-compatible runtime to which it can delegate the actual running of containers; it comes with crun and is configured to use it by default.

Kubernetes

Continue on by following “Installing kubeadm, kubelet and kubectl”. (You will have done most of this in the previous step. The apt-mark hold is still needed.)

Once all the packages are installed, you're ready to run kubeadm init. But first you'll need to understand how to configure it.

dual-stack IPv4/IPv6

In order to avoid DNAT to any Services published from the Kubernetes cluster and SNAT from any Pods, one would like the Pods and Services to have publicly-routable IPv6 addresses. (IPv4 addresses on the Pods would probably be RFC1918 internal addresses, since IPv4 addresses aren't as abundant, and so would need to be NATed in any case.) The Pods and Services should have both IPv4 and IPv6 addresses, so that they can interact easily with either stack. The IPv6 addresses that we're using here come from the 2001:db8::/32 prefix, which is reserved for use in documentation. They stand in for global-scope unicast addresses; we could instead use unique local addresses (ULA, the successor to the deprecated site-local addresses), with the caveat that we'd need to NAT external communication.

This is a choice that needs to be made when the cluster is created; it can't be changed later, nor can the chosen CIDRs (at least not via kubeadm). One subnet of each type is needed for Pods and Services. This of course assumes that you have an IPv6 subnet assigned to you from which you can allocate. When sizing the subnets, mind that every Pod and Service needs an address.

For the examples, we'll use 10.0.0.0/16 and 2001:db8:0:0::/64 for the Pod subnet, and 10.1.0.0/16 and 2001:db8:0:1::/108 for the Service subnet. (The services IPv6 subnet can be no larger than /108. This isn't documented, but the scripts check for it.) For your real-life cluster, I recommend that you choose a random 10. subnet, instead of these above or the 10.96 default, to avoid address collisions when you have access to multiple clusters or private networks in general. It's probably best to choose two consecutive subnets as above, to make firewalling rules shorter; that is, 10.0.0.0/15 and 2001:db8:0:0::/63 cover both of the above pairs of subnets.
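The claim that one pair of supernets covers all four subnets is easy to verify with Python's standard ipaddress module. This is only a sketch: the constant and function names are mine, and the random picker simply illustrates choosing an even-aligned pair of consecutive 10.x.0.0/16 subnets so that a single /15 covers both.

```python
import ipaddress
import random

POD_SUBNETS = ["10.0.0.0/16", "2001:db8:0:0::/64"]
SERVICE_SUBNETS = ["10.1.0.0/16", "2001:db8:0:1::/108"]
COVERING = ["10.0.0.0/15", "2001:db8:0:0::/63"]


def covered(subnet: str, supernets: list) -> bool:
    """True if subnet lies inside a supernet of the same IP version."""
    net = ipaddress.ip_network(subnet)
    return any(
        net.version == sup.version and net.subnet_of(sup)
        for sup in map(ipaddress.ip_network, supernets)
    )


def random_ipv4_pair(rng: random.Random) -> tuple:
    """Pick an even-aligned pair of consecutive 10.x.0.0/16 subnets."""
    second_octet = rng.randrange(0, 256, 2)
    return f"10.{second_octet}.0.0/16", f"10.{second_octet + 1}.0.0/16"


if __name__ == "__main__":
    for subnet in POD_SUBNETS + SERVICE_SUBNETS:
        print(subnet, "covered:", covered(subnet, COVERING))
    print("random pod/service pair:", random_ipv4_pair(random.Random()))
```

Starting the pair on an even second octet matters: 10.1.0.0/16 and 10.2.0.0/16 are consecutive, but no single /15 contains both.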

Forwarding of IPv4 packets must be enabled for most CNI implementations, but most will do this automatically if needed. For dual-stack, you'll also need to enable IPv6 forwarding; it's not clear whether CNI implementations will also do this automatically. In any case, these settings are required for the kubeadm pre-flight checks to pass. Run this on every node:

sysctl -w net.ipv4.ip_forward=1
sysctl -w net.ipv6.conf.all.forwarding=1

and create /etc/sysctl.d/k8s.conf:

net.ipv4.ip_forward=1
net.ipv6.conf.all.forwarding=1
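Since the pre-flight checks depend on these settings, it can be handy to confirm them on each node before running kubeadm. The following is a small sketch that reads the values from /proc/sys; the function name and the proc_sys parameter (there only so the check can be pointed at another directory) are my own.

```python
from pathlib import Path

# sysctl names from the settings above, mapped to their files under /proc/sys.
FORWARDING_KEYS = {
    "net.ipv4.ip_forward": "net/ipv4/ip_forward",
    "net.ipv6.conf.all.forwarding": "net/ipv6/conf/all/forwarding",
}


def forwarding_status(proc_sys: Path = Path("/proc/sys")) -> dict:
    """Report, per sysctl key, whether forwarding is currently enabled."""
    status = {}
    for key, relative in FORWARDING_KEYS.items():
        path = proc_sys / relative
        # A missing file (e.g., IPv6 disabled) counts as "not enabled".
        status[key] = path.exists() and path.read_text().strip() == "1"
    return status


if __name__ == "__main__":
    for key, enabled in forwarding_status().items():
        print(f"{key}: {'enabled' if enabled else 'NOT enabled'}")
```

This only reports the running state; whether the settings survive a reboot still depends on the file in /etc/sysctl.d.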

pod network plugin (Calico)

Kubernetes requires a pod network plugin to manage the networking between containers. Container networking is designed to scale to an extremely large number of containers, while still presenting a simple OSI layer 2 (Ethernet), or at least a layer 3 (IP), view of the Kubernetes container network. Optimal use of the network avoids using an overlay network (e.g., VXLAN or IPIP) unless necessary (e.g., between nodes). Here we can avoid an overlay because we have full control over the on-prem network. Specifically, we can use an arbitrary subnet for the pods and services, but still route packets across the cluster, or into or out of the cluster, because we can adjust the routing tables as needed.

Calico also supports NetworkPolicy resources, a desirable feature.

The Pod's DNS Policy is typically the default, ClusterFirst (not Default!); setting hostNetwork: true in the PodSpec would be unusual.

full-mesh BGP routing

Since we have a small (two-node) cluster here, we're going to use a BGP full-mesh routing infrastructure, as provided by Calico. This is the default mode of the Calico configuration. As long as the nodes are all on the same OSI layer 2 network, their BGP servers can find each other and forward packets, without resorting to IPIP encapsulation.

kubeadm init options

The kubeadm init command accepts a (YAML) configuration file with an InitConfiguration and/or a ClusterConfiguration. Let's call it kubeadm-init.yaml:

---
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration

clusterName: kubernetes
controlPlaneEndpoint: control.kubernetes.internal

networking:
  dnsDomain: kubernetes.internal
  podSubnet:     10.0.0.0/16,2001:db8:0:0::/64
  serviceSubnet: 10.1.0.0/16,2001:db8:0:1::/108

apiServer:
  extraArgs:
  - name: service-cluster-ip-range
    value: 10.1.0.0/16,2001:db8:0:1::/108  # same as networking.serviceSubnet

controllerManager:
  extraArgs:
  - name: cluster-cidr
    value: 10.0.0.0/16,2001:db8:0:0::/64  # same as networking.podSubnet

  - name: service-cluster-ip-range
    value: 10.1.0.0/16,2001:db8:0:1::/108  # same as networking.serviceSubnet

  - name: node-cidr-mask-size-ipv4
    value: "16"

---
apiVersion: kubeadm.k8s.io/v1beta4
kind: InitConfiguration

nodeRegistration:
  kubeletExtraArgs:
  # Required for dual-stack; defaults only to IPv4 default address
  # Has no KubeletConfiguration parameter, else we'd set it there.
  - name: node-ip
    value: 192.168.4.2,2001:db8:0:4::2

---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration

cgroupDriver: systemd
clusterDomain: kubernetes.internal

# Enables the node to work with and use swap.
# See https://kubernetes.io/docs/concepts/architecture/nodes/#swap-memory
failSwapOn: false
memorySwap:
  swapBehavior: LimitedSwap

---
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration

mode: nftables
clusterCIDR: 10.0.0.0/16,2001:db8:0:0::/64  # same as ClusterConfiguration.networking.podSubnet
  • Although this looks like a YAML file, it doesn't accept all of the YAML 1.2 syntax. If you try to place comments or directives before the first "---", kubeadm will fail to parse it.
  • Note that the convention that I use here is that the DNS domain in which all of this cluster's parts exist is formed from the name of the cluster (here, kubernetes). So the domain is kubernetes.internal, the load-balanced control endpoint is control.kubernetes.internal, the Services subdomain is service.kubernetes.internal, etc.
  • We're going to pretend here that the machine on which we're installing Kubernetes has IP addresses 192.168.4.2 and 2001:db8:0:4::2. In order for the kubelet service on the node to listen on both the IPv4 and IPv6 addresses of its node, we need to configure InitConfiguration.nodeRegistration.kubeletExtraArgs as above. Otherwise it would only listen on one of those.
  • We're using here the subnets that we chose above.
  • Since we're preparing here for later expansion of the control plane to other nodes, we'll need to specify the ClusterConfiguration.controlPlaneEndpoint. (This represents the --control-plane-endpoint command-line value.) Easiest is to use DNS round-robin A records, but since I plan to run DNS on the cluster, I'll just add an /etc/hosts entry for now.
  • The kubelet also needs configuration, which we supply as the KubeletConfiguration document in the same file.
  • In the KubeletConfiguration, we're taking advantage of any swap that may be available to the machine.
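Several values in this file are repeated by hand (see the "same as" comments), so a quick consistency check can catch copy-and-paste drift. The sketch below works on the YAML documents after they've been parsed into dictionaries; the parsing itself (for instance with a YAML library) is left out to keep the example to the standard library, and the function names are illustrative.

```python
from typing import Optional


def extra_arg(component: dict, name: str) -> Optional[str]:
    """Look up a v1beta4-style extraArgs entry (a list of name/value pairs)."""
    for arg in component.get("extraArgs", []):
        if arg["name"] == name:
            return arg["value"]
    return None


def find_mismatches(documents: list) -> list:
    """Describe every duplicated setting that disagrees with its source."""
    by_kind = {doc["kind"]: doc for doc in documents}
    cluster = by_kind["ClusterConfiguration"]
    net = cluster["networking"]
    kubelet = by_kind.get("KubeletConfiguration", {})
    proxy = by_kind.get("KubeProxyConfiguration", {})
    api = cluster.get("apiServer", {})
    manager = cluster.get("controllerManager", {})
    # (description, authoritative value, duplicated value)
    checks = [
        ("apiServer service-cluster-ip-range", net["serviceSubnet"],
         extra_arg(api, "service-cluster-ip-range")),
        ("controllerManager service-cluster-ip-range", net["serviceSubnet"],
         extra_arg(manager, "service-cluster-ip-range")),
        ("controllerManager cluster-cidr", net["podSubnet"],
         extra_arg(manager, "cluster-cidr")),
        ("KubeProxyConfiguration clusterCIDR", net["podSubnet"],
         proxy.get("clusterCIDR")),
        ("KubeletConfiguration clusterDomain", net["dnsDomain"],
         kubelet.get("clusterDomain")),
    ]
    # A setting that is absent is not a mismatch; only a differing value is.
    return [f"{what}: expected {want!r}, found {got!r}"
            for what, want, got in checks
            if got is not None and got != want]


if __name__ == "__main__":
    example = [
        {"kind": "ClusterConfiguration",
         "networking": {"dnsDomain": "kubernetes.internal",
                        "podSubnet": "10.0.0.0/16,2001:db8:0:0::/64",
                        "serviceSubnet": "10.1.0.0/16,2001:db8:0:1::/108"},
         "controllerManager": {"extraArgs": [
             {"name": "cluster-cidr",
              "value": "10.0.0.0/16,2001:db8:0:0::/64"}]}},
        {"kind": "KubeProxyConfiguration",
         "clusterCIDR": "10.0.0.0/16,2001:db8:0:0::/64"},
    ]
    print(find_mismatches(example) or "no mismatches")
```

The comparison is by exact string, so two spellings of the same subnet list (say, with a space after the comma) would be reported as a mismatch.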

kubeadm reset

If the kubeadm init or kubeadm join fails to complete, kubeadm reset can be used to revert most of its effect. The command output spells out what other state needs clean-up.

initial control plane node

The first control plane node is where the Kubernetes cluster is created/initialized. Any additional control plane node will be joined to the cluster, and so would be handled differently.

It's assumed that Debian is already installed on the node, and that there is no firewalling set up on the node. As an option, Kubernetes provides its own mechanism for adding firewalling that's compatible with its services, which we'll add below.

build the cluster

Then we run:

kubeadm init --config=kubeadm-init.yaml

You should use script to capture the command's output when you run it, since the final output includes a token and CA certificate hash needed for future steps. If you perform the configuration setup that's described in the command's final output, you can then, as a non-root user, run kubectl cluster-info to verify that the cluster's running. You might also want to enable Bash autocompletion for kubectl.

deploy Calico

Once the cluster is running, and before adding applications, we need to install the network plugin. This has two parts: install the Tigera Kubernetes Operator, and deploy Calico (using the operator). We'll follow the Calico instructions for a self-managed, on-premises cluster. (The same thing could be done with Helm, but that presumes having Helm already installed.) If you've set up the non-root user's ~/.kube/config according to the instructions output by kubeadm init, you can run the instructions as that user.

You will need to customize the custom-resources.yaml file's Installation.spec.calicoNetwork.ipPools to match the IPv4 and IPv6 pools chosen above. Calico has instructions for configuring for dual-stack. For example, given the above subnets, the Installation configuration should be:

apiVersion: operator.tigera.io/v1
kind: Installation

metadata:
  name: default
spec:
  calicoNetwork:
    linuxDataplane: Nftables
    ipPools:
    - name: default-ipv4-pool
      cidr: 10.0.0.0/16  # from ClusterConfiguration.networking.podSubnet
      encapsulation: None
      natOutgoing: Enabled

    - name: default-ipv6-pool
      cidr: 2001:db8:0:0::/64  # from ClusterConfiguration.networking.podSubnet
      encapsulation: None
      natOutgoing: Disabled
---
apiVersion: operator.tigera.io/v1
kind: APIServer

metadata:
  name: default
spec: {}
  • We're assuming that the IPv6 addresses are global scope, so SNAT is disabled; if we chose unique local addresses (ULA) instead, then we would enable SNAT.
  • We're assuming that the Kubernetes nodes (on which BIRD, Calico's BGP server, runs) are all connected together on the same OSI layer 2 network, so that the BIRD instances can find each other automatically.
  • We need to tell Calico to use nftables, consistent with our choice for Kubernetes.
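The ipPools entries follow mechanically from the podSubnet value, which the sketch below makes explicit. It's an illustration only: the function name and the ipv6_is_global flag are mine, and the pool names and field values simply mirror the Installation shown above.

```python
import ipaddress
import json


def ip_pools(pod_subnet: str, ipv6_is_global: bool = True) -> list:
    """Build Installation.spec.calicoNetwork.ipPools entries from podSubnet."""
    pools = []
    for cidr in pod_subnet.split(","):
        cidr = cidr.strip()
        version = ipaddress.ip_network(cidr).version
        # IPv4 pods use private addresses, so outgoing traffic is NATed;
        # IPv6 is NATed only when the addresses aren't globally routable.
        nat = version == 4 or not ipv6_is_global
        pools.append({
            "name": f"default-ipv{version}-pool",
            "cidr": cidr,
            "encapsulation": "None",
            "natOutgoing": "Enabled" if nat else "Disabled",
        })
    return pools


if __name__ == "__main__":
    print(json.dumps(ip_pools("10.0.0.0/16,2001:db8:0:0::/64"), indent=2))
```

Deriving the pools from the same string that goes into ClusterConfiguration.networking.podSubnet keeps the two configurations from drifting apart.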

If you have a slow Internet connection, it may take some minutes for the Calico pods to come up, because images need to be downloaded. (This is, of course, true for any new images that you'll be starting.)

You can optionally validate dual-stack networking.

worker node(s)

Before you can deploy workloads to the cluster, you'll need to add a worker node. We assume that a machine has been set up according to the “package installation” and “dual-stack” sections above. The end of the output from the kubeadm init command above contains the kubeadm join command that you should run on the machine that will be the new worker node, e.g.:

kubeadm join control.kubernetes.internal:6443 --token asdf… \
        --discovery-token-ca-cert-hash sha256:deadbeef…

The preferred way to use these values is in a JoinConfiguration file (e.g., called kubeadm-join.yaml):

apiVersion: kubeadm.k8s.io/v1beta4
kind: JoinConfiguration

discovery:
  bootstrapToken:
    apiServerEndpoint: control.kubernetes.internal:6443
    token: "asdf…"
    caCertHashes:
    - "sha256:deadbeef…"

nodeRegistration:
  kubeletExtraArgs:
  - name: "node-ip"
    value: "192.168.4.3,2001:db8:0:4::3"
  • We're going to pretend here that the machine on which we're installing Kubernetes has IP addresses 192.168.4.3 and 2001:db8:0:4::3, analogous to what we did in the InitConfiguration.
  • The token is only good for 24 hours (by default). You can create a new one with kubeadm token create on the control node.
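The mapping from the printed kubeadm join command to the JoinConfiguration fields can be sketched as a small translation function. This assumes the command has the simple "flag value" shape shown above; the function name is mine, and TOKEN and sha256:HASH are placeholders rather than real credentials.

```python
import json
import shlex


def join_configuration(command: str, node_ip: str) -> dict:
    """Translate a 'kubeadm join' command line into a JoinConfiguration dict."""
    # Join shell line continuations before splitting into words.
    words = shlex.split(command.replace("\\\n", " "))
    if len(words) < 3 or words[:2] != ["kubeadm", "join"]:
        raise ValueError("expected: kubeadm join <endpoint> --token ...")
    endpoint, rest = words[2], words[3:]
    # Pair each flag with the value that follows it.
    options = dict(zip(rest[0::2], rest[1::2]))
    return {
        "apiVersion": "kubeadm.k8s.io/v1beta4",
        "kind": "JoinConfiguration",
        "discovery": {
            "bootstrapToken": {
                "apiServerEndpoint": endpoint,
                "token": options["--token"],
                "caCertHashes": [options["--discovery-token-ca-cert-hash"]],
            }
        },
        "nodeRegistration": {
            "kubeletExtraArgs": [{"name": "node-ip", "value": node_ip}],
        },
    }


if __name__ == "__main__":
    printed = ("kubeadm join control.kubernetes.internal:6443 --token TOKEN \\\n"
               "        --discovery-token-ca-cert-hash sha256:HASH")
    config = join_configuration(printed, "192.168.4.3,2001:db8:0:4::3")
    print(json.dumps(config, indent=2))
```

The result is printed as JSON, which is itself valid YAML, so it could be saved as kubeadm-join.yaml after a review.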

Then you can run kubeadm join --config=kubeadm-join.yaml control.kubernetes.internal:6443. (Note that the server name, control.kubernetes.internal, needs to be resolvable on the worker node. So you might need to add it to /etc/hosts, DNS, etc.)

machine shutdown

So what happens when the machine on which we're running the whole cluster shuts down? Debian uses systemd, so the cri-o and kubelet services will be shut down, and with them all of the workload and control plane containers that run.

upgrading

Kubernetes releases a new minor version about three times per year, with a few patch versions in between. CRI-O follows the Kubernetes minor release cycle. See “Upgrading kubeadm clusters” for the full details. “Skipping MINOR versions when upgrading is unsupported.”

You may wonder why we need to place a hold on the Kubernetes packages. Couldn't we remove the hold and perform the upgrades automatically, using unattended-upgrades? The answer is no, because there are sometimes manual steps required even after the software packages have been upgraded. Specifically, a kubeadm upgrade apply will be needed, and sometimes the configuration API changes.
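Because minor versions can't be skipped, an upgrade across several releases has to pass through each minor version in turn. The sketch below lists those steps; the function name is illustrative, and it assumes version strings like v1.29.4.

```python
def upgrade_path(current: str, target: str) -> list:
    """List the minor versions to step through, one upgrade round each."""
    cur_major, cur_minor = (int(p) for p in current.lstrip("v").split(".")[:2])
    tgt_major, tgt_minor = (int(p) for p in target.lstrip("v").split(".")[:2])
    if cur_major != tgt_major or tgt_minor < cur_minor:
        raise ValueError("only forward upgrades within one major version")
    return [f"v{cur_major}.{m}" for m in range(cur_minor + 1, tgt_minor + 1)]


if __name__ == "__main__":
    print(upgrade_path("v1.29.4", "v1.32.0"))  # ['v1.30', 'v1.31', 'v1.32']
```

Each entry in the list means a full round of the upgrade procedure, so a cluster that falls several releases behind gets progressively more tedious to bring up to date.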

next steps