Not instructions per se, but a guide to the instructions plus thoughts on choices that arise in installing Kubernetes onto a Debian machine.
I began writing this posting as specific instructions on how to install a specific version of Kubernetes onto a specific version of Debian. In the process, I realized that there isn't a one-size-fits-all approach to installing Kubernetes. So these are the notes that remain from what I learned that are still generally applicable. Most of their value is that they linearize the thought process of installing Kubernetes from scratch.
The Kubernetes website has good, thorough instructions on how to install Kubernetes on Debian, and its instructions on “Installing kubeadm” include pointers to installing CRI-O as well; those pages should be considered the authority on the subject. (They will be referenced again, below.)
If you're not familiar with the Kubernetes cluster architecture, you should consult that page as needed.
not the quick solution
If you're looking to quickly get Kubernetes running on a small scale, there are faster ways. This post is aimed at the production-grade approach, the kind that a business (e.g., one of my employers or clients, and maybe one of yours) would use. So it's aimed at developing an understanding of how business-grade systems work. If you want the quick solution, here are some that were recommended to me: k3s, RKE2, microk8s, Charmed Kubernetes, OpenShift.
caveats
These are notes that I made while going through the process for the first time. I haven't tested them a second time, except inasmuch as I had to repeat parts for the second node. So there may be things that I forgot to document.
These notes are based on Debian 12, which mostly means that systemd is used.
nodes
The first subject to consider is which nodes will host the Kubernetes cluster. Kubernetes separates its design into a control plane and a set of worker nodes (what might be called a “data plane”).
- The control plane runs on nodes, and may run all on one node, or replicated across multiple nodes for high availability. There must be at least one control plane node in order for a cluster to exist.
- There must be at least one worker node in order for the cluster to service workloads.
- The control plane and workload processes can be mixed together on a node. This requires removing taints on such nodes that would normally prohibit workloads from running there. (See InitConfiguration.nodeRegistration.taints for the setting.) The case of having a single-node cluster with both control plane and workload combined may be useful in development or testing.
- Extra control plane or worker nodes can be added for redundancy. A common choice in production is to have an odd number (at least three) of control plane nodes; this can help the cluster decide which part is authoritative if the network is partitioned. Distributing the nodes across hardware, racks, power supplies, or data centers can improve the reliability.
There are several ways to place the workload with respect to the control plane. I chose to separate the control plane into nodes that are separate from the worker nodes, in order to protect the control plane services from any workload that may manage to break out of its restraints (Linux namespace and cgroup). I’m starting with one node of each type.
The Kubernetes tools (kubeadm in particular) allow one to reconfigure a cluster,
so the initial configuration doesn't matter much.
Xen domUs
If your machines are Xen domUs, you'll want to set vcpus equal to maxcpus in the domU config,
because vcpus is what determines the number of vCPUs that appear in /proc/cpuinfo,
and thus determines the number of vCPUs that Kubernetes believes to be present.
If you over-allocate the vCPUs among the Xen domains, perhaps in order to ensure that they're not underutilized,
you can use scheduler weights to adjust each domain's priority for the CPUs.
For example, with a 24-core, 32-thread Intel CPU, 32 Xen vCPUs would be available and could be allocated thus:
- dom0: dom0_max_vcpus=1-4
- control-plane domU: vcpus = 2, maxcpus = 2, cpu_weight = 192
- worker domU: vcpus = 30, maxcpus = 30, cpu_weight = 128
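To confirm what a domU actually sees after booting with such a configuration, one can count the processor entries in /proc/cpuinfo from inside the guest. This is only a sketch: the count_cpus helper is a name made up here for illustration, and it assumes the usual Linux /proc/cpuinfo layout with one "processor" line per logical CPU.

```shell
# Hypothetical helper (not part of any Xen or Kubernetes tool):
# count the logical CPUs listed in a cpuinfo-style file.
# Defaults to /proc/cpuinfo when no file is given.
count_cpus() {
  grep -c '^processor' "${1:-/proc/cpuinfo}"
}

echo "CPUs visible to this guest: $(count_cpus)"
```

On the worker domU in the example above, the expectation would be that this reports 30. If it reports fewer, the vcpus setting is the first thing to check.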
deployment method
On dedicated hardware, there are several ways to deploy the control plane:
- A “traditional” deployment uses systemd to run (and keep running) the control plane services. This is a manual configuration. Using kubeadm is preferred because it configures the cluster in a more consistent way; see the next point.
- The “static pods” deployment, used by kubeadm, lets kubelet manage the life of the control plane services as static pods.
- The “self-hosted” deployment, in which the cluster itself manages the running of the control plane services. This seems a bit fragile in that a problem in a control plane could cause the whole control plane to fail.
So the way that I prefer for my on-premises hardware is to use kubeadm.
Debian repository limitations
Note that the Debian repository has kubernetes-client, containerd, crun and runc packages,
but we're not using these since, as is usual with stable Debian releases, the packages are out-of-date by a year on average,
and updates to this code are frequent and often security-related.
Also there are restrictions on which versions of the components can be used together:
kubectl, for example, should differ from the cluster's API server by no more than one minor version.
Further, the repository doesn't contain the other Kubernetes packages.
So simply installing from Debian isn't an option.
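The one-minor-version rule mentioned above is easy to check mechanically. The following is a minimal sketch, assuming version strings of the form v1.31.2 (with or without the leading "v"); the minor_of and skew_ok helpers are names invented for this example, and the versions passed in are placeholders.

```shell
# Extract the minor component from a version such as "v1.31.2" or "1.31.2".
minor_of() {
  v=${1#v}         # drop an optional leading "v"
  v=${v#*.}        # drop the major component
  echo "${v%%.*}"  # keep everything up to the next dot
}

# Succeed when the two versions differ by at most one minor version.
skew_ok() {
  diff=$(( $(minor_of "$1") - $(minor_of "$2") ))
  [ "${diff#-}" -le 1 ]   # ${diff#-} strips the sign, giving the absolute value
}

# Placeholder versions, for illustration only.
if skew_ok "v1.31.2" "v1.30.5"; then
  echo "skew OK"
else
  echo "skew too large"
fi
```

In practice the two version strings would come from the installed tools themselves, for instance from the output of kubectl version, rather than being typed in by hand.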
package installation
The deb packages described in this section need to be installed on every node.
The cri-o package's services run containers on the node,
and the kubelet service connects the node to the control plane.
Some of the control plane is implemented by the running of containers (that perform actions specific to the control plane needs)
so all the infrastructure is needed on every node.
The kubeadm package is only needed for managing the node's membership in the cluster.
The kubectl package may not be strictly necessary on every node, but it helps to have it if proxying isn't working.
The process involves setting up apt to fetch from the CRI-O and Kubernetes Debian-style repos.
These official Debian packages will be needed to do that:
apt install curl gpg
CRI-O
Kubernetes first needs a container runtime that supports CRI. CRI-O is the more modern choice of the popular implementations. You'll first need to check that some Linux prerequisites are in place, according to the instructions on the “Container Runtimes” page. The CRI-O project has installation instructions, which include Debian-specific instructions.
Following those instructions, and before running systemctl start crio,
it's necessary to remove or move aside CRI-O's *-crio-*.conflist CNI configuration files from /etc/cni/net.d/.
We will use Calico's CNI configuration instead, which the Calico operator will install.
Then continue with running systemctl start crio.
Stop after that; the rest of the instructions are addressed in the next step.
Kubernetes will automatically detect that CRI-O is running, from the presence of its UNIX socket.
CRI-O needs an OCI-compatible runtime to which it can delegate the actual running of containers;
it comes with crun and is configured to use it by default.
Kubernetes
Continue on by following “Installing kubeadm, kubelet and kubectl”.
(You will have done most of this in the previous step. The apt-mark hold is still needed.)
Once all the packages are installed, you're ready to run kubeadm init.
But first you'll need to understand how to configure it.
dual-stack IPv4/IPv6
In order to avoid DNAT to any Services published from the Kubernetes cluster and SNAT from any Pods, one would like the Pods and Services to have publicly-routable IPv6 addresses. (IPv4 addresses on the Pods would probably be RFC1918 internal addresses, since IPv4 addresses aren't as abundant, and so would need to be NATed in any case.) The Pods and Services should have both IPv4 and IPv6 addresses, so that they can interact easily with either stack. The IPv6 addresses that we're using here are only for use in documentation. They're presumably global-scope unicast addresses; we could instead use unique local addresses (ULAs, the successor to the deprecated site-local addresses), with the caveat that we'd need to NAT external communication.
This is a choice that needs to be made when the cluster is created; it can't be changed later, nor can the chosen CIDRs (at least not via kubeadm).
One subnet of each type is needed for Pods and Services.
This of course assumes that you have an IPv6 subnet assigned to you from which you can allocate.
When sizing the subnets, mind that every Pod and Service needs an address.
For the examples, we'll use 10.0.0.0/16 and 2001:db8:0:0::/64 for the Pod subnet,
and 10.1.0.0/16 and 2001:db8:0:1::/108 for the Service subnet.
(The services IPv6 subnet can be no larger than /108. This isn't documented, but the scripts check for it.)
For your real-life cluster, I recommend that you choose a random 10. subnet,
instead of these above or the 10.96 default,
to avoid address collisions when you have access to multiple clusters or private networks in general.
It's probably best to choose two consecutive subnets as above, to make firewalling rules shorter;
that is, 10.0.0.0/15 and 2001:db8:0:0::/63 cover both of the above pairs of subnets.
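To double-check that a single firewall prefix really does cover both chosen IPv4 subnets, the containment test can be done with plain shell arithmetic. This is a sketch for IPv4 only; ip_to_int and cidr_covers are helper names made up for this example, not part of any standard tool.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  set -- $1          # split the address into its four octets
  IFS=$old_ifs
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# cidr_covers OUTER INNER: succeed when OUTER contains all of INNER.
cidr_covers() {
  outer_len=${1#*/}
  inner_len=${2#*/}
  # A shorter (larger) inner prefix can never fit inside the outer one.
  [ "$inner_len" -ge "$outer_len" ] || return 1
  mask=$(( (0xFFFFFFFF << (32 - outer_len)) & 0xFFFFFFFF ))
  outer_net=$(( $(ip_to_int "${1%/*}") & mask ))
  inner_net=$(( $(ip_to_int "${2%/*}") & mask ))
  [ "$outer_net" -eq "$inner_net" ]
}

for subnet in 10.0.0.0/16 10.1.0.0/16 10.2.0.0/16; do
  if cidr_covers 10.0.0.0/15 "$subnet"; then
    echo "$subnet is covered by 10.0.0.0/15"
  else
    echo "$subnet is NOT covered by 10.0.0.0/15"
  fi
done
```

The first two subnets are the Pod and Service subnets from the examples and are reported as covered; 10.2.0.0/16 is included only as a counter-example and falls outside the /15.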
Forwarding of IPv4 packets must be enabled for most CNI implementations,
but most will do this automatically if needed.
For dual-stack,
you'll also need to enable IPv6 forwarding;
it's not clear whether CNI implementations will also do this automatically.
In any case, these settings are required for the kubeadm pre-flight checks to pass.
Run this on every node:
sysctl -w net.ipv4.ip_forward=1
sysctl -w net.ipv6.conf.all.forwarding=1
and create /etc/sysctl.d/k8s.conf:
net.ipv4.ip_forward=1
net.ipv6.conf.all.forwarding=1
pod network plugin (Calico)
Kubernetes requires a pod network plugin to manage the networking between containers. Container networking is designed to scale to an extremely large number of containers, while still presenting a simple OSI layer 2 (Ethernet), or at least a layer 3 (IP), view of the Kubernetes container network. Optimal use of the network avoids using an overlay network (e.g., VXLAN or IPIP) unless necessary (e.g., between nodes). Here we can avoid an overlay because we have full control over the on-prem network. Specifically, we can use an arbitrary subnet for the pods and services, but still route packets across the cluster, or into or out of the cluster, because we can adjust the routing tables as needed.
Calico also supports NetworkPolicys, a desirable feature.
The Pod's DNS Policy is typically the default, ClusterFirst (not Default!);
setting hostNetwork: true in the PodSpec would be unusual.
full-mesh BGP routing
Since we have a small (two-node) cluster here, we're going to use a BGP full-mesh routing infrastructure, as provided by Calico. This is the default mode of the Calico configuration. As long as the nodes are all on the same OSI layer 2 network, their BGP servers can find each other and forward packets, without resorting to IPIP encapsulation.
kubeadm init options
The kubeadm init command accepts a (YAML) configuration file
with an InitConfiguration and/or a ClusterConfiguration.
Let's call it kubeadm-init.yaml:
---
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
clusterName: kubernetes
controlPlaneEndpoint: control.kubernetes.internal
networking:
  dnsDomain: kubernetes.internal
  podSubnet: 10.0.0.0/16,2001:db8:0:0::/64
  serviceSubnet: 10.1.0.0/16,2001:db8:0:1::/108
apiServer:
  extraArgs:
    - name: service-cluster-ip-range
      value: 10.1.0.0/16,2001:db8:0:1::/108 # same as networking.serviceSubnet
controllerManager:
  extraArgs:
    - name: cluster-cidr
      value: 10.0.0.0/16,2001:db8:0:0::/64 # same as networking.podSubnet
    - name: service-cluster-ip-range
      value: 10.1.0.0/16,2001:db8:0:1::/108 # same as networking.serviceSubnet
    - name: node-cidr-mask-size-ipv4
      value: "16"
---
apiVersion: kubeadm.k8s.io/v1beta4
kind: InitConfiguration
nodeRegistration:
  kubeletExtraArgs:
    # Required for dual-stack; defaults only to IPv4 default address
    # Has no KubeletConfiguration parameter, else we'd set it there.
    - name: node-ip
      value: 192.168.4.2,2001:db8:0:4::2
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd
clusterDomain: kubernetes.internal
# Enables the node to work with and use swap.
# See https://kubernetes.io/docs/concepts/architecture/nodes/#swap-memory
failSwapOn: false
memorySwap:
  swapBehavior: LimitedSwap
---
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: nftables
clusterCIDR: 10.0.0.0/16,2001:db8:0:0::/64 # same as ClusterConfiguration.networking.podSubnet
- Although this looks like a YAML file, it doesn't accept all of the YAML 1.2 syntax. If you try to place comments or directives before the first "---", kubeadm will fail to parse it.
- Note that the convention that I use here is that the DNS domain in which all of this cluster's parts exist is formed from the name of the cluster (here, kubernetes). So the domain is kubernetes.internal, the load-balanced control endpoint is control.kubernetes.internal, the Services subdomain is service.kubernetes.internal, etc.
- We're going to pretend here that the machine on which we're installing Kubernetes has IP addresses 192.168.4.2 and 2001:db8:0:4::2. In order for the kubelet service on the (worker) node to listen on both the IPv4 and IPv6 addresses of its node, we need to configure InitConfiguration.nodeRegistration.kubeletExtraArgs as above. Else it would only listen on one of those.
- We're using here the subnets that we chose above.
- Since we're preparing here for later expansion of the control plane to other nodes, we'll need to specify the ClusterConfiguration.controlPlaneEndpoint. (This represents the --control-plane-endpoint command-line value.) Easiest is to use DNS round-robin A records, but since I plan to run DNS on the cluster, I'll just add an /etc/hosts entry for now.
- We'll also need a configuration file for kubelet, which we'll call kubelet.yaml and put in the same directory.
- In the KubeletConfiguration, we're taking advantage of any swap that may be available to the machine.
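Given the note above about comments before the first "---", a tiny pre-check can catch that mistake before kubeadm does. This is only a sketch of that one rule as I understand it, not a real validator; starts_with_document_marker is a helper name invented for this example.

```shell
# Hypothetical pre-check: succeed when the first line of the given file
# is exactly the YAML document marker "---".
starts_with_document_marker() {
  first_line=$(head -n 1 "$1")
  [ "$first_line" = "---" ]
}
```

It could be used as a guard, along the lines of: starts_with_document_marker kubeadm-init.yaml && kubeadm init --config=kubeadm-init.yaml. Recent kubeadm releases also appear to offer a "kubeadm config validate --config FILE" subcommand for a fuller check; whether it is available depends on your kubeadm version, so treat that as something to verify.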
kubeadm reset
If the kubeadm init or kubeadm join fails to complete,
kubeadm reset can be used to revert most of its effect.
The command output spells out what other state needs clean-up.
initial control plane node
The first control plane node is where the Kubernetes cluster is created/initialized. Any additional control plane node will be joined to the cluster, and so would be handled differently.
It's assumed that Debian is already installed on the node, and that there is no firewalling set up on the node. As an option, Kubernetes provides its own mechanism for adding firewalling that's compatible with its services, which we'll add below.
build the cluster
Then we run:
kubeadm init --config=kubeadm-init.yaml
You should use script to capture the command's output when you run it,
since the final output includes a token and certificate needed for future steps.
If you perform the configuration setup that's described in the command's final output,
you can then, as a non-root user,
run kubectl cluster-info to verify that the cluster's running.
You might also want to enable Bash autocompletion for kubectl.
deploy Calico
Once the cluster is running, and before adding applications, we need to install the network plugin.
This has two parts: install the Tigera Kubernetes Operator, and deploy Calico (using the operator).
We'll follow the Calico instructions for a self-managed, on-premises cluster.
(The same thing could be done with Helm, but that presumes having Helm already installed.)
If you've set up the non-root user's ~/.kube/config according to the instructions output by kubeadm init,
you can run the instructions as that user.
You will need to customize the custom-resources.yaml file's Installation.spec.calicoNetwork.ipPools
to match the IPv4 and IPv6 pools chosen above.
Calico has instructions for configuring for dual-stack.
For example, given the above subnets, the Installation configuration should be:
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  calicoNetwork:
    linuxDataplane: Nftables
    ipPools:
      - name: default-ipv4-pool
        cidr: 10.0.0.0/16 # from ClusterConfiguration.networking.podSubnet
        encapsulation: None
        natOutgoing: Enabled
      - name: default-ipv6-pool
        cidr: 2001:db8:0:0::/64 # from ClusterConfiguration.networking.podSubnet
        encapsulation: None
        natOutgoing: Disabled
---
apiVersion: operator.tigera.io/v1
kind: APIServer
metadata:
  name: default
spec: {}
- We're assuming that the IPv6 addresses are global scope, so SNAT is disabled; if we chose unique local addresses instead, then we would enable SNAT.
- We're assuming that the Kubernetes nodes (on which BIRD, Calico's BGP server, runs) are all connected together on the same OSI layer 2 network, so that the BIRD instances can find each other automatically.
- We need to tell Calico to use nftables, consistent with our choice for Kubernetes.
If you have a slow Internet connection, it may take some minutes for the Calico pods to come up, because images need to be downloaded. (This is, of course, true for any new images that you'll be starting.)
You can optionally validate dual-stack networking.
worker node(s)
Before you can deploy workloads to the cluster, you'll need to add a worker node.
We assume that a machine has been set up according to the “package installation” and “dual-stack” sections above.
The end of the output from the kubeadm init command above contains
the kubeadm join command that you should run on the machine that will be the new worker node, e.g.:
kubeadm join control.kubernetes.internal:6443 --token asdf… \
    --discovery-token-ca-cert-hash sha256:deadbeef…
The preferred way to use these values is in a JoinConfiguration file (e.g., called kubeadm-join.yaml):
apiVersion: kubeadm.k8s.io/v1beta4
kind: JoinConfiguration
discovery:
  bootstrapToken:
    apiServerEndpoint: control.kubernetes.internal:6443
    token: "asdf…"
    caCertHashes:
      - "sha256:deadbeef…"
nodeRegistration:
  kubeletExtraArgs:
    - name: "node-ip"
      value: "192.168.4.3,2001:db8:0:4::3"
- We're going to pretend here that the machine on which we're installing Kubernetes has IP addresses 192.168.4.3 and 2001:db8:0:4::3, analogous to what we did in the InitConfiguration.
- The token is only good for 24 hours (by default). You can create a new one with kubeadm token create on the control node.
Then you can run kubeadm join --config=kubeadm-join.yaml control.kubernetes.internal:6443.
(Note that the server name, control.kubernetes.internal, needs to be resolvable on the worker node.
So you might need to add it to /etc/hosts, DNS, etc.)
machine shutdown
So what happens when the machine on which we're running the whole cluster shuts down?
Debian uses systemd, so the cri-o and kubelet services will be shut down,
and with them all of the workload and control plane containers that run.
upgrading
Kubernetes releases a new minor version about four times per year, with a few patch versions in between. CRI-O follows the Kubernetes minor release cycle. See “Upgrading kubeadm clusters” for the full details. “Skipping MINOR versions when upgrading is unsupported.”
You may wonder why we need to place a hold on the Kubernetes packages.
Couldn't we remove the hold and perform the upgrades automatically, using unattended-upgrades?
The answer is no, because there are sometimes manual steps required even after the software packages have been upgraded.
Specifically, a kubeadm upgrade apply will be needed, and sometimes the configuration API changes.
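Since skipping minor versions is unsupported, it can help to sanity-check a planned upgrade before touching the package holds. The following is a rough sketch, assuming version strings like v1.30.4; upgrade_step_ok is a helper name invented here, and the versions in the loop are placeholders rather than recommendations.

```shell
# Hypothetical helper: succeed when going from version $1 to version $2
# stays within the same minor version or advances by exactly one minor.
upgrade_step_ok() {
  from=${1#v}
  to=${2#v}
  from_major=${from%%.*}
  to_major=${to%%.*}
  from_rest=${from#*.}
  to_rest=${to#*.}
  from_minor=${from_rest%%.*}
  to_minor=${to_rest%%.*}
  [ "$from_major" -eq "$to_major" ] || return 1
  step=$(( to_minor - from_minor ))
  [ "$step" -ge 0 ] && [ "$step" -le 1 ]
}

# Placeholder versions, for illustration only.
for target in v1.30.6 v1.31.0 v1.32.0; do
  if upgrade_step_ok v1.30.4 "$target"; then
    echo "v1.30.4 -> $target: allowed in one step"
  else
    echo "v1.30.4 -> $target: needs an intermediate upgrade"
  fi
done
```

This only encodes the "no skipped minor versions" rule; the manual steps around kubeadm upgrade apply still have to be followed for each hop.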
next steps
- Adopt a zero trust network model. If you installed Calico via its Operator, as above, then you won't need to install the Calico CSI driver, as it was installed by the Operator. When installing Istio, the Helm chart is the preferred means.
- Add a PersistentVolume. While you can add workloads to the cluster as-is, those workloads will need to be self-contained images serving up static content. In order to support anything that requires persistent data, you'll need to support PersistentVolumeClaims. You can start with a simple hostPath PersistentVolume, but for a production-grade cluster you'll ultimately need a distributed storage management solution like DRBD or Ceph/Rook.