Sunday, November 17, 2024

notes on installing Kubernetes on Debian

Not instructions per se, but a guide to the instructions plus thoughts on choices that arise in installing Kubernetes onto a Debian machine.

I began writing this posting as specific instructions on how to install a specific version of Kubernetes onto a specific version of Debian. In the process, I realized that there isn't a one-size-fits-all approach to installing Kubernetes. So these are the notes that remain from what I learned that are still generally applicable. Most of their value is that they linearize the thought process of installing Kubernetes from scratch.

The Kubernetes website has good, thorough instructions on how to install Kubernetes on Debian, and its instructions on “Installing kubeadm” include pointers to installing CRI-O as well; those pages should be considered the authority on the subject. (They will be referenced again, below.)

If you're not familiar with the Kubernetes cluster architecture, you should consult the “Cluster Architecture” page on the Kubernetes website as needed.

not the quick solution

If you're looking to quickly get Kubernetes running on a small scale, there are faster ways. This post is aimed at the production-grade approach, the kind that a business (e.g., one of my employers or clients, and maybe one of yours) would use. So it's aimed at developing an understanding of how business-grade systems work. If you want the quick solution, here are some that were recommended to me: k3s, RKE2, microk8s, Charmed Kubernetes, OpenShift.

caveats

These are notes that I made while going through the process for the first time. I haven't tested them a second time, except inasmuch as I had to repeat parts for the second node. So there may be things that I forgot to document.

These notes are based on Debian 12, which mostly means that systemd is used.

nodes

The first subject to consider is which nodes you'll install the Kubernetes cluster on. Kubernetes separates its design into a control plane and a set of worker nodes (what might be called a “data plane”).

  • The control plane runs on nodes, and may run all on one node, or replicated across multiple nodes for high availability. There must be at least one control plane node in order for a cluster to exist.
  • There must be at least one worker node in order for the cluster to service workloads.
  • The control plane and workload processes can be mixed together on a node. This requires removing taints on such nodes that would normally prohibit workloads from running there. (See InitConfiguration.nodeRegistration.taints for the setting.) The case of having a single-node cluster with both control plane and workload combined may be useful in development or testing.
  • Extra control plane or worker nodes can be added for redundancy. A common choice in production is to have an odd number (at least three) of control plane nodes; this can help the cluster decide which part is authoritative if the network is partitioned. Distributing the nodes across hardware, racks, power supplies, or data centers can improve the reliability.
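
To see why an odd number helps, here's a bit of arithmetic as a minimal sketch. It assumes a majority-based quorum, which is how etcd (the store behind the control plane in a default kubeadm setup) decides which side of a partition stays authoritative; the function names are mine, just for illustration.

```shell
# Majority quorum for n voting members: floor(n/2) + 1.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Number of members that can be lost while a majority still remains.
tolerated_failures() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 4 5; do
    echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```

Going from three to four members raises the quorum without tolerating any additional failure, which is why an even count isn't worth the extra node.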

There are several ways to place the workload with respect to the control plane. I chose to separate the control plane into nodes that are separate from the worker nodes, in order to protect the control plane services from any workload that may manage to break out of its restraints (Linux namespaces and cgroups). I’m starting with one node of each type.

The Kubernetes tools (kubeadm in particular) allow one to reconfigure a cluster, so the initial configuration doesn't matter much.

Xen domUs

If your machines are Xen domUs, you'll want to set vcpus equal to maxvcpus in the domU config, because vcpus is what determines the number of vCPUs that appear in /proc/cpuinfo, and thus determines the number of vCPUs that Kubernetes believes to be present. If you over-allocate the vCPUs among the Xen domains, perhaps in order to ensure that they're not underutilized, you can use scheduler weights to adjust the priority that each domain has for the CPUs.

For example, with a 24-core, 32-thread Intel CPU, 32 Xen vCPUs would be available and could be allocated thus:

  • dom0: dom0_max_vcpus=1-4
  • control-plane domU: vcpus = 2, maxvcpus = 2, cpu_weight = 192
  • worker domU: vcpus = 30, maxvcpus = 30, cpu_weight = 128

deployment method

On dedicated hardware, there are several ways to deploy the control plane:

  • A “traditional” deployment uses systemd to run (and keep running) the control plane services. This is a manual configuration. Using kubeadm is preferred because it configures the cluster in a more consistent way; see the next point.
  • The “static pods” deployment, used by kubeadm, lets kubelet manage the life of the control plane services as static pods.
  • The “self-hosted” deployment, in which the cluster itself manages the running of the control plane services. This seems a bit fragile in that a problem in one part of the control plane could cause the whole control plane to fail.

So the way that I prefer for my on-premises hardware is to use kubeadm.

Debian repository limitations

Note that the Debian repository has kubernetes-client, containerd, crun and runc packages, but we're not using these since, as is usual with stable Debian releases, the packages are out-of-date by a year on average, and updates to this code are frequent and often security-related. Also, there are restrictions on which versions of the components can be used together: kubectl, for instance, is supported only within one minor version of the kube-apiserver. Further, the repository doesn't contain the other Kubernetes packages. So simply installing from Debian isn't an option.
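
The “one minor version” rule is easy to check mechanically. This is only a sketch: the function names are hypothetical, it compares only the minor component, and it assumes that both versions share the same major version.

```shell
# Print the minor-version component of a version string such as "v1.31.2".
minor_of() {
    v=${1#v}          # strip an optional leading "v"
    v=${v#*.}         # drop the major component
    echo "${v%%.*}"   # keep what precedes the next dot, if any
}

# Succeed if the two versions differ by no more than one minor version.
# Assumption: both versions have the same major version.
within_one_minor() {
    a=$(minor_of "$1")
    b=$(minor_of "$2")
    d=$(( a - b ))
    [ "${d#-}" -le 1 ]   # ${d#-} drops the sign, giving the absolute value
}

within_one_minor v1.31.2 v1.30.5 && echo "v1.31.2 and v1.30.5: supported skew"
within_one_minor v1.31.2 v1.28.9 || echo "v1.31.2 and v1.28.9: unsupported skew"
```

The version strings here are made up for the example; in practice you'd feed in whatever versions your packages report.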

package installation

These deb packages need to be installed on every node. The cri-o package's services run containers on the node, and the kubelet service connects the node to the control plane. Some of the control plane is implemented by the running of containers (that perform actions specific to the control plane needs) so all the infrastructure is needed on every node. The kubeadm package is only needed for managing the node's membership in the cluster. The kubectl package may not be strictly necessary on every node, but it helps to have it if proxying isn't working.

The process involves setting up apt to fetch from the CRI-O and Kubernetes Debian-style repos. These official Debian packages will be needed to do that:

apt install curl gpg

CRI-O

Kubernetes first needs a container runtime that supports CRI. CRI-O is the more modern choice of the popular implementations. You'll first need to check that some Linux prerequisites are in place, according to the instructions on the “Container Runtimes” page. The CRI-O project has installation instructions, which include Debian-specific instructions.

Following those instructions, and before running systemctl start crio, it's necessary to remove or move aside CRI-O's *-crio-*.conflist CNI configuration files from /etc/cni/net.d/. We will use Calico's CNI configuration instead, which the Calico operator will install. Then continue with running systemctl start crio. Stop after that; the rest of the instructions are addressed in the next step.

Kubernetes will automatically detect that CRI-O is running, from the presence of its UNIX socket. CRI-O needs an OCI-compatible runtime to which it can delegate the actual running of containers; it comes with crun and is configured to use it by default.

Kubernetes

Continue on by following “Installing kubeadm, kubelet and kubectl”. (You will have done most of this in the previous step. The apt-mark hold is still needed.)

Once all the packages are installed, you're ready to run kubeadm init. But first you'll need to understand how to configure it.

dual-stack IPv4/IPv6

In order to avoid DNAT to any Services published from the Kubernetes cluster and SNAT from any Pods, one would like the Pods and Services to have publicly-routable IPv6 addresses. (IPv4 addresses on the Pods would probably be RFC1918 internal addresses, since IPv4 addresses aren't as abundant, and so would need to be NATed in any case.) The Pods and Services should have both IPv4 and IPv6 addresses, so that they can interact easily with either stack. The IPv6 addresses that we're using here are from 2001:db8::/32, which is reserved for use in documentation. They stand in for global-scope unicast addresses; we could instead use unique local addresses (ULAs, the successor to the deprecated site-local addresses), with the caveat that we'd need to NAT external communication.

This is a choice that needs to be made when the cluster is created; it can't be changed later, nor can the chosen CIDRs (at least not via kubeadm). One subnet of each type is needed for Pods and Services. This of course assumes that you have an IPv6 subnet assigned to you from which you can allocate. When sizing the subnets, mind that every Pod and Service needs an address.

For the examples, we'll use 10.0.0.0/16 and 2001:db8:0:0::/64 for the Pod subnet, and 10.1.0.0/16 and 2001:db8:0:1::/108 for the Service subnet. (The Service IPv6 subnet can be no larger than /108. This isn't documented, but the scripts check for it.) For your real-life cluster, I recommend that you choose a random subnet within 10.0.0.0/8, instead of the ones above or the 10.96.0.0/12 default, to avoid address collisions when you have access to multiple clusters or private networks in general. It's probably best to choose two consecutive subnets as above, to make firewalling rules shorter; that is, 10.0.0.0/15 and 2001:db8:0:0::/63 cover both of the above pairs of subnets.
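
The claim that one wider prefix covers both subnets can be checked with some shell arithmetic. This sketch handles only IPv4 (the IPv6 case works the same way, but its 128-bit addresses don't fit in shell integers); the function names are my own.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip4_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1        # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeed if CIDR block $1 lies entirely within CIDR block $2.
cidr_within() {
    inner_net=$(ip4_to_int "${1%/*}"); inner_len=${1#*/}
    outer_net=$(ip4_to_int "${2%/*}"); outer_len=${2#*/}
    # A block can't fit inside one with a longer (narrower) prefix.
    [ "$inner_len" -ge "$outer_len" ] || return 1
    mask=$(( (0xFFFFFFFF << (32 - outer_len)) & 0xFFFFFFFF ))
    [ $(( inner_net & mask )) -eq $(( outer_net & mask )) ]
}

for subnet in 10.0.0.0/16 10.1.0.0/16; do
    cidr_within "$subnet" 10.0.0.0/15 && echo "$subnet is covered by 10.0.0.0/15"
done
```

The same check is handy when picking your own random pair of subnets: if both pass against a single wider prefix, one firewall rule can cover them.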

Forwarding of IPv4 packets must be enabled for most CNI implementations, but most will do this automatically if needed. For dual-stack, you'll also need to enable IPv6 forwarding; it's not clear whether CNI implementations will also do this automatically. In any case, these settings are required for the kubeadm pre-flight checks to pass. Run this on every node:

sysctl -w net.ipv4.ip_forward=1
sysctl -w net.ipv6.conf.all.forwarding=1

and create /etc/sysctl.d/k8s.conf:

net.ipv4.ip_forward=1
net.ipv6.conf.all.forwarding=1

pod network plugin (Calico)

Kubernetes requires a pod network plugin to manage the networking between containers. Container networking is designed to scale to an extremely large number of containers, while still presenting a simple OSI layer 2 (Ethernet), or at least a layer 3 (IP), view of the Kubernetes container network. Optimal use of the network avoids using an overlay network (e.g., VXLAN or IPIP) unless necessary (e.g., between nodes). Here we can avoid an overlay because we have full control over the on-prem network. Specifically, we can use an arbitrary subnet for the pods and services, but still route packets across the cluster, or into or out of the cluster, because we can adjust the routing tables as needed.

Calico also supports NetworkPolicy resources, a desirable feature.

The Pod's DNS Policy is typically the default, ClusterFirst (not Default!); setting hostNetwork: true in the PodSpec would be unusual.

full-mesh BGP routing

Since we have a small (two-node) cluster here, we're going to use a BGP full-mesh routing infrastructure, as provided by Calico. This is the default mode of the Calico configuration. As long as the nodes are all on the same OSI layer 2 network, their BGP servers can find each other and forward packets, without resorting to IPIP encapsulation.

kubeadm init options

The kubeadm init command accepts a (YAML) configuration file with an InitConfiguration and/or a ClusterConfiguration. Let's call it kubeadm-init.yaml:

---
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration

clusterName: kubernetes
controlPlaneEndpoint: control.kubernetes.internal

networking:
  dnsDomain: kubernetes.internal
  podSubnet:     10.0.0.0/16,2001:db8:0:0::/64
  serviceSubnet: 10.1.0.0/16,2001:db8:0:1::/108

apiServer:
  extraArgs:
  - name: service-cluster-ip-range
    value: 10.1.0.0/16,2001:db8:0:1::/108  # same as networking.serviceSubnet

controllerManager:
  extraArgs:
  - name: cluster-cidr
    value: 10.0.0.0/16,2001:db8:0:0::/64  # same as networking.podSubnet

  - name: service-cluster-ip-range
    value: 10.1.0.0/16,2001:db8:0:1::/108  # same as networking.serviceSubnet

  - name: node-cidr-mask-size-ipv4
    value: "16"

---
apiVersion: kubeadm.k8s.io/v1beta4
kind: InitConfiguration

nodeRegistration:
  kubeletExtraArgs:
  # Required for dual-stack; defaults only to IPv4 default address
  # Has no KubeletConfiguration parameter, else we'd set it there.
  - name: node-ip
    value: 192.168.4.2,2001:db8:0:4::2

---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration

cgroupDriver: systemd
clusterDomain: kubernetes.internal

# Enables the node to work with and use swap.
# See https://kubernetes.io/docs/concepts/architecture/nodes/#swap-memory
failSwapOn: false
memorySwap:
  swapBehavior: LimitedSwap

---
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration

mode: nftables
clusterCIDR: 10.0.0.0/16,2001:db8:0:0::/64  # same as ClusterConfiguration.networking.podSubnet

  • Although this looks like a YAML file, it doesn't accept all of the YAML 1.2 syntax. If you try to place comments or directives before the first "---", kubeadm will fail to parse it.
  • Note that the convention that I use here is that the DNS domain in which all of this cluster's parts exist is formed from the name of the cluster (here, kubernetes). So the domain is kubernetes.internal, the load-balanced control endpoint is control.kubernetes.internal, the Services subdomain is svc.kubernetes.internal, etc.
  • We're going to pretend here that the machine on which we're installing Kubernetes has IP addresses 192.168.4.2 and 2001:db8:0:4::2. In order for the kubelet service on the (worker) node to listen on both the IPv4 and IPv6 addresses of its node, we need to configure InitConfiguration.nodeRegistration.kubeletExtraArgs as above. Else it would only listen on one of those.
  • We're using here the subnets that we chose above.
  • Since we're preparing here for later expansion of the control plane to other nodes, we'll need to specify the ClusterConfiguration.controlPlaneEndpoint. (This represents the --control-plane-endpoint command-line value.) Easiest is to use DNS round-robin A records, but since I plan to run DNS on the cluster, I'll just add an /etc/hosts entry for now.
  • The kubelet's configuration is supplied as the KubeletConfiguration document in the same file. In it, we're taking advantage of any swap that may be available to the machine.
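
The /etc/hosts entry for the control plane endpoint can be added idempotently along these lines. This is a sketch: add_hosts_entry is a hypothetical helper of mine, and the address in the usage comment is the pretend one used above.

```shell
# Append "ADDRESS<TAB>NAME" to a hosts file, unless NAME already appears
# in that file as a whole word.
# Usage: add_hosts_entry ADDRESS NAME [FILE]    (FILE defaults to /etc/hosts)
add_hosts_entry() {
    addr=$1
    name=$2
    file=${3:-/etc/hosts}
    if grep -qwF -- "$name" "$file" 2>/dev/null; then
        return 0    # already present; leave the file alone
    fi
    printf '%s\t%s\n' "$addr" "$name" >> "$file"
}

# As root, on every node that needs to reach the endpoint:
#   add_hosts_entry 192.168.4.2 control.kubernetes.internal
```

Once a real load-balanced or DNS-based endpoint exists, the entry should be removed again, or it will quietly override DNS.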

kubeadm reset

If the kubeadm init or kubeadm join fails to complete, kubeadm reset can be used to revert most of its effect. The command output spells out what other state needs clean-up.

initial control plane node

The first control plane node is where the Kubernetes cluster is created/initialized. Any additional control plane node will be joined to the cluster, and so would be handled differently.

It's assumed that Debian is already installed on the node, and that there is no firewalling set up on the node. As an option, Kubernetes provides its own mechanism for adding firewalling that's compatible with its services, which we'll add below.

build the cluster

Then we run:

kubeadm init --config=kubeadm-init.yaml

You should use script to capture the command's output when you run it, since the final output includes a token and certificate needed for future steps. If you perform the configuration setup that's described in the command's final output, you can then, as a non-root user, run kubectl cluster-info to verify that the cluster's running. You might also want to enable Bash autocompletion for kubectl.
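
Put together, those steps might look like the following sketch. It's wrapped in functions (with names that I made up) so that nothing runs until you call it. The kubeconfig steps are the ones that kubeadm init printed for me; check them against the output that you actually get.

```shell
# As root: run kubeadm init under script(1), so that the join token and
# CA certificate hash in its output are kept in kubeadm-init.log.
init_with_log() {
    script -c 'kubeadm init --config=kubeadm-init.yaml' kubeadm-init.log
}

# As the non-root user (with sudo rights): copy the admin kubeconfig into
# place, as suggested at the end of the kubeadm init output.
setup_kubeconfig() {
    mkdir -p "$HOME/.kube"
    sudo cp -i /etc/kubernetes/admin.conf "$HOME/.kube/config"
    sudo chown "$(id -u):$(id -g)" "$HOME/.kube/config"
}

# As the non-root user: verify that the cluster answers.
check_cluster() {
    kubectl cluster-info
}
```

Bash completion can then be loaded with source <(kubectl completion bash), typically from ~/.bashrc.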

deploy Calico

Once the cluster is running, and before adding applications, we need to install the network plugin. This has two parts: install the Tigera Kubernetes Operator, and deploy Calico (using the operator). We'll follow the Calico instructions for a self-managed, on-premises cluster. (The same thing could be done with Helm, but that presumes having Helm already installed.) If you've set up the non-root user's ~/.kube/config according to the instructions output by kubeadm init, you can run the instructions as that user.

You will need to customize the custom-resources.yaml file's Installation.spec.calicoNetwork.ipPools to match the IPv4 and IPv6 pools chosen above. Calico has instructions for configuring for dual-stack. For example, given the above subnets, the Installation configuration should be:

apiVersion: operator.tigera.io/v1
kind: Installation

metadata:
  name: default
spec:
  calicoNetwork:
    linuxDataplane: Nftables
    ipPools:
    - name: default-ipv4-pool
      cidr: 10.0.0.0/16  # from ClusterConfiguration.networking.podSubnet
      encapsulation: None
      natOutgoing: Enabled

    - name: default-ipv6-pool
      cidr: 2001:db8:0:0::/64  # from ClusterConfiguration.networking.podSubnet
      encapsulation: None
      natOutgoing: Disabled
---
apiVersion: operator.tigera.io/v1
kind: APIServer

metadata:
  name: default
spec: {}

  • We're assuming that the IPv6 addresses are global scope, so SNAT is disabled; if we chose unique local addresses instead, then we would enable SNAT.
  • We're assuming that the Kubernetes nodes (on which BIRD, Calico's BGP server, runs) are all connected together on the same OSI layer 2 network, so that the BIRD instances can find each other automatically.
  • We need to tell Calico to use nftables, consistent with our choice for Kubernetes.

If you have a slow Internet connection, it may take some minutes for the Calico pods to come up, because images need to be downloaded. (This is, of course, true for any new images that you'll be starting.)

You can optionally validate dual-stack networking.

worker node(s)

Before you can deploy workloads to the cluster, you'll need to add a worker node. We assume that a machine has been set up according to the “package installation” and “dual-stack” sections above. The end of the output from the kubeadm init command above contains the kubeadm join command that you should run on the machine that will be the new worker node, e.g.:

kubeadm join control.kubernetes.internal:6443 --token asdf… \
        --discovery-token-ca-cert-hash sha256:deadbeef…

The preferred way to use these values is in a JoinConfiguration file (e.g., called kubeadm-join.yaml):

apiVersion: kubeadm.k8s.io/v1beta4
kind: JoinConfiguration

discovery:
  bootstrapToken:
    apiServerEndpoint: control.kubernetes.internal:6443
    token: "asdf…"
    caCertHashes:
    - "sha256:deadbeef…"

nodeRegistration:
  kubeletExtraArgs:
  - name: "node-ip"
    value: "192.168.4.3,2001:db8:0:4::3"

  • We're going to pretend here that the machine on which we're installing Kubernetes has IP addresses 192.168.4.3 and 2001:db8:0:4::3, analogous to what we did in the InitConfiguration.
  • The token is only good for 24 hours (by default). You can create a new one with kubeadm token create on a control plane node.
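
If the token has expired by the time you add the node, something like this sketch regenerates what you need. The wrapper function names are mine; the kubeadm subcommands are the standard ones, to be run as root on a control plane node.

```shell
# Create a fresh bootstrap token and print a complete "kubeadm join"
# command line (endpoint, token and CA certificate hash) that uses it.
new_join_command() {
    kubeadm token create --print-join-command
}

# List the existing bootstrap tokens along with their expiry times.
list_tokens() {
    kubeadm token list
}
```

The token and hash printed by new_join_command can be copied into the JoinConfiguration in place of the old values.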

Then you can run kubeadm join --config=kubeadm-join.yaml control.kubernetes.internal:6443. (Note that the server name, control.kubernetes.internal, needs to be resolvable on the worker node. So you might need to add it to /etc/hosts, DNS, etc.)

machine shutdown

So what happens when the machine on which we're running the whole cluster shuts down? Debian uses systemd, so the cri-o and kubelet services will be shut down, and with them all of the workload and control plane containers that run.

upgrading

Kubernetes releases a new minor version about three times per year, with a few patch versions in between. CRI-O follows the Kubernetes minor release cycle. See “Upgrading kubeadm clusters” for the full details. “Skipping MINOR versions when upgrading is unsupported.”

You may wonder why we need to place a hold on the Kubernetes packages. Couldn't we remove the hold and perform the upgrades automatically, using unattended-upgrades? The answer is no, because there are sometimes manual steps required even after the software packages have been upgraded. Specifically, a kubeadm upgrade apply will be needed, and sometimes the configuration API changes.

next steps

Tuesday, September 24, 2024

how to boot a Crucial firmware ISO image from a USB stick

Crucial sells NVMe SSD cards, such as the T500 series, for which they provide firmware updates. The latest firmware update (version P8CR004), however, is provided as an ISO image that may be bootable when copied to a CD, but not when copied to a USB stick using something straightforward like:

dd if=T500_P8CR004.iso of=/dev/sd_ bs=4M status=progress oflag=sync

where /dev/sd_ is the USB stick. The USB stick will boot, but only to a GRUB prompt. (It will even boot on a UEFI-only system, despite the Crucial instructions saying that "legacy" or "compatibility" mode is required.) It seems that a GRUB configuration file is missing.

Many sources, including Crucial's own instructions, will tell you to try various installer utilities (Etcher, Universal USB Installer, Rufus, usb-creator, et al.), but none of them performs the job any better than dd. The trick to booting the image is to run the following commands at the GRUB prompt:

set root=(cd0)
linux /boot/vmlinuz64 libata.allow_tpm=1 quiet base loglevel=3 waitusb=10 superuser mse-iso rssd-fw-update rssd-fwdir=/opt/firmware
initrd /boot/corepure64.gz
boot

These arguments come from the isolinux.cfg file, so check there for the exact values. The Micron Storage Executive should run automatically, printing something like:

Setting up Micron Storage Executive... DONE.

Upgrading drive /dev/nvme0 [Serial No. 2222222AAAA2] to P8CR004
..
Device Name  : /dev/nvme0
Firmware Update on /dev/nvme0 Succeeded!
CMD_STATUS   : Success
STATUS_CODE  : 0
TIME_STAMP   : Sun Sep 22 03:42:45 2024
Please remove bootable media. Press any key to continue...

Monday, September 2, 2024

how to install Xen on Debian 12 “bookworm”

A brief guide to installing Xen with a single "PVH" domU on Debian using contemporary hardware.

motivation

There are a few “official” instructions for installing Xen, and many unofficial ones (like this). The “Xen Project Beginners Guide” is a useful introduction to Xen, and even uses Debian as its host OS. But for someone who is already familiar with Xen and simply wants a short set of current instructions, it's too verbose. It even includes basics on installing the Debian OS prior to installing Xen, which is something that I take as a given here. Further, it doesn't address the optimized “PVH” configuration for domUs, which is available for modern hardware. Much of the Xen documentation seems to have last been touched in 2018, when AWS was still using Xen.

The Debian wiki also has a series of pages on “Xen on Debian”, but the writing appears unfocused, speculating about all sorts of alternative approaches one could take. Some useful information can be gleaned from it, but it doesn't have the brevity that I'm looking for here.

The Xen wiki's “Xen Common Problems” page is a good source for various factoids, but not a set of cookbook instructions. Various unofficial instructions can be found on the Web, but I found them to be incomplete for my purposes.

preparation

Xen 4.17 is the current version in Debian 12.6.0; Xen 4.19 was recently released, so the Debian version is probably sufficiently recent for most needs.

VT-d virtualization is required for Intel processors. (I don't address AMD or other chip virtualization standards, but the corresponding technology is required in that case.) In /boot/config-*, one can confirm that CONFIG_INTEL_IOMMU=y for the kernel, and “dmesg | grep DMAR” (in the non-Xen kernel[1]) returns lines like:

ACPI: DMAR 0x00000000… 000050 (v02 INTEL  EDK2     00000002     01000013)
ACPI: Reserving DMAR table memory at [mem 0x…-0x…]
DMAR: Host address width 39
DMAR: DRHD base: 0x000000fed91000 flags: 0x1
…
DMAR-IR: Enabled IRQ remapping in xapic mode
…
DMAR: Intel(R) Virtualization Technology for Directed I/O

so VT-d seems to be working. If VT-d is not available, you may need to enable it in your BIOS settings.

The Xen wiki's “Linux PVH” page has information on PVH mode on Linux, but it hasn’t been updated since 2018 and references Xen 4.11 at the latest. All of the kernel config settings mentioned there are present in the installed kernel, except that CONFIG_PARAVIRT_GUEST doesn’t exist. I assume it was removed.

Xen installation on Debian

start Xen

apt install xen-system-amd64 xen-tools

installs the Xen Debian packages and the tools for creating the domUs. (If you see references that say to install the xen-hypervisor virtual package, know that xen-system-amd64 depends on the latest xen-hypervisor-*-amd64. You may need a different architecture than -amd64.) Rebooting will go into Xen: it will have been added to GRUB as the default kernel.

configure dom0

configure GRUB’s Xen configuration

The Xen command line options are described in /etc/default/grub.d/xen.cfg. The complete documentation is at https://xenbits.xen.org/docs/unstable/misc/xen-command-line.html.

In /etc/default/grub.d/xen.cfg, set:

XEN_OVERRIDE_GRUB_DEFAULT=1

so that GRUB doesn’t whine about Xen being prioritized.

  • By default, dom0 is given access to all vCPUs (we'll assume 32 on this hypothetical hardware) and all memory (64GB, here). It doesn’t need that much. Furthermore, as domUs are started, the dom0 memory is ballooned down in size, so that the dom0 Linux kernel no longer has as much memory as it thought it had at start-up. So the first step is to scale this back: dom0_mem=4G,max:4G. (I've been told that even 1G should suffice for a server.) The fixed size will avoid ballooning at all. Likewise, for the vCPUs: dom0_max_vcpus=1-4.
  • Since we’re not running unsafe software in dom0, we can set xpti off there.
  • Since modern processors support deeper sleep states, we may benefit by enabling Xen to use those states with the cpuidle option.

So in /etc/default/grub.d/xen.cfg, set:

GRUB_CMDLINE_XEN_DEFAULT="dom0_mem=4G,max:4G dom0_max_vcpus=1-4 xpti=dom0=false,domu=true cpuidle"

(There’s no need to change the autoballoon setting in /etc/xen/xl.conf, since "auto" does what’s needed.)

Then run update-grub and reboot.

configure Xen networking

create a xenbr0 bridge

Xen domUs require a bridge in order to attach to the dom0’s network interface. (There are other options, but bridging is the most common.) Following the Xen network configuration wiki page, in /etc/network/interfaces, change:

allow-hotplug eno1
iface eno1 inet static
…

to:

iface eno1 inet manual

auto xenbr0
iface xenbr0 inet static
	bridge_ports eno1
	bridge_waitport 0
	bridge_fd 0
	… # the rest of the original eno1 configuration

(Obviously this is assuming that your primary interface is named eno1, which is typical for an onboard Ethernet NIC.) Run ifdown eno1 before saving this change, and ifup xenbr0 after.

xenbr0 is the default bridge name for the XL networking tools, which is what we’ll use.

about Netfilter bridge filtering

You may see a message in the kernel logs:

bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.

This is of no concern. It gets printed when the bridge module gets loaded, in the process of bringing up xenbr0 for the first time since boot.

It used to be that bridged packets would be sent through Netfilter. This was considered to be confusing, since it required setting up a Netfilter FORWARD rule to accept those packets—something that most people expected automatically with a bridge. The solution was to remove that behavior to a new module (br_netfilter). This message is a remnant reminder of the change, for those who were depending on it. See the kernel patch; it has been this way since 3.18.

configure the dom0 restart behavior

When dom0 is shut down, by default Xen will save the images of all running domUs, in order to restore them on reboot. This takes some time, and disk space for the images, and most likely you'll want to shut down the domUs instead. To configure that, in /etc/default/xendomains, set:

XENDOMAINS_RESTORE=false

and comment out XENDOMAINS_SAVE.
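
If you'd rather script that edit, here's a sketch using GNU sed. The function name is hypothetical, and it assumes that both variables appear uncommented at the start of a line, as they did in my copy of the file.

```shell
# Set XENDOMAINS_RESTORE=false and comment out XENDOMAINS_SAVE.
# Usage: disable_domu_save [FILE]    (FILE defaults to /etc/default/xendomains)
disable_domu_save() {
    file=${1:-/etc/default/xendomains}
    sed -i \
        -e 's/^XENDOMAINS_RESTORE=.*/XENDOMAINS_RESTORE=false/' \
        -e 's/^XENDOMAINS_SAVE=/#&/' \
        "$file"
}
```

Running it a second time changes nothing, since the commented-out line no longer matches.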

create a PVH domU

create the domU configuration

There isn’t any obvious documentation on using xen-create-image, the usual tool, for creating a specifically PVH Linux guest. So this is your summary.

Edit /etc/xen-tools/xen-tools.conf, which is used to set variables for the Perl scripts that xen-tools uses, to set:

lvm = "server-disk"
pygrub = 1

This assumes that you're using LVM to provide a root file system to the domUs, and the VG for the domUs is named as shown. Then run this (I recommend using script, though a log is created by default):

xen-create-image --hostname=my-domu.internal --verbose \
  --ip=192.168.1.2 --gateway=192.168.1.1 --netmask=255.255.255.0 \
  --memory=60G --maxmem=60G --size=100G --swap=60G

or whatever settings you choose; see the xen-create-image man page for explanations.

The --memory setting can be tuned later to the maximum available memory, if you're not adding any other domU. It's only used to set the memory setting in the /etc/xen/*.cfg file, and can be edited there. Likewise for maxmem. Setting them equal provides the benefit that no memory ballooning will be performed by Xen on the domU, so there will be no surprises while the domU is running and unable to obtain more memory. The available memory for domUs can be found in the free_memory field in the xl info output in the dom0; it may not be precisely what you can use, since there may be some unaccounted-for overhead in starting the domU.
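
To pull just that field out of the xl info output, a small filter like this sketch will do. The function name is mine, and the sample line in the comment only shows the layout that I'm assuming (a padded name, a colon, then the value).

```shell
# Read `xl info` output on stdin and print the value of its free_memory
# field. Assumed layout of the line: "free_memory            : 59012"
xl_free_memory() {
    awk -F: '$1 ~ /^free_memory[ \t]*$/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# In the dom0, as root:
#   xl info | xl_free_memory
```

Leave some headroom below the printed figure when setting memory and maxmem, for the overhead mentioned above.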

The --size and --swap settings for the root and swap LV partitions can be expanded later if needed, using the LVM tools in the usual way.

Adjust the /etc/xen/*.cfg file by adding:

type     = "pvh"
vcpus    = 4
maxvcpus = 31
xen_platform_pci = 1
cpu_weight = 128

The maxvcpus setting here assumes that 32 vCPUs are available; it leaves one for the exclusive use of the dom0. Four vCPUs should be enough to start the domU quickly. The cpu_weight deprioritizes the domU's CPU usage vs. the dom0's. xl sched-credit2 shows the current weights.

To have the domU automatically start when dom0 starts:

mkdir -p /etc/xen/auto
cd /etc/xen/auto
ln -s ../my-domu.internal.cfg .

(You may see instructions that tell you to symlink the whole /etc/xen/auto directory to /etc/xen. The downside is that Xen will report warnings as it tries to parse the non-configuration example files in /etc/xen.)

fix the domU networking configuration

Due to Debian bug #1060394, the domU's /etc/network/interfaces is generated with eth0 rather than enX0, the name that the interface actually gets. You can mount the domU's LV disk device temporarily in order to correct this.
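
The correction itself can be sketched as below, using GNU sed. The function name is hypothetical, and the LV path in the comment is an assumption, based on xen-tools naming the root LV after the hostname in the VG configured above; check lvs for the real name, and do this only while the domU is shut down.

```shell
# Rewrite the whole word "eth0" to "enX0" in an interfaces(5) file.
# Usage: fix_domu_iface FILE
fix_domu_iface() {
    sed -i 's/\beth0\b/enX0/g' "$1"
}

# In the dom0, as root, with the domU not running:
#   mount /dev/server-disk/my-domu.internal-disk /mnt   # assumed LV name
#   fix_domu_iface /mnt/etc/network/interfaces
#   umount /mnt
```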

add software to the domU

Except for the sudo package needed for the next step, the rest of this is optional, but these are things that I typically do. This requires starting the domU (with xl create) and logging in to it (with xl console, using the root password given at the end of the xen-create-image output).

Edit /etc/apt/sources.list to drop the deb-src and use non-free-firmware. While you're in there, fix the Debian repositories to be the ones that you wanted; there's a bug in xen-create-image in that it doesn't use the repos from dom0. Then:

apt update
apt upgrade
apt install ca-certificates

Now you can edit /etc/apt/sources.list to use https. Then:

apt update
apt install sudo

plus any other tools you find useful (aptitude, curl, emacs, gpg, lsof, man-db, net-tools, unattended-upgrades,…).

add a normal user to the domU

Connecting to the domU can only be done, initially, with xl console, which requires logging in as root, since that’s the only user that we created so far. (The generated root password will have been printed at the end of the xen-create-image output.) xl console appears to have limitations to its terminal emulation, so connection via SSH would be better. An SSH server is already installed. The SSH daemon (by default) prohibits login as root, and anyway it's best to not log in as root, even via the console, so create a normal user that has complete root privileges [2]:

adduser john
adduser john sudo
passwd --lock root

That's all you need to get started with your new domU!


1 The “non-Xen kernel” is the Linux kernel that is installed with a simple Debian installation, that is, a kernel that isn't enhanced with the Xen hypervisor. When the hypervisor is running in the kernel, it hides certain information from the kernel. Most of that information can be found when the hypervisor is running by using “xl dmesg”.

2 Note that once you lock root's password, if you log out of the console without having created the normal user with admin privileges, you will be locked out of the domU. The way to get access again in that case is to shut down the domU, mount its file system, and edit /etc/shadow to remove the “!” at the start of root's password.
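That recovery edit can be sketched as follows, assuming GNU sed and, again, a hypothetical LV path for the domU's disk. It removes only the single “!” that passwd --lock places in front of root's password hash.

```shell
# Remove the "!" that "passwd --lock" put in front of root's password hash.
unlock_root_in() {
    sed -i 's/^root:!/root:/' "$1"
}

# Usage from dom0, with the domU shut down (the LV path is hypothetical):
#   mount /dev/vg0/my-domu-disk /mnt
#   unlock_root_in /mnt/etc/shadow
#   umount /mnt
```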

Monday, August 5, 2024

testing the fanless system

The fanless system build that I did earlier should keep the CPU cool under load. To test that, I booted the system on the "standard" (text-based) image of Debian Live 12.6.0 and ran S-TUI for about 15 minutes. The resulting graphs are below.

average CPU core frequency
CPU core & package temperatures

S-TUI loads the CPU with work. To accommodate this, the CPU clock frequency jumped from its usual 800MHz briefly to 3GHz, but then settled down to 2GHz. The cores remained below 60℃. So either the cooling system of the Streacom case is doing its job well, or the CPU's logic is already protecting the chip from overheating. Either way, I'm satisfied.
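I took the readings from S-TUI's graphs, but the same quantities can be spot-checked from a shell. The sketch below is not what I ran; it assumes the standard Linux cpufreq sysfs files, which report kHz, and the helper name is made up.

```shell
# Average a list of kHz readings (one per line on stdin) and print MHz.
avg_mhz() {
    awk '{ sum += $1; n++ } END { if (n) printf "%.0f\n", sum / n / 1000 }'
}

# Usage on a running Linux system:
#   cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq | avg_mhz
# Temperatures are exposed in millidegrees Celsius, e.g.:
#   cat /sys/class/thermal/thermal_zone*/temp
```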

Friday, July 19, 2024

a 2024 fanless, server-grade system build

I spec and assemble components for a fanless, server-grade home computer, using the technology available in 2024. The technology will likely be relevant for a few years beyond that.

motivation

Off-the-shelf retail computer systems that are aimed at being competitive in performance are often built with cooling fans, in order to run CPUs at TDP levels that would otherwise not be possible. Fans, however, turn the box into an air filter, with dust that is present in the home catching on fan blades, heat sink fins, and electrical components. There are fanless designs, but these seem to assume that performance isn't important, and specifically that the system isn't going to be used in a server role, where 24×7 use demands ECC support at the motherboard level.

I'll also want 10GbE port capability at some point, since my ISP will soon be providing 10Gb service.

components

To fill the gap in the off-the-shelf offerings, I assembled the following system from these components:

component                                                  price
Intel Core i9-14900T processor                             $650
Supermicro X13SAZ-F motherboard                            $600
Streacom FC9 V2 case, HT4 riser, Nano160 PSU               $560
2 × Crucial MTC20C2085S1EC56BR 32GB ECC DDR5-5600 UDIMM    $325

These 2024 prices are in USD, approximate, with shipping and California taxes.

The FC9 provides up to 87W of conductive cooling via heat pipes to its external wall. The low-power “T” variant of the CPU has a 35W base TDP and can draw up to 106W. The motherboard only provides two 1GbE ports, but both the case and motherboard support the ability to add a 10GbE PCI card later.

Intel low-power CPUs

The “T” variants of the Intel CPUs are difficult to find. I bought mine from Quiet PC Ltd. in the UK. Intel only sells them to OEMs, and those in the U.S. are generally large companies, like Dell, that are only interested in selling complete systems, not in selling components to hobbyists.

But the low-power variant is technically required: The X13SAZ-F motherboard only supports a TDP up to 125W. The FC9 case only supports a TDP of 87W. I have read opinions that the vanilla Intel CPUs can be tuned to have the same characteristics as the “T” variant, but it requires a level of settings fiddling and undervolting that I'm not comfortable attempting and that would take too much time to learn.

I would have preferred to buy the i9-13900T, since its specs are about the same and it should cost about $120 less, but I was unable to find any after Quiet PC Ltd. recently ran out of stock.

power supply

The CPU power supply (JPV1) is an 8-pin socket capable of supplying 360W of power to the motherboard, particularly when the 24-pin JPW1 is not being used. The Nano160 PSU, however, only has a 4-pin plug for this socket, which is only physically capable of supplying 180W, and which is further limited by the 160W maximum of the PSU. Since the CPU power usage should not exceed about 106W, these limitations should be OK. The 4-pin plug simply fills only half of the 8-pin socket.
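The power and cooling limits quoted so far can be checked with a little shell arithmetic. This is only a back-of-the-envelope sketch: it uses the figures from the text and ignores everything else that draws from the PSU (RAM, drives, the motherboard itself).

```shell
# All figures are the ones quoted in the text, in watts.
cpu_max=106        # maximum draw of the i9-14900T
case_cooling=87    # what the FC9 can dissipate
board_limit=125    # highest TDP the X13SAZ-F supports
plug_4pin=180      # physical limit of the 4-pin plug
psu_max=160        # total output of the Nano160

# Print the smaller of two integers.
min() { if [ "$1" -le "$2" ]; then echo "$1"; else echo "$2"; fi; }

supply_limit=$(min "$plug_4pin" "$psu_max")
echo "supply limit: ${supply_limit}W, headroom over CPU: $((supply_limit - cpu_max))W"
echo "board margin: $((board_limit - cpu_max))W"
echo "cooling shortfall at maximum draw: $((cpu_max - case_cooling))W"
```

The supply side has room to spare, but the case cannot shed the CPU's maximum draw indefinitely, which is why sustained power will need to be capped.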

assembly

tools

Assembly is straightforward, requiring only a Phillips screwdriver and thermal paste. A screwdriver for delicate applications, one with a narrow shaft, is preferred over one for heavier usage, so that it doesn't scratch the case. A 5mm wrench for lightly tightening the standoffs, and a 10mm wrench for tightening the PSU socket to the case, are useful. A headband magnifier is advisable, for inspecting the tiny pins of the CPU or RAM.

The FC9 and HT4 each come with some of Streacom's thermal paste—which I did use for the IHS-HT4 interface—but I think you'll find that the total amount that they give you is difficult to stretch. The heat pipes have some irregularities and the HT4 Upper Mount has some texture that you'll want to fill.

user guides

Assembling the system requires reading the Streacom FC9 and HT4 manuals and the X13SAZ-F motherboard manual, all of which are online, not packaged with the components. Intel has a reassuring video on LGA 1700 installation that you should watch, too. When installing the CPU or RAM, check for dust on the contacts, and blow it away if present.

PSU

Contrary to the FC9 documentation, the PSU needs to be installed before the CPU heatsink, because it sits below the heat pipes on this motherboard layout. The 4-pin PSU plug can only fit into the half of JPV1 nearest JPW1. Bend the black and yellow loops of wire of the 4-pin plug, as shown below, so that they avoid the heat pipes above them; the longer black and yellow wires that lead to the plug can be positioned between the PSU and the RAM. Also attach the disk drive extension to the PSU card at this point.

the Nano160 PSU board, installed

CPU cooler

positioning

Because the CPU on this motherboard is located closer to the front of the case than usual for Streacom's design, Streacom advises a non-standard configuration of the heat pipes that lets one of the longer pipes fit but allows for only three pipes in total. Since I'm using the HT4 riser, however, I'm able to utilize all four pipes in a configuration that has one of the longer pipes inverted, lower, and pointing towards the rear of the case instead of the front. Using the heat pipe numbering in the FC9 manual:

  1. short, straight
  2. short, bent
  3. long, bent
  4. long, straight

I instead order the pipes: ①, ②, ④, ③, with ③ being inverted.

required heat pipe order and orientations

Note that the vertical pipes on the HT4 are positioned towards the side that the FC9 heat pipes are on. Streacom recommends this so that there is more proximity and contact between the sets of pipes. This does place the HT4 pipes closer to the RAM chips, but there seems to be adequate space between them.

You'll also notice that the SH2 heat pipes only overlap the HT4 ones by half the possible length. Streacom thinks this is adequate. There is an LH4 set of longer heat pipes that you could use instead for more overlap if you feel this is necessary.

Once the cooler is assembled as shown above to test the positioning, ensure that all the screws on the cooler are tightened:

  • the three on the Heatsink Connector Blocks,
  • the four on the HT4 Lower Mount, especially paying attention that the Adjustable Arms are immovable, and
  • the four Spring Loaded Screws.

We'll leave the HT4 Upper Mount off for now. Lower the drive rack, to ensure that its edge fits between the heat pipes! (See the photo below.)

One thing to note about the inverted pipe: The heat pipes are copper tubes filled with some water in a vacuum. Ideally, the cooling end (the one against the case) should be elevated, so that the water can condense and flow back to the CPU. In the inverted pipe, the water will instead tend to pool at the cooling end, so it won't be available to aid in the heat transfer. Still, the copper of the pipe will provide some conduction, so it's better than not using it.

other approaches

Do not cut the heat pipes. The idea crossed my mind as a way of shortening the ones pointing towards the front, but doing so would ruin their effectiveness. Maybe bending the ends would be possible, but you're on your own there. I'd be afraid that heating the copper enough to bend it easily could cause the water inside to burst the pipe.

thermal paste

As mentioned above, the Streacom TX13 paste works fine in the IHS-HT4 interface, because there's no need to spread it: just apply a reasonable amount and squish. For the heat pipes, though, I chose ProlimaTech PK-2: it has properties like TX13, but comes in a larger size, in a tube that dispenses more easily. It's electrically non-conductive and only slightly more fluid, so it doesn't run on vertical surfaces like the case wall. The viscosity of the paste also seems to correlate with thermal conductivity, another desirable property. I also don't buy the “reduced waste” greenwashing of the TX13 product when it takes 20 of their little 0.25g packets to match one 5g tube of PK-2.

  • Remove the four Spring Loaded Screws, and the HT4 component. Apply some TX13 to the IHS (as directed in the HT4 manual). Secure the HT4 again in its previous position with the Spring Loaded Screws.
  • For each Heatsink Connector Block in turn, starting low and working upwards:
    • Remove the pipe(s).
    • Apply thermal paste to (a) the extent of the HT4 pipe(s) that make contact with the SH2 pipe(s), and (b) both sides of the SH2 pipe(s) that will be under the Heatsink Connector Block.
    • Reattach the Heatsink Connector Block.
  • Apply thermal paste to the tops of the SH2 pipes where they'll be covered by the HT4 Upper Mount. Reattach the HT4 Upper Mount.
the completed heat sink

tolerances

There is very little space between the heat pipe and the Nano160 PSU board.

proximity of the SH2 heat pipe to the corner of the Nano160 PSU board

When the disk rack is lowered into place, there is around a millimeter gap between it and the HT4.

proximity of the drive tray edge to the top of the HT4

The rack drops very fortunately between the heat pipes. Ensure that this happens intentionally for you! It was accidental in my case. It even shaved a little copper off the pipe as I lowered it. The rack edge makes contact with that heat pipe, for a little extra cooling, I guess.

contact between the drive tray edge and an SH2 heat pipe

The rack presses against the USB port cable. It digs in a little, but nothing of concern.

contact between the drive tray edge and the USB cable

All in all, a very fortunate occurrence.

next steps

I followed this up by running Memtest86+, just to test that the memory works and to perform a basic test on the CPU and motherboard. The maximum CPU temperature during the test was 43℃; most of the heat seemed to be coming from the memory chips. I'll need to install disks, and install Debian Linux, testing the disks for errors.

But after that, I'll want to run performance tests on the system, to see that it actually stays cool and fast at the same time. This will probably include capping the maximum sustained heat output of the CPU by setting the PL2 in the BIOS settings to 87W. Stay tuned for that blog post.
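Separately from the BIOS setting, Linux exposes the package power limits through its powercap (intel-rapl) interface, which is a convenient way to confirm what is in effect after boot. The sketch below only reads the values; it assumes that the intel_rapl driver is loaded and that package 0 appears as intel-rapl:0, where the short_term constraint generally corresponds to PL2. The function names are made up.

```shell
# Convert microwatts (the unit used by the powercap files) to whole watts.
uw_to_w() { echo $(( $1 / 1000000 )); }

# Print the package-level constraints (long_term ~ PL1, short_term ~ PL2).
show_rapl_limits() {
    d=/sys/class/powercap/intel-rapl:0
    for n in 0 1; do
        # Skip silently if the interface is not present on this system.
        [ -r "$d/constraint_${n}_power_limit_uw" ] || continue
        echo "$(cat "$d/constraint_${n}_name"): $(uw_to_w "$(cat "$d/constraint_${n}_power_limit_uw")")W"
    done
}

# Usage:
#   show_rapl_limits
```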

Wednesday, September 23, 2015

reducing memory usage in a graph structure using Scala

A straightforward implementation of Ukkonen's suffix tree algorithm in Scala is optimized to reduce memory usage, by using a custom, open addressing hash map implementation, along with other tricks to minimize the number of object instances.