Fixing network issues with a Raspberry Pi and Kubernetes
In this post, I am going to cover how to replace dhcpcd with systemd to resolve some no route to host errors.
My cluster has been having networking issues for a while. Usually they would go away on their own, but this time they took my cluster down.
The issue I was having is that pods could not communicate with the control plane or with pods on any other node. I was getting a lot of no route to host errors in my pods. This would cause things like the kubernetes-dashboard failing to start or crashing randomly, nginx being unable to talk to certain pods, and any number of other problems.
I use the Calico CNI inside of Kubernetes. The nodes could all talk to each other, and each node could ping the pods running on it. However, pods could not talk to pods on any other node. Neither curl nor ping would work; both failed with the same no route to host error message.
Further baffling me were the following messages in the logs on all of my nodes:
Feb 24 13:22:51 kube1w1 dhcpcd[555]: cali6376b8bb8ab: pid -2118670846 deleted route to 169.254.0.0/16
Feb 24 13:22:51 kube1w1 dhcpcd[555]: cali6376b8bb8ab: adding route to 169.254.0.0/16
Feb 24 13:22:51 kube1w1 dhcpcd[555]: cali1eafad21ac5: pid -2118670846 deleted route to 169.254.0.0/16
Feb 24 13:22:51 kube1w1 dhcpcd[555]: cali1eafad21ac5: adding route to 169.254.0.0/16
Feb 24 13:22:51 kube1w1 dhcpcd[555]: cali0e5d8ea32b6: pid -2118670846 deleted route to 169.254.0.0/16
Feb 24 13:22:51 kube1w1 dhcpcd[555]: cali0e5d8ea32b6: adding route to 169.254.0.0/16
Feb 24 13:22:51 kube1w1 dhcpcd[555]: cali7cbafc8827c: pid -2118670846 deleted route to 169.254.0.0/16
Feb 24 13:22:51 kube1w1 dhcpcd[555]: cali7cbafc8827c: adding route to 169.254.0.0/16
Feb 24 13:22:51 kube1w1 dhcpcd[555]: calice0e46d0038: pid -2118670846 deleted route to 169.254.0.0/16
Feb 24 13:22:51 kube1w1 dhcpcd[555]: calice0e46d0038: adding route to 169.254.0.0/16
Feb 24 13:22:51 kube1w1 dhcpcd[555]: calib4841ccaf68: pid -2118670846 deleted route to 169.254.0.0/16
Feb 24 13:22:51 kube1w1 dhcpcd[555]: calib4841ccaf68: adding route to 169.254.0.0/16
Feb 24 13:22:51 kube1w1 dhcpcd[555]: cali221dca70657: pid -2118670846 deleted route to 169.254.0.0/16
Feb 24 13:22:51 kube1w1 dhcpcd[555]: cali221dca70657: adding route to 169.254.0.0/16
Feb 24 13:22:51 kube1w1 dhcpcd[555]: cali53a6a2bcd9a: pid -2118670846 deleted route to 169.254.0.0/16
Feb 24 13:22:51 kube1w1 dhcpcd[555]: cali53a6a2bcd9a: adding route to 169.254.0.0/16
Feb 24 13:22:51 kube1w1 dhcpcd[555]: calic1dc03e67d4: pid -2118670846 deleted route to 169.254.0.0/16
Feb 24 13:22:51 kube1w1 dhcpcd[555]: calic1dc03e67d4: adding route to 169.254.0.0/16
Feb 24 13:22:51 kube1w1 dhcpcd[555]: cali049b4be5258: pid -2118670846 deleted route to 169.254.0.0/16
Feb 24 13:22:51 kube1w1 dhcpcd[555]: cali049b4be5258: adding route to 169.254.0.0/16
Feb 24 13:22:51 kube1w1 dhcpcd[555]: cali2522ec4beed: pid -2118670846 deleted route to 169.254.0.0/16
Feb 24 13:22:51 kube1w1 dhcpcd[555]: cali2522ec4beed: adding route to 169.254.0.0/16
Feb 24 13:22:51 kube1w1 dhcpcd[555]: cali859fc8c2dd6: pid -2118670846 deleted route to 169.254.0.0/16
Feb 24 13:22:51 kube1w1 dhcpcd[555]: cali859fc8c2dd6: adding route to 169.254.0.0/16
Where was it getting that 169.254.0.0/16 from? A quick ip a showed that all of my cali interfaces had an address in the 169.254.0.0/16 range. A link-local address like this is assigned when a DHCP server does not respond to a request, and here it was being assigned by dhcpcd.
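To spot the affected interfaces quickly, the check can be narrowed to just the cali interfaces. The helper below is a minimal sketch: the function name list_linklocal_cali is made up for this example, and it assumes the one-line output format of ip -4 -o addr show, where the second field is the interface name and the fourth is the address.

```shell
# Print "<interface> <address>" for every cali* interface that carries an
# IPv4 link-local (169.254.0.0/16) address.
# Reads `ip -4 -o addr show` style output on stdin, one address per line.
list_linklocal_cali() {
  awk '$2 ~ /^cali/ && $4 ~ /^169\.254\./ { print $2, $4 }'
}

# Usage on a node:
#   ip -4 -o addr show | list_linklocal_cali
```

If this prints nothing on a node, none of its cali interfaces currently hold a link-local address.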
After a discussion with my brother, he suggested removing dhcpcd and using systemd-networkd instead.
Replacing it fixed the problem. There were probably some options I could have set, like disabling the link local part of dhcpcd, but I chose the nuclear option. Mostly because I wanted to learn something new and felt like it was the better option.
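For reference, the less drastic route might look like the sketch below. It was not tried on this cluster, so treat it as an assumption rather than a confirmed fix. noipv4ll and denyinterfaces are options described in dhcpcd.conf(5); the function name add_dhcpcd_overrides is made up for this example.

```shell
# Append two dhcpcd.conf options to the given file, skipping any that are
# already present:
#   denyinterfaces cali*  - have dhcpcd ignore Calico's cali* interfaces
#   noipv4ll              - do not assign IPv4 link-local addresses
add_dhcpcd_overrides() {
  conf=$1
  for line in 'denyinterfaces cali*' 'noipv4ll'; do
    grep -qxF "$line" "$conf" 2>/dev/null || printf '%s\n' "$line" >> "$conf"
  done
}

# Usage from a root shell, followed by a restart of dhcpcd:
#   add_dhcpcd_overrides /etc/dhcpcd.conf
#   systemctl restart dhcpcd.service
```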
The following steps need to happen:
- Get your current, static IP.
- Configure systemd
- Disable dhcpcd
- Reboot
First, let's get your interface name, IP address and subnet mask using ip a:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether dc:a6:32:eb:3a:12 brd ff:ff:ff:ff:ff:ff
inet 192.168.0.217/24 brd 192.168.0.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fded::dea6:32ff:feeb:3a12/64 scope global dynamic mngtmpaddr noprefixroute
valid_lft 86396sec preferred_lft 14396sec
inet6 2601:681:4400:abe:dea6:32ff:feeb:3a12/64 scope global dynamic mngtmpaddr noprefixroute
valid_lft 86396sec preferred_lft 14396sec
inet6 fe80::dea6:32ff:feeb:3a12/64 scope link
valid_lft forever preferred_lft forever
Take note of the address. In the example above it is 192.168.0.217/24, and my Ethernet interface is eth0.
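If you would rather not read the values off by eye, they can be pulled out with awk. This is a small sketch with made-up helper names (first_ipv4_cidr and default_gateway); it assumes the one-line format of ip -4 -o addr show and the usual "default via <gateway> ..." form printed by ip route.

```shell
# Print the first IPv4 address in CIDR form.
# Reads `ip -4 -o addr show dev IFACE` output on stdin.
first_ipv4_cidr() {
  awk '$3 == "inet" { print $4; exit }'
}

# Print the gateway of the default route.
# Reads `ip -4 route show default` output on stdin.
default_gateway() {
  awk '$1 == "default" && $2 == "via" { print $3; exit }'
}

# Usage:
#   ip -4 -o addr show dev eth0 | first_ipv4_cidr
#   ip -4 route show default | default_gateway
```

The gateway is not part of the ip a output, but it is needed for the network file in the next step.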
Next, we need to create the systemd network file. It lives in /etc/systemd/network and should be named something like 10-example.network. The number can be anything you want and example can be anything you want, but the filename must end with .network. The contents will look something like this.
[Match]
Name=eth0
[Network]
DHCP=no
Address=192.168.0.217/24
Gateway=192.168.0.1
DNS=192.168.0.1
Domains=example.lan
Change eth0 to be the name of your interface and the other values to ones matching your environment.
Next, we need to configure resolved so it builds the resolv.conf file for us. Edit the file /etc/systemd/resolved.conf and add the following to the bottom, so that it ends up under the [Resolve] section:
LLMNR=no
DNSStubListener=no
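An alternative to editing the main file is a drop-in under /etc/systemd/resolved.conf.d, which resolved.conf(5) documents as a supported location. This is a different technique from the one used in this post, shown only as a sketch: the function name write_resolved_dropin and the file name 10-no-stub.conf are arbitrary choices for the example.

```shell
# Write a systemd-resolved drop-in carrying the two settings into the given
# directory (normally /etc/systemd/resolved.conf.d).
write_resolved_dropin() {
  dir=$1
  mkdir -p "$dir" &&
    printf '%s\n' '[Resolve]' 'LLMNR=no' 'DNSStubListener=no' > "$dir/10-no-stub.conf"
}

# Usage from a root shell:
#   write_resolved_dropin /etc/systemd/resolved.conf.d
```

A drop-in carries its own [Resolve] header, so it does not depend on where the lines land in the stock file.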
Now that the systemd network file and resolved are configured, we need to enable systemd-networkd, systemd-resolved and the timesyncd service. If you're managing your time synchronization outside of dhcpcd, then you don't need to worry about enabling timesyncd.
sudo systemctl enable systemd-networkd.service
sudo systemctl enable systemd-resolved.service
sudo systemctl enable systemd-timesyncd.service
Next, disable dhcpcd.
sudo systemctl disable dhcpcd.service
Now it is time to link resolv.conf and point it to the systemd-resolved version.
sudo ln -sf /run/systemd/resolve/resolv.conf /etc/resolv.conf
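A quick way to confirm the link points where it should is to read it back. The helper below is a sketch with a made-up name (resolv_conf_is_linked). Note that the target under /run only exists while systemd-resolved is running, so the link may dangle until the reboot.

```shell
# Succeed only if the given path is a symlink pointing at the resolv.conf
# generated by systemd-resolved.
resolv_conf_is_linked() {
  [ "$(readlink "$1")" = "/run/systemd/resolve/resolv.conf" ]
}

# Usage:
#   resolv_conf_is_linked /etc/resolv.conf && echo "resolv.conf is linked to systemd-resolved"
```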
Then, reboot. If all worked correctly, you should still be able to get into your Pi when it comes back up.
Script
Since I have 9 nodes in my cluster that I needed to do this to, I opted to write a script to do it for me. You will probably need to tweak it for your environment, so do not just run it blindly. My script expects the current IP address to be in the 192.168.0.0/24 subnet, a gateway of 192.168.0.1 and a DNS server of 192.168.0.190.
The script takes 2 parameters: -h, which is the DNS name of the node (for example kube1w1.example.com), and -n, which is the name of the node in Kubernetes (for example kube1w1).
The script drains the node, because it is going to be rebooted. It then connects to the node over SSH and gets the current IPv4 address (I am only dealing with static IPv4 in my cluster). Next it creates the systemd network file, configures the resolved service, enables the systemd services and disables dhcpcd. Finally it relinks resolv.conf, reboots the node and starts a ping. Watch the ping while the node restarts, then press Ctrl+C and the script will uncordon the node in Kubernetes.
The script isn't perfect. It didn't need to be, not for this. It worked, and it worked well.
#!/bin/bash
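# bash presets HOSTNAME to the local machine's name, so clear it (and HOST)
# before parsing to make a missing -n detectable.
HOST=""
HOSTNAME=""
while getopts h:n: option
do
  case $option
  in
    h) HOST=${OPTARG};;
    n) HOSTNAME=${OPTARG};;
    *) echo "Usage: $0 -h <dns-name> -n <node-name>" >&2; exit 1;;
  esac
done
if [ -z "$HOST" ] || [ -z "$HOSTNAME" ]; then
  echo "Usage: $0 -h <dns-name> -n <node-name>" >&2; exit 1
fi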
echo Draining Kubernetes
kubectl drain $HOSTNAME --ignore-daemonsets --delete-emptydir-data
echo Getting the current IP address
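# Match a one- to three-digit host part so addresses such as 192.168.0.5 are found too.
IP=$(ssh "$HOST" ip -4 a show dev eth0 | grep inet | grep -o '192\.168\.0\.[0-9]\{1,3\}/24' | grep -o '192\.168\.0\.[0-9]*' | head -n 1)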
echo Configuring $HOST to use $IP
cat <<EOF | ssh $HOST sudo tee /etc/systemd/network/10-example.network
[Match]
Name=eth0
[Network]
DHCP=no
Address=$IP/24
Gateway=192.168.0.1
DNS=192.168.0.190
Domains=example.com
EOF
echo Configuring resolved
cat <<EOF | ssh $HOST sudo tee /etc/systemd/resolved.conf
# This file is part of systemd.
#
# systemd is free software; you can redistribute it and/or modify it
# under the terms of the GNU Lesser General Public License as published by
# the Free Software Foundation; either version 2.1 of the License, or
# (at your option) any later version.
#
# Entries in this file show the compile time defaults.
# You can change settings by editing this file.
# Defaults can be restored by simply deleting this file.
#
# See resolved.conf(5) for details
[Resolve]
# Some examples of DNS servers which may be used for DNS= and FallbackDNS=:
# Cloudflare: 1.1.1.1 1.0.0.1 2606:4700:4700::1111 2606:4700:4700::1001
# Google: 8.8.8.8 8.8.4.4 2001:4860:4860::8888 2001:4860:4860::8844
# Quad9: 9.9.9.9 2620:fe::fe
#DNS=
#FallbackDNS=
#Domains=
#DNSSEC=no
#DNSOverTLS=no
#MulticastDNS=yes
#LLMNR=yes
#Cache=yes
#DNSStubListener=yes
#DNSStubListenerExtra=
#ReadEtcHosts=yes
#ResolveUnicastSingleLabel=no
LLMNR=no
DNSStubListener=no
EOF
echo Enabling services
ssh $HOST sudo systemctl enable systemd-networkd.service
ssh $HOST sudo systemctl enable systemd-resolved.service
ssh $HOST sudo systemctl enable systemd-timesyncd.service
echo Disabling dhcpcd
ssh $HOST sudo systemctl disable dhcpcd.service
echo Reconfiguring resolv.conf and rebooting
ssh $HOST 'sudo ln -sf /run/systemd/resolve/resolv.conf /etc/resolv.conf && sudo reboot'
ping $HOST
echo Reenabling node
kubectl uncordon $HOSTNAME
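The manual Ctrl+C step at the end of the script could be replaced by a polling loop. The sketch below is not part of the script above; wait_until is a hypothetical helper that retries any command a fixed number of times, and the attempt count and delay in the usage lines are arbitrary.

```shell
# Run a command repeatedly until it succeeds.
#   $1 = maximum number of attempts
#   $2 = seconds to sleep between attempts
#   remaining arguments = the command to run
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  while [ "$attempts" -gt 0 ]; do
    if "$@"; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep "$delay"
  done
  return 1
}

# Usage in place of the ping and uncordon lines:
#   sleep 30   # give the node time to actually go down first
#   wait_until 60 5 ssh -o ConnectTimeout=5 "$HOST" true && kubectl uncordon "$HOSTNAME"
```

Probing with SSH rather than ping means the node is only uncordoned once it accepts logins again, and a node that never comes back leaves it cordoned.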
Conclusion
This was a very annoying, intermittent problem that has plagued me for a while. Rebooting the node with the broken pod would almost always fix it for a while, and at times everything worked perfectly. That was probably because Calico was cleaning up after dhcpcd and getting rid of the bogus routes, which dhcpcd would then end up restoring, or something along those lines.
Anyway, thanks go to William Cooke for pointing me down a good path in configuring networking on Debian on Raspberry Pis.