In our blog post on the Kubernetes base setup, published on LinkedIn and on our blog page, we talked about a set of add-ons and additional services that are either required or optional for Kubernetes to function properly and deliver the best outcomes.
But what happens if some of those components start failing? How do you troubleshoot them, what are the consequences of those services not working, and how might you fix them?
In this article, we will go through a few examples.
Let's start with the add-ons; a list of some of them can be found here:
https://github.com/kubernetes/kubernetes/tree/master/cluster/addons
We will go through a quick troubleshooting guide for the following add-ons:
- Cilium (networking add-on)
- CoreDNS
Cilium
Cilium is a networking, observability, and security solution with an eBPF-based dataplane. It provides a simple, flat Layer 3 network with the ability to span multiple clusters in either a native routing or overlay mode. It is Layer 7 protocol-aware and can enforce different sets of policies on L3-L7 using an identity-based security model that is decoupled from network addressing. Although there can be many issues with Cilium, some of the typical ones are:
Issues with component health
If Cilium encounters a problem that it cannot recover from, it will automatically report the failure state via cilium status, which is regularly queried by the Kubernetes liveness probe to automatically restart Cilium pods. If a Cilium pod is in the state CrashLoopBackoff, this indicates a permanent failure scenario.
If a particular Cilium pod is not in a running state, the status and health of the agent on that node can be retrieved by running cilium status in the context of that pod, for example:
kubectl -n kube-system exec cilium-example-pod -- cilium status
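If you are not sure which Cilium pod runs on which node, you can map them first; this sketch assumes the standard k8s-app=cilium label that the Cilium DaemonSet applies to its agent pods:
# Map Cilium agent pods to nodes
kubectl -n kube-system get pods -l k8s-app=cilium -o wide
# Query detailed agent health on the pod of the affected node
kubectl -n kube-system exec <cilium-pod-on-that-node> -- cilium status --verbose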
Connectivity Problems (e.g. node-to-node traffic is being dropped)
If endpoint-to-endpoint communication on a single node succeeds but communication fails between endpoints across multiple nodes, you have multiple options to troubleshoot this. Usually, you would want to run the cilium monitor command on the nodes of both the source and the destination and look for packet drops, as in the example below.
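A minimal sketch of that check; the pod names are placeholders for the Cilium agent pods on the source and destination nodes, and --type drop restricts the output to drop events:
# On the agent pod of the source node, watch for drop events
kubectl -n kube-system exec cilium-pod-on-source-node -- cilium monitor --type drop
# Repeat on the agent pod of the destination node
kubectl -n kube-system exec cilium-pod-on-destination-node -- cilium monitor --type drop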
Check the official Cilium documentation for an extensive list of troubleshooting cases.
CoreDNS
CoreDNS is a component that provides naming services for the cluster and is characterized by higher efficiency and a smaller resource footprint. In general, it is recommended to use CoreDNS instead of kube-dns to provide DNS services for the cluster.
The first step when troubleshooting CoreDNS issues is to check connectivity between your control plane and worker nodes and the CoreDNS service. It is exposed on the default cluster IP 10.96.0.10, so you can do a simple telnet from the nodes:
$ telnet 10.96.0.10 53
Trying 10.96.0.10...
Connected to 10.96.0.10.
Escape character is '^]'.
If this works, the next thing you might want to try is to deploy a Network Multitool container from which you can test DNS resolution with nslookup.
For setup details, check https://hub.docker.com/r/wbitt/network-multitool
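A minimal sketch of such a test, assuming the cluster can pull the image from Docker Hub:
# Start a throwaway pod with common network tools and open a shell in it
kubectl run multitool --rm -it --image=wbitt/network-multitool -- bash
# From inside the pod, resolve a cluster-internal and an external name
nslookup kubernetes.default.svc.cluster.local
nslookup google.com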
As with any other service, you would want to check the status of the CoreDNS pods, examine the logs, and pay attention to log messages - if there are no error messages, you would need to verify that the DNS service is up (using commands like 'kubectl get svc -A -l k8s-app=kube-dns' - note that CoreDNS keeps the legacy kube-dns label for compatibility).
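For example, on a kubeadm-provisioned cluster:
# Check that the CoreDNS pods are Running and not restarting
kubectl -n kube-system get pods -l k8s-app=kube-dns
# Inspect their logs for errors
kubectl -n kube-system logs -l k8s-app=kube-dns
# Verify the DNS service and its cluster IP
kubectl -n kube-system get svc kube-dns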
If this is true, the next thing to verify is whether DNS requests are actually received by the CoreDNS pods. To check this, enable the 'log' plugin by editing the ConfigMap of CoreDNS and adding a 'log' line in the Corefile section, like this:
apiVersion: v1
data:
  Corefile: |
    .:53 {
        log
        errors
        health {
            lameduck 5s
        }
...
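To apply the change, edit the ConfigMap in place; with the reload plugin that the default kubeadm Corefile includes, CoreDNS picks the change up automatically, otherwise restart the deployment:
kubectl -n kube-system edit configmap coredns
# Only needed if the Corefile does not include the reload plugin
kubectl -n kube-system rollout restart deployment coredns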
If the CoreDNS pods are receiving the queries, you should see them in the logs, for example:
[INFO] 10.244.1.4:33167 - 34605 "A IN google.com.ec2.internal. udp 41 false 512" NXDOMAIN qr,rd,ra 126 0.001959714s
Also, don't forget to check CoreDNS permissions - it should be able to list service- and endpoint-related resources to properly resolve service names. If it doesn't have sufficient permissions, you will see error messages like this:
2023-06-26T07:12:15.699431183Z [INFO] 10.96.144.227:52299 - 3686 "A IN serverproxytest.devolut.net.cluster.local. udp 52 false 512" SERVFAIL qr,aa,rd 145 0.000091221s
To check the mentioned permissions, you can get the current ClusterRole of system:coredns:
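For instance, with standard kubectl:
# Print the rules (API groups, resources, verbs) currently granted to CoreDNS
kubectl describe clusterrole system:coredns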
If any permissions are missing, you can edit the ClusterRole to add them (note that ClusterRoles are cluster-scoped, so no namespace flag is needed):
kubectl edit clusterrole system:coredns
Example insertion of EndpointSlices permissions:
...
- apiGroups:
  - discovery.k8s.io
  resources:
  - endpointslices
  verbs:
  - list
  - watch
...
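After saving, you can verify that the change took effect by impersonating the CoreDNS service account (assuming the default service account name coredns in kube-system):
# Should print "yes" once the ClusterRole grants the permission
kubectl auth can-i list endpointslices.discovery.k8s.io --as=system:serviceaccount:kube-system:coredns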
If you have tried all the above troubleshooting methods and are still not able to solve your issue, you can try a different network add-on.
Our engineering team consists of certified Kubernetes administrators who are experts at setting up and maintaining Kubernetes clusters and troubleshooting all sorts of related issues. If you are interested in how Kubernetes can help your organization scale faster while adding additional layers of availability and security, reach out to hello@devolut.io!