In our blog post on the Kubernetes base setup, published on LinkedIn and on our blog page, we talked about a set of add-ons and additional services that are either required or optional for Kubernetes to function properly and deliver the best outcomes.
But what happens if some of those components start failing? How do you troubleshoot them, what are the consequences of those services not working, and how might you fix them?
In this article, we will go through a few examples.
Let's start with the add-ons; a list of some of them can be found here:
https://github.com/kubernetes/kubernetes/tree/master/cluster/addons
We will go through a quick troubleshooting guide for the following add-ons:
- Cilium (networking add-on)
- CoreDNS
Cilium
Cilium is a networking, observability, and security solution with an eBPF-based dataplane. It provides a simple, flat Layer 3 network with the ability to span multiple clusters in either a native routing or overlay mode. It is Layer 7 protocol-aware and can enforce different sets of policies on L3-L7 using an identity-based security model that is decoupled from network addressing. Although there can be many issues with Cilium, some of the typical ones are:
Issues with component health
If Cilium encounters a problem that it cannot recover from, it will automatically report the failure state via cilium status, which is regularly queried by the Kubernetes liveness probe to automatically restart Cilium pods. If a Cilium pod is in the state CrashLoopBackoff, this indicates a permanent failure scenario.
If a particular Cilium pod is not in a running state, the status and health of the agent on that node can be retrieved by running cilium status in the context of that pod, for example:
kubectl -n kube-system exec cilium-example-pod -- cilium status
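If you are not sure which Cilium pod runs on which node, you can map them first; this sketch assumes the standard k8s-app=cilium label that the Cilium DaemonSet applies to its agent pods:
# Map Cilium agent pods to nodes
kubectl -n kube-system get pods -l k8s-app=cilium -o wide
# Query detailed agent health on the pod of the affected node
kubectl -n kube-system exec <cilium-pod-on-that-node> -- cilium status --verbose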
Connectivity Problems (e.g. node-to-node traffic is being dropped)
If endpoint-to-endpoint communication on a single node succeeds but communication fails between endpoints across multiple nodes, you have multiple options to troubleshoot this. Usually, you would want to run the cilium monitor command on the nodes of both the source and the destination and look for packet drops, as in the example below.
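A minimal sketch of that check; the pod names are placeholders for the Cilium agent pods on the source and destination nodes, and --type drop restricts the output to drop events:
# On the agent pod of the source node, watch for drop events
kubectl -n kube-system exec cilium-pod-on-source-node -- cilium monitor --type drop
# Repeat on the agent pod of the destination node
kubectl -n kube-system exec cilium-pod-on-destination-node -- cilium monitor --type drop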
Check the official Cilium documentation for an extensive list of troubleshooting cases.
CoreDNS
CoreDNS is a component that provides naming services for the cluster and is characterized by higher efficiency and a smaller resource footprint. In general, it is recommended to use CoreDNS instead of kube-dns to provide DNS services for the cluster.
The first step when troubleshooting CoreDNS issues is to check connectivity between your control plane and worker nodes and the CoreDNS service. It is exposed on the default cluster IP 10.96.0.10, so you can do a simple telnet from the nodes:
$ telnet 10.96.0.10 53
Trying 10.96.0.10...
Connected to 10.96.0.10.
Escape character is '^]'.
If this works, the next thing you might want to try is to deploy a Network Multitool container from which you can test DNS resolution with nslookup.
For setup details, check https://hub.docker.com/r/wbitt/network-multitool
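A minimal sketch of such a test, assuming the cluster can pull the image from Docker Hub:
# Start a throwaway pod with common network tools and open a shell in it
kubectl run multitool --rm -it --image=wbitt/network-multitool -- bash
# From inside the pod, resolve a cluster-internal and an external name
nslookup kubernetes.default.svc.cluster.local
nslookup google.com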
As with any other service, you would want to check the status of the CoreDNS pods, examine the logs, and pay attention to log messages - if there are no error messages, you would need to verify that the DNS service is up (using commands like 'kubectl get svc -A -l k8s-app=kube-dns' - note that CoreDNS keeps the legacy kube-dns label for compatibility).
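For example, on a kubeadm-provisioned cluster:
# Check that the CoreDNS pods are Running and not restarting
kubectl -n kube-system get pods -l k8s-app=kube-dns
# Inspect their logs for errors
kubectl -n kube-system logs -l k8s-app=kube-dns
# Verify the DNS service and its cluster IP
kubectl -n kube-system get svc kube-dns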
If this is true, the next thing to verify is whether DNS requests are actually received by the CoreDNS pods. To check this, enable the 'log' plugin by editing the ConfigMap of CoreDNS and adding a 'log' line in the Corefile section, like this:
apiVersion: v1
data:
  Corefile: |
    .:53 {
        log
        errors
        health {
            lameduck 5s
        }
...
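To apply the change, edit the ConfigMap in place; with the reload plugin that the default kubeadm Corefile includes, CoreDNS picks the change up automatically, otherwise restart the deployment:
kubectl -n kube-system edit configmap coredns
# Only needed if the Corefile does not include the reload plugin
kubectl -n kube-system rollout restart deployment coredns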
If the CoreDNS pods are receiving the queries, you should see them in the logs, for example:
[INFO] 10.244.1.4:33167 - 34605 "A IN google.com.ec2.internal. udp 41 false 512" NXDOMAIN qr,rd,ra 126 0.001959714s
Also, don't forget to check CoreDNS permissions - it should be able to list service- and endpoint-related resources to properly resolve service names. If it doesn't have sufficient permissions, you will see error messages like this:
2023-06-26T07:12:15.699431183Z [INFO] 10.96.144.227:52299 - 3686 "A IN serverproxytest.devolut.net.cluster.local. udp 52 false 512" SERVFAIL qr,aa,rd 145 0.000091221s
To check the mentioned permissions, you can get the current ClusterRole of system:coredns:
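For instance, with standard kubectl:
# Print the rules (API groups, resources, verbs) currently granted to CoreDNS
kubectl describe clusterrole system:coredns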
If any permissions are missing, you can edit the ClusterRole to add them (note that ClusterRoles are cluster-scoped, so no namespace flag is needed):
kubectl edit clusterrole system:coredns
Example insertion of EndpointSlices permissions:
...
- apiGroups:
  - discovery.k8s.io
  resources:
  - endpointslices
  verbs:
  - list
  - watch
...
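After saving, you can verify that the change took effect by impersonating the CoreDNS service account (assuming the default service account name coredns in kube-system):
# Should print "yes" once the ClusterRole grants the permission
kubectl auth can-i list endpointslices.discovery.k8s.io --as=system:serviceaccount:kube-system:coredns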
If you have tried all the above troubleshooting methods and are still not able to solve your issue, you can try a different network add-on.
Our engineering team consists of certified Kubernetes administrators who are experts at setting up and maintaining Kubernetes clusters and troubleshooting all sorts of related issues. If you are interested in how Kubernetes can help your organization scale faster while adding additional layers of availability and security, reach out to hello@devolut.io!