Introducing our new post series! Throughout June, we will walk you through DevOps troubleshooting sessions where we had specific problems to solve, either for our clients or on our internal stack.
In this post, we will explain the steps and options to check when establishing network connectivity between services running in an EKS Kubernetes cluster and external services.
A typical use case we often encounter is that applications deployed in k8s clusters need external connectivity to services outside that same cluster (communication between services within a k8s cluster is a topic in itself, covering areas such as service objects, namespace separation, networking policies, and many more).
Those services are typically databases and APIs, although they can be any type of service. The EKS setup that we recommend to our clients consists of a minimum of 3 worker nodes, each deployed in a separate AZ (Availability Zone) and in so-called private subnets. This means that the worker nodes do not have direct access to the public internet via public or Elastic IPs. Instead, all outgoing traffic from those nodes towards the external/public internet needs to go through a NAT gateway in the respective VPC. One can also choose to have multiple NAT gateways (e.g. one per subnet), but in general this would be overkill, and it also does not help with managing outgoing public IP addresses. If a single NAT gateway is used for all outgoing traffic, then all of it will be presented with one public IP address (the NAT gateway's IP address). This allows 3rd party vendors and services to whitelist/allow only that specific IP address in their firewalls.
Besides this, they might have other types of protection (e.g. requiring applications to present an access token), but having the NAT IP whitelisted is mandatory.
From the EKS/k8s side, a few things need to happen to allow outgoing traffic: the security group attached to the worker nodes needs outgoing rules that allow it (e.g. allow only outgoing traffic over port 443, or allow all outgoing traffic), and networking policies (if any) also need an "egress" rule that allows outgoing traffic towards the "world" or towards specific IP blocks, ports, and protocols.
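As a sketch, the security group side of this could be expressed in Terraform roughly as follows (the resource names and the `aws_security_group.workers` reference are illustrative assumptions, not taken from a real setup):

```hcl
# Allow worker nodes to open outbound HTTPS connections.
# "aws_security_group.workers" is a hypothetical reference to the
# security group attached to the EKS worker nodes.
resource "aws_security_group_rule" "workers_egress_https" {
  type              = "egress"
  from_port         = 443
  to_port           = 443
  protocol          = "tcp"
  cidr_blocks       = ["0.0.0.0/0"]
  security_group_id = aws_security_group.workers.id
}
```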
Here is an example:
The client is running a specific application in an EKS/k8s cluster within the 'production' namespace, and this app (among other functions) requires access to an external MSSQL database operated by a 3rd party vendor. In this scenario, we would share the public IP of the NAT gateway with the vendor, enabling them to configure appropriate firewall rules on their side to allow access from our cluster. At this stage, we are focusing solely on establishing network connectivity, without considering authentication or credentials.
To enable the application to successfully connect to the external database, several configurations must be in place. The AWS security group associated with the EKS cluster should have a firewall rule allowing outgoing traffic over port 1433 (the default port for MSSQL). Additionally, the Kubernetes networking policy used by the application should include an egress entry that allows pods to communicate outside the cluster over port 1433.
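A minimal sketch of such an egress entry could look like the manifest below; the policy name and the `app: my-app` pod label are hypothetical placeholders, and in practice you would usually narrow the destination CIDR to the vendor's IP range:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-mssql-egress   # illustrative name
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: my-app            # hypothetical app label
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0  # ideally narrowed to the vendor's IP range
      ports:
        - protocol: TCP
          port: 1433         # default MSSQL port
```

Note that a policy like this also implicitly denies any egress it does not list for the selected pods, so DNS egress may need its own rule depending on your setup.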
If the application fails to establish a successful connection with the external database, several checks need to be performed. For testing purposes, we recommend setting up a test or example pod within the same namespace as the application pod. This test pod should include basic Linux networking tools such as ping, telnet, traceroute, and curl/wget. From this test pod, you can perform the following checks: ensure that the database domain (if applicable) is resolvable from your pods, and verify whether you can establish a telnet connection to the destination IP and port.
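The same DNS and TCP checks can also be scripted and run from the test pod. A minimal sketch using only the Python standard library is shown below; the host and port are placeholders for your actual database endpoint:

```python
import socket


def resolve(host):
    """Return the IP address a hostname resolves to, or None on failure."""
    try:
        return socket.gethostbyname(host)
    except socket.gaierror:
        return None


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Placeholders: substitute your vendor's database endpoint.
    db_host, db_port = "db.example-vendor.com", 1433
    print("DNS:", db_host, "->", resolve(db_host))
    print("TCP reachable:", can_connect(db_host, db_port))
```

This mirrors the manual nslookup/telnet steps: a `None` result points at DNS, while a failed connect with working DNS points at firewalls, security groups, or networking policies.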
If these checks fail, further investigation is required. You should examine the networking policies, confirm that the NAT gateway IP provided to the third-party vendor is correct (you can verify this by performing a ping from your pod to a public IP of a server that you have access to, such as an EC2 instance, and use tcpdump to check the incoming IP address for ICMP/pings from your pod). Additionally, double-check that the security group is correctly configured to allow outgoing traffic. Finally, confirm with your vendor that they have whitelisted your traffic correctly.
If all the above checks are correct and issues still persist, it is likely that the vendor has additional security measures in place that are blocking your traffic.
Other examples may involve your application needing to communicate with an external API service over port 443 or perform uploads and downloads, such as using the SFTP protocol for large files. Troubleshooting these scenarios follows the same pattern as described in the first example. The general approach is to start from your pod and trace the flow through the entire process. It's important to keep in mind that there are often layered security implementations on the other side, so you must ensure that your configurations are correct. Additionally, providing valid proof to external parties can help reduce the time they spend troubleshooting their end. For instance, sharing .pcap files from Wireshark, which demonstrate that packets have left your network but are being dropped on the other side, can be sufficient evidence.
While networking has evolved with the rise of public clouds, the main principles remain intact. Many times, even when running fully on a public cloud, you will still need to engage in classic troubleshooting sessions. Therefore, we encourage all our employees to have a solid networking background. The principles of the OSI model will remain relevant, regardless of the extent of automation and the level of control we entrust to public clouds.
If you would like to hear more about how we deal with more complex networking setups, reach out to hello@devolut.io, and make sure to stay up to date with the latest posts from our series!