Devolution Blog

Troubleshooting Slow Performance in AWS Cloud Infrastructure and EKS (AWS managed Kubernetes)

Despite the scalability and robustness of cloud infrastructure, you may still encounter slow performance. In an AWS (Amazon Web Services) environment running Kubernetes (EKS), these issues can stem from a variety of causes, from inadequate resource allocation to network bottlenecks. Here are some typical reasons for slow performance and what we as DevOps engineers typically investigate when troubleshooting:

**1. Insufficient or improper resource allocation**

Insufficient resources (CPU, memory, storage, and network) can seriously hamper the performance of applications. If EC2 instances or EKS pods frequently hit their resource limits, this could be a sign that more resources need to be allocated.

Also, it's important to select the right instance type for your workload. For example, CPU-intensive workloads should be on CPU-optimized instances, while memory-intensive workloads should be on memory-optimized instances.

Another point to consider is the ongoing debate about whether or not you should set CPU limits for your apps at all: compare https://home.robusta.dev/blog/stop-using-cpu-limits with https://dnastacio.medium.com/why-you-should-keep-using-cpu-limits-on-kubernetes-60c4e50dfc61

At Devolut, we set CPU/RAM requests and limits via Helm charts: we often recommend (and develop) a generic Helm chart that contains the Kubernetes resources for all of your applications, and then control, via a small values file, what each application uses and templates into its final Helm release.
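
A quick way to spot workloads that are running with no resource settings at all is a short audit script. Below is a minimal sketch using the official Python Kubernetes client; it assumes you have a working kubeconfig and the `kubernetes` package installed, and simply prints containers with missing requests or limits, since those are the workloads whose scheduling and eviction behaviour is hardest to predict.

```python
# Minimal sketch: list containers that have no CPU/memory requests or limits set.
# Assumes `pip install kubernetes` and a valid kubeconfig.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running inside a pod
v1 = client.CoreV1Api()

for pod in v1.list_pod_for_all_namespaces().items:
    for container in pod.spec.containers:
        resources = container.resources
        missing = []
        if not resources.requests:
            missing.append("requests")
        if not resources.limits:
            missing.append("limits")
        if missing:
            print(f"{pod.metadata.namespace}/{pod.metadata.name} "
                  f"({container.name}): no {', '.join(missing)} set")
```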

**2. Network issues**

Cloud performance can be significantly impacted by network latency, packet loss, or bandwidth limitations. For instance, if your application relies heavily on data transfer between different availability zones (AZs), inter-AZ data transfer costs and latency might become a bottleneck.

For more on connectivity and networking issues, please read our blog post: https://www.devolut.io/blog/setting-connectivity-and-troubleshooting-networking-issues-between-services-running-in-aws-eks-and-external
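
One quick check we sometimes run is mapping pods to the availability zone of the node they are scheduled on, to spot services whose replicas are spread across AZs and therefore generate cross-AZ traffic. The sketch below uses the Python Kubernetes client and the standard `topology.kubernetes.io/zone` node label; grouping by the `app` label is an assumption and may need to match your own labelling convention.

```python
# Minimal sketch: group pods by the "app" label (assumption) and report which
# availability zones each group is running in.
from collections import defaultdict
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# The zone label is set by the cloud provider on every EKS node.
node_zone = {
    node.metadata.name: node.metadata.labels.get("topology.kubernetes.io/zone", "unknown")
    for node in v1.list_node().items
}

zones_per_app = defaultdict(set)
for pod in v1.list_pod_for_all_namespaces().items:
    labels = pod.metadata.labels or {}
    app = labels.get("app", pod.metadata.name)
    if pod.spec.node_name:
        zones_per_app[app].add(node_zone.get(pod.spec.node_name, "unknown"))

for app, zones in zones_per_app.items():
    if len(zones) > 1:
        print(f"{app} runs across zones: {sorted(zones)}")
```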

**3. Storage I/O Performance**

The I/O performance of your storage solution can significantly impact the overall performance of your application. AWS offers several types of EBS volumes with varying performance characteristics. If you're seeing slow disk I/O, consider whether a higher-performance (but also higher-cost) storage type may be needed. Also make sure you are paying for the right storage type and that you don't overprovision your disk capacity, since you might end up paying for space you don't use.
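
If you suspect a volume is I/O-bound, its CloudWatch metrics are the first place to look. Below is a minimal boto3 sketch that pulls the average `VolumeQueueLength` for a single (hypothetical) volume ID over the last few hours; a persistently high queue length usually means the volume needs more IOPS or a faster type, which you could then apply with `ec2.modify_volume` (for example, moving gp2 to gp3).

```python
# Minimal sketch: fetch average EBS queue length from CloudWatch.
# Assumes boto3 is installed and AWS credentials/region are configured.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
volume_id = "vol-0123456789abcdef0"  # hypothetical volume ID

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EBS",
    MetricName="VolumeQueueLength",
    Dimensions=[{"Name": "VolumeId", "Value": volume_id}],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=3),
    EndTime=datetime.now(timezone.utc),
    Period=300,
    Statistics=["Average"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 2))
```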

**4. Misconfiguration**

Misconfigurations can significantly impact the performance of AWS and Kubernetes services. For example, an auto-scaling group that is not configured properly can lead to performance issues: it could scale worker nodes up or down at moments that are not optimal, or even dangerous, for your system, so pay attention to how your scaling groups and their parameters are configured. Similarly, Kubernetes has many tunable parameters, like requests and limits, that can lead to poor performance if misconfigured.
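
As a starting point for an auto-scaling review, it helps to dump the sizing and cooldown settings of every Auto Scaling group in one place, so suspicious values (max equal to min, a zero cooldown, a tiny health-check grace period) stand out. A minimal boto3 sketch:

```python
# Minimal sketch: print the key sizing/cooldown parameters of all Auto Scaling groups.
import boto3

autoscaling = boto3.client("autoscaling")

paginator = autoscaling.get_paginator("describe_auto_scaling_groups")
for page in paginator.paginate():
    for group in page["AutoScalingGroups"]:
        print(
            group["AutoScalingGroupName"],
            f"min={group['MinSize']}",
            f"max={group['MaxSize']}",
            f"desired={group['DesiredCapacity']}",
            f"cooldown={group['DefaultCooldown']}s",
            f"grace={group['HealthCheckGracePeriod']}s",
        )
```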

**5. Inefficient application code**

Sometimes the application code itself is the source of performance problems. Inefficient algorithms, a lack of caching, and unnecessary data processing can make your application slow. Regular code profiling and performance testing can help detect such issues. Here you will need to work with the application developers to identify and resolve them; as DevOps engineers, we should provide good tooling and frameworks that help detect application problems.
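
For Python services, the standard library's `cProfile` is often enough to point at the hot spots before reaching for heavier APM tooling. A minimal sketch, with a hypothetical `slow_endpoint` function standing in for real application code:

```python
# Minimal sketch: profile a single code path with the standard library.
import cProfile
import pstats


def slow_endpoint():
    # Hypothetical handler standing in for the real application code.
    total = 0
    for i in range(1_000_000):
        total += i * i
    return total


profiler = cProfile.Profile()
profiler.enable()
slow_endpoint()
profiler.disable()

# Print the ten functions with the highest cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```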

**6. Cloud service limits**

AWS applies certain limits to its services (as does any other cloud provider). For instance, there are limits on the number of EC2 instances you can launch, the number of EBS volumes you can use, and so on. If you hit these limits, your application's performance can be throttled. For the majority of services, these limits can be increased, which requires contacting AWS support through their support center and asking for an increase. A good practice (to avoid hitting limits in production and then waiting for the cloud provider to raise them) is to monitor actual vs. allocated usage and set alerts when a threshold is reached. One example is the number of Elastic IP addresses: say you are currently using 25 of them and the limit is set to 50; as you spin up more services that require a public IP and get close to 50, an alert should notify you so that you can request a quota increase in time.
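
The Service Quotas API makes this kind of usage-versus-limit check easy to script. Below is a minimal boto3 sketch for the Elastic IP example; the quota code `L-0263D0A3` (EC2-VPC Elastic IPs) is our assumption and worth verifying in your account with `list_service_quotas` before wiring it into an alert.

```python
# Minimal sketch: compare Elastic IPs in use against the account quota.
import boto3

ec2 = boto3.client("ec2")
quotas = boto3.client("service-quotas")

in_use = len(ec2.describe_addresses()["Addresses"])
# Quota code for "EC2-VPC Elastic IPs" is an assumption; verify it in your account.
quota = quotas.get_service_quota(ServiceCode="ec2", QuotaCode="L-0263D0A3")
limit = quota["Quota"]["Value"]

usage = in_use / limit if limit else 0
print(f"Elastic IPs: {in_use}/{int(limit)} in use ({usage:.0%})")
if usage >= 0.8:
    print("Above 80% of the quota, time to request an increase.")
```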

**7. Multi-tenancy interference**

Cloud environments are typically multi-tenant, meaning other users' workloads running on the same physical hardware can affect your application's performance. If you suspect such interference, consider using dedicated instances. This is not easy to detect, but you need to be aware that the cloud is (usually) a shared resource and that you shouldn't expect many of the underlying resources to be dedicated only to you and your business.
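
If you want an overview of which instances actually run on shared hardware, the tenancy field in the EC2 API tells you. A minimal boto3 sketch:

```python
# Minimal sketch: print each instance's tenancy
# ("default" = shared hardware, "dedicated"/"host" = dedicated hardware).
import boto3

ec2 = boto3.client("ec2")
paginator = ec2.get_paginator("describe_instances")
for page in paginator.paginate():
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            print(instance["InstanceId"], instance["Placement"]["Tenancy"])
```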

When troubleshooting performance issues, here's a general approach:

- **Monitor**: Use tools like AWS CloudWatch, AWS X-Ray, and Kubernetes metrics to monitor your infrastructure and applications. Set up alerts for resource utilization and performance anomalies (see the sketch after this list).

- **Analyze logs**: Examine logs from your applications and AWS services using CloudWatch Logs or third-party solutions like Loki, ELK, DataDog (we prefer open-source solutions like Loki & ELK).

- **Test**: Replicate the issue in a controlled environment and isolate potential causes by systematically testing different components (you do have dev / staging env, don't you?).

- **Optimize**: Once the issue is identified, optimize the misbehaving component. This could involve code changes, infrastructure changes, or configuration updates.

- **Review**: Regularly review your system performance and infrastructure configuration. This helps you anticipate and prevent performance issues before they impact your users.  
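
As an example of the "Monitor" step above, a single CloudWatch alarm on worker-node CPU already catches a lot. The boto3 sketch below uses a hypothetical Auto Scaling group name and SNS topic ARN; adjust both to your environment.

```python
# Minimal sketch: alarm when average worker-node CPU stays above 80% for 15 minutes.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="eks-node-cpu-high",  # hypothetical alarm name
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "eks-workers"}],  # hypothetical ASG
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:ops-alerts"],  # hypothetical SNS topic
)
```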

And remember, each application and infrastructure is unique, so these are starting points rather than exhaustive solutions. It's essential to build a deep understanding of your specific setup and use a systematic approach to identify and resolve performance issues. We at Devolut have extensive experience in setting up and maintaining cloud infrastructures of all sizes, so reach out to hello@devolut.io to hear more.
