Kubernetes Resource Optimization

The key to success is twofold: know what you are doing, and do it well. Most people fail at the first step.

Joel Crisp (me)

There has been a lot written about Kubernetes resource optimization. Kubernetes, at its core, is a resource-aware scheduler designed to pack as many tasks (pods) onto compute resources (nodes) as possible, in order to maximize resource utilization and thus minimize cost.

In order to do so, it uses several techniques, but the one to start with is setting resource “requests” and “limits” on a container:

resources:
  requests:           # what the scheduler reserves for the pod
    cpu: "100m"       # one tenth of a CPU core
    memory: "1Gi"
  limits:             # the ceiling the container may not exceed
    cpu: "150m"       # CPU use beyond this is throttled
    memory: "1.5Gi"   # memory use beyond this gets the container OOM-killed

The Kubernetes docs and many blogs have explained the meaning of these two settings in detail, along with other scheduler parameters such as node (anti-)affinity. The question I keep being asked, however, is: “to what values should I set these parameters?”

The answer has several parts.

Firstly: These parameters should be regarded as dynamic and reviewed constantly. The right values depend a great deal on the application itself, its load, and in some cases its runtime: Java, Go, Python, and Rust all have different mechanisms for managing memory. So these parameters are not “set-and-forget” but living configuration, constantly reviewed and updated.
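
As an illustration of how much the runtime matters, the JVM is a common trap: on recent JDKs it defaults to using only about a quarter of the container’s memory limit for its heap unless told otherwise. A minimal sketch, assuming a recent JDK and with hypothetical container and image names:

containers:
  - name: api                       # hypothetical container name
    image: example/api:1.0          # hypothetical image
    env:
      # Size the JVM heap relative to the container memory limit;
      # the remaining ~25% is left for metaspace, threads and
      # other off-heap allocations.
      - name: JAVA_TOOL_OPTIONS
        value: "-XX:MaxRAMPercentage=75.0"
    resources:
      requests:
        memory: "1Gi"
      limits:
        memory: "1Gi"               # the value the percentage applies to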

Secondly: Kubernetes prefers horizontal scaling of load to vertical scaling, so it is better to have many smaller pods than one or two large ones. Design your application to honor this preference, and scale by adding pod replicas rather than by adding resources to your current replicas, as sketched below. I would advise novice Kubernetes users to avoid the autoscalers at first, because enabling them requires understanding some complex interactions and configuration.
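
For example, here is a minimal sketch (with hypothetical names throughout) of a Deployment that spreads the same aggregate capacity, 1 CPU and 2Gi in total, across four small replicas instead of one large pod:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api                      # hypothetical service name
spec:
  replicas: 4                       # four small pods, not one large one
  selector:
    matchLabels:
      app: my-api
  template:
    metadata:
      labels:
        app: my-api
    spec:
      containers:
        - name: api
          image: example/api:1.0    # hypothetical image
          resources:
            requests:
              cpu: "250m"           # small replicas fit into spare
              memory: "512Mi"       # capacity on existing nodes
            limits:
              cpu: "250m"
              memory: "512Mi"

The four small replicas schedule more easily than one 1-CPU/2Gi pod, and losing a single pod costs a quarter of capacity rather than all of it.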

Thirdly: You obtain the values for these parameters by working backwards from the Service Level Objectives (SLOs). The SLO is a key part of any production service and defines the user-visible behavior of the service; here “user” may mean another service, the end-user of an API, and so on. SLOs should define the expected range of throughput, the latency targets, and the acceptable error rate, which will never be zero. These definitions may be as simple as 1000 requests per second (rps), a target maximum response time of 200ms with 95% of requests served under 160ms, and an error budget of 2%. Armed with the SLO, determining these parameters becomes a matter of measuring the metrics in the SLO and reducing the resource requests and limits until the SLO is just being met. That defines the minimum level of resource required to deliver the expected level of service.
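
To make that concrete, with illustrative numbers rather than real measurements: if load testing shows that a single replica with the requests above sustains about 250 rps while keeping 95% of responses under 160ms, then the 1000 rps target needs at least four replicas, plus one or two more for headroom, node failures, and rolling updates. Repeat the measurement whenever the application changes materially.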

Fourthly: Alerts should be configured on the metrics in the SLO, so that as the landscape of the application changes, the replica count, requests, and limits can be adjusted as appropriate. As I already mentioned, avoid the autoscalers until you have a good understanding of the application’s behavior and of the interactions between the horizontal pod autoscaler (HPA) and the cluster (node) autoscaler.
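
As a sketch of what such alerting can look like, here is a PrometheusRule (for clusters running the Prometheus Operator) covering the example SLO above. The metric names http_request_duration_seconds_bucket and http_requests_total and the job label my-api follow common Prometheus conventions but are assumptions; substitute whatever your service actually exports.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-api-slo                  # hypothetical name
spec:
  groups:
    - name: slo
      rules:
        # Fire when 95th-percentile latency over 5 minutes
        # exceeds the 160ms SLO target.
        - alert: HighP95Latency
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket{job="my-api"}[5m])) by (le)
            ) > 0.160
          for: 10m
        # Fire when the error rate exceeds the 2% error budget.
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{job="my-api",code=~"5.."}[5m]))
              / sum(rate(http_requests_total{job="my-api"}[5m])) > 0.02
          for: 10m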

Fifthly: I recommend keeping the difference between requests and limits to no more than 20%, and making them equal if you can, which earns the pod the Kubernetes “Guaranteed” Quality of Service (QoS) class.
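
For example, taking the earlier snippet and making the limits identical to the requests (which must hold for every container in the pod) puts the pod in the Guaranteed QoS class, making it the last candidate for eviction under node memory pressure:

resources:
  requests:
    cpu: "100m"
    memory: "1Gi"
  limits:
    cpu: "100m"       # identical to requests
    memory: "1Gi"     # => the pod qualifies as Guaranteed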

As your experience with Kubernetes and the behavior of your application matures, you will be able to add additional techniques and tools to better manage Kubernetes resources, but these are the rules of thumb to start with.

Footnote: For any internal service such as a CRUD API, you can back-calculate the SLO from the end-user (customer) SLO for the entire system using the service dependency graph, where the services closer to the end-user have to place expectations on the deeper services in order to fulfill their own SLOs.
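
As an illustration, with made-up numbers: if the customer-facing SLO allows 200ms at the 95th percentile and the front-end service spends roughly 60ms of that budget on its own work, the CRUD API it calls inherits a latency expectation of roughly 140ms at the same percentile, and its share of the error budget must likewise fit inside the end-to-end 2%.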