This blog shows how we leveraged the Kubernetes cluster autoscaler with the Amazon EKS service to build a cost-effective solution for on-demand deployment of microservices in a dynamically scaling environment. Along with a detailed explanation of the use case, it also provides a step-by-step guide to enabling the cluster autoscaler in an existing Kubernetes cluster on AWS.
Use Case
In one of our recent customer engagements, we were entrusted with hosting the customer’s flagship product as a self-service SaaS solution. The product was initially designed for an on-premises deployment. As product traction increased, our customer needed a quick and simple setup to let their prospects experience the product’s features before engaging fully.
We at Cognitree have addressed such requirements for several of our customers. We have provided the solution architecture and also helped implement the solution and deploy it on the customer’s infrastructure. In most such use cases, these are the common constraints:
- Self-service: The user signup and setup process should be completely automated with no human intervention
- Robust setup: Build a simple deployment procedure that is easy to build and test. Also, the ability to get the user going with the trial deployment within a few minutes of signing up.
- Cost Optimization: Optimize for costs. Cap the costs incurred per trial signup
- Observability: Enable business owners to get complete visibility of the trials, including usage patterns and costs. Also, enable operations to continuously assess the deployments to ensure a reliable and smooth product experience.
The trial deployment had to be set up on-demand whenever a user signed up, and the lifetime of a trial deployment was a week or so. On expiry of the trial, the resources allocated to the deployment had to be reclaimed. To cater to such elastic infrastructure needs, we decided to leverage a cloud infrastructure such as Amazon AWS. AWS is the leading provider of cloud infrastructure, and we have successfully launched solutions on it for many of our customers in the past, so it was our natural choice.
Kubernetes easily addresses a couple of these challenges: ensuring optimal resource utilization, managing service discovery between the microservices, and exposing a few services for external communication. Kubernetes also brings a rich collection of deployment patterns, constructs, integrations and best practices that together help deploy, manage and scale containerized applications.
Overall, we needed Kubernetes to help us automate the deployments and AWS to provide the elastic infrastructure required on demand. AWS provides a hosted solution for the Kubernetes control plane through its EKS offering, which seemed apt to leverage.
To address the elasticity requirements, we picked the cluster autoscaler, which is a part of the Kubernetes autoscaler project.
Kubernetes cluster autoscaler
In k8s, autoscaling can be implemented via a controller which manages the number of running objects based on certain events. K8s supports various types of autoscalers, which can be used individually or combined for a more robust autoscaling solution.
- Horizontal Pod Autoscaler (HPA) – A controller which automatically adjusts the number of replicas in a Deployment, ReplicaSet or ReplicationController based on the observed CPU utilization of the pods (an illustrative example is shown after this list).
- Vertical Pod Autoscaler (VPA) – VPA automatically sets container resource requirements (CPU and memory) in a Pod and dynamically adjusts them at runtime, based on current resource availability, historical utilization data and real-time events.
- Cluster autoscaler (CA) – The cluster autoscaler automatically resizes the Kubernetes cluster when there are insufficient resources for pods waiting to be scheduled or when there are underutilized nodes in the cluster.
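For illustration only (the deployment name "web" and the thresholds below are hypothetical, and our use case relies on the cluster autoscaler rather than the HPA), an HPA can be created imperatively with kubectl:

kubectl autoscale deployment web --cpu-percent=70 --min=1 --max=5

This asks the HPA controller to keep the average CPU utilization of the pods around 70%, scaling the deployment between 1 and 5 replicas.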
For this use case, the Kubernetes cluster autoscaler was a perfect fit. It helped us launch k8s worker nodes on demand, and we were able to meet all the constraints discussed above that most of our customers lay out. The Kubernetes cluster autoscaler is also supported by many cloud providers including AliCloud, Azure, AWS, BaiduCloud etc., and we successfully leveraged it for our use case on AWS.
The steps mentioned below set up the cluster autoscaler on an EKS-based k8s cluster. You will be able to observe your Amazon EKS cluster scaling automatically based on need.
Characteristics of Kubernetes cluster autoscaler
- The cluster autoscaler only issues scaling requests to the ASG. It respects the current minimum and maximum values of the targeted ASG and only adjusts the desired capacity. There is no need to explicitly configure the capacities in the cluster autoscaler itself.
- The cluster autoscaler drains a worker node before removing it from the cluster. If the CA decides to de-provision a node with pods running on it, those pods are rescheduled onto other nodes as the node is drained.
- A configurable latency can be introduced in scaling the cluster up and down. For scale-up, the default value is 10s, which allows the CA to ignore unschedulable pods until they are of a certain age. For scale-down, the default is 10m, which is the duration the CA waits before removing an underutilized node. These values can be tuned using the appropriate flags (see the illustrative snippet after this list).
- The cluster autoscaler also supports multiple ASGs in a single cluster. You can refer to this link for more information.
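As an illustrative sketch of the tuning mentioned above (flag names are from the upstream cluster autoscaler; defaults can vary between releases, so verify them against your version), the relevant timings can be passed as additional arguments on the cluster-autoscaler container command:

- --scan-interval=10s                # how often the CA evaluates the cluster for scale-up
- --scale-down-unneeded-time=10m     # how long a node must be underutilized before it is removed
- --scale-down-delay-after-add=10m   # cool-down period after a scale-up before scale-down is evaluated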
Acronyms
- k8s: Kubernetes
- AWS: Amazon Web Services
- EKS: Amazon Elastic Kubernetes Service
- ASG: AWS Auto Scaling group
- CA: Cluster Autoscaler
- IAM: Identity and Access Management
Setting up the Kubernetes cluster autoscaler for EKS
As a prerequisite, it is assumed that you already have an EKS cluster. If not, you can refer to the documentation here to set up an EKS cluster. In our case, we had an ASG to manage the worker nodes in the cluster. Follow the steps below to set up the Kubernetes cluster autoscaler.
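If you are starting from scratch, a minimal sketch using eksctl could look like the following (the cluster name, region and node counts are placeholders; adjust them to your needs and verify the flags against your eksctl version):

eksctl create cluster \
  --name test-eks-cluster \
  --region ap-south-1 \
  --nodegroup-name workers \
  --nodes 1 \
  --nodes-min 1 \
  --nodes-max 4 \
  --asg-access

The --asg-access flag asks eksctl to attach the autoscaling IAM permissions needed by the cluster autoscaler; depending on your eksctl version you may still need to add the permissions and ASG tags manually, as described in the next sections.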
Add autoscaling permissions to the K8s worker nodes
Create a new IAM policy as shown below and attach it to the instance profile role of the k8s worker nodes. The policy is attached to the worker nodes because the cluster autoscaler may get launched on any one of them and needs access to the AWS resources, i.e., ASG and EC2, to perform the autoscaling.
{ "Version": "2012-10-17", "Statement": [ { "Action": [ "autoscaling:DescribeAutoScalingGroups", "autoscaling:DescribeAutoScalingInstances", "autoscaling:DescribeLaunchConfigurations", "autoscaling:DescribeTags", "autoscaling:SetDesiredCapacity", "autoscaling:TerminateInstanceInAutoScalingGroup", "ec2:DescribeLaunchTemplateVersions" ], "Resource": "*", "Effect": "Allow" } ] }
Tag your Worker node’s ASG
Add the following tags to the k8s worker nodes’ ASG. This helps the cluster autoscaler auto-discover the ASG. Substitute your EKS cluster’s name in the placeholder marked as <cluster-name> below. The value of the tags is irrelevant for the functionality.
k8s.io/cluster-autoscaler/<cluster-name>: owned
k8s.io/cluster-autoscaler/enabled: true
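If you prefer the command line over the console, the same tags can be added with the AWS CLI (the ASG name below is a placeholder):

aws autoscaling create-or-update-tags \
  --tags ResourceId=<worker-node-asg>,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/<cluster-name>,Value=owned,PropagateAtLaunch=true \
         ResourceId=<worker-node-asg>,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/enabled,Value=true,PropagateAtLaunch=true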
Deploying the Kubernetes cluster autoscaler
Fetch the latest cluster-autoscaler deployment spec from this link and edit the downloaded specification file as per the instructions below:
- Set the cluster autoscaler release to the version that matches your cluster’s Kubernetes major and minor version. If your cluster’s k8s version is 1.15, find the latest cluster autoscaler version on the releases page that begins with 1.15 (a quick way to check your cluster’s version is shown after this list).
- The cluster autoscaler provides a way to ensure that pods with critical functionality do not get disrupted by scaling activities. Such pods need to carry the annotation cluster-autoscaler.kubernetes.io/safe-to-evict="false". The cluster autoscaler itself is one such pod with critical functionality and needs to be annotated. Add this under the "annotations" section for the pod.
- Add the following options to the container command in the specification:
  - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<YOUR CLUSTER NAME>
    This enables auto-discovery of ASGs carrying these tags. The value of the tags is irrelevant. Replace <YOUR CLUSTER NAME> with your EKS cluster’s name.
  - --balance-similar-node-groups
    In case your EKS cluster has multiple ASGs, set this flag to balance the number of instances across all the ASGs providing similar instance types.
  - --skip-nodes-with-system-pods=false
    By default, the cluster autoscaler will not terminate nodes running pods in the kube-system namespace. If you want to override that, set the value of this flag to false.
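To check which Kubernetes version your cluster is running (as needed for the first instruction above), you can query EKS with the AWS CLI; replace <cluster-name> with your EKS cluster’s name:

aws eks describe-cluster --name <cluster-name> --query "cluster.version" --output text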
If your EKS cluster name is "test-eks-cluster" and your k8s version is 1.15, your specification should look something like the one below:
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
  ..
spec:
  ..
  template:
    ..
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
        - image: k8s.gcr.io/cluster-autoscaler:v1.15.6
          name: cluster-autoscaler
          resources:
            limits:
              cpu: 100m
              memory: 300Mi
            requests:
              cpu: 100m
              memory: 300Mi
          command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --expander=least-waste
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/test-eks-cluster
            - --balance-similar-node-groups
            - --skip-nodes-with-system-pods=false
Using kubectl, deploy the cluster autoscaler into your cluster:
kubectl apply -f cluster-autoscaler-autodiscover.yaml
Validate the setup
Assert that the cluster autoscaler pod started successfully.
kubectl get po -A | grep cluster-autoscaler
The above command should show exactly 1 pod in the kube-system namespace. This pod should be in the “Running” state.
kube-system cluster-autoscaler-fd6db7bf9-q4sk7 1/1 Running 0 3m28s
Once the pod is in the Running state, you can follow the cluster autoscaler logs using the command:
kubectl -n kube-system logs -f deployment.apps/cluster-autoscaler
The logs include the cluster autoscaler startup and its periodic evaluation of each node’s utilization. The following is an example from when the autoscaler starts up.
I0409 09:31:18.426105 1 static_autoscaler.go:132] Starting main loop
I0409 09:31:18.443276 1 utils.go:564] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop
....
I0409 09:31:18.501701 1 utils.go:521] Skipping ip-192-168-97-136.ap-south-1.compute.internal - node group min size reached
I0409 09:31:18.501781 1 static_autoscaler.go:370] Scale down status: unneededOnly=false lastScaleUpTime=2020-04-09 08:50:18.045062303 +0000 UTC m=+19.642557212 lastScaleDownDeleteTime=2020-04-09 08:50:18.045062387 +0000 UTC m=+19.642557314 lastScaleDownFailTime=2020-04-09 08:50:18.045062814 +0000 UTC m=+19.642557723 scaleDownForbidden=false isDeleteInProgress=false
I0409 09:31:18.501806 1 static_autoscaler.go:380] Starting scale down
I0409 09:31:18.501840 1 scale_down.go:659] No candidates for scale down
Verify the autoscaling
To test the autoscaling feature, we will launch a deployment with resource requests and observe that new nodes get provisioned into the k8s cluster when required, and released from the cluster when underutilized.
Let’s get started with it:
- Fetch a sample nginx deployment config from this link.
- Change the replicas value to 1
- Add CPU and memory limits and requests to the deployment. The values should be sized according to the instance type launched in the ASG. For example, if you launch a t3.medium instance (2 vCPU, 4 GiB memory), the following can be your spec.
spec:
  containers:
  - name: nginx
    image: nginx:1.14.2
    resources:
      limits:
        cpu: 1500m
        memory: 2500Mi
      requests:
        cpu: 1200m
        memory: 2500Mi
    ports:
    - containerPort: 80
- Use kubectl to apply the configuration and launch the deployment.
kubectl apply -f nginx.yaml
- Check if the cluster scales up. If not, edit the replicas value again, change it to 2, and re-apply the configuration with kubectl.
kubectl apply -f nginx.yaml
- You will notice that the number of nodes in your cluster increases as the cluster scales up. You can also see the scale-up in the logs. The following is a sample log from our setup.
I0409 12:18:25.015589 1 static_autoscaler.go:132] Starting main loop
...
I0409 12:18:25.101631 1 scale_up.go:263] Pod default/nginx-deployment-ff6df446-6m5nw is unschedulable
I0409 12:18:25.101705 1 scale_up.go:300] Upcoming 0 nodes
I0409 12:18:25.101782 1 waste.go:57] Expanding Node Group platform-12-EKSStack-VM8YXAB3PSPQ-EKSWorker-JDRDZZHSQ4DG-NodeAutoScalingGroup-YMZKZ4PATVAN would waste 40.00% CPU, 35.65% Memory, 37.83% Blended
I0409 12:18:25.101803 1 scale_up.go:423] Best option to resize: platform-12-EKSStack-VM8YXAB3PSPQ-EKSWorker-JDRDZZHSQ4DG-NodeAutoScalingGroup-YMZKZ4PATVAN
I0409 12:18:25.101811 1 scale_up.go:427] Estimated 1 nodes needed in platform-12-EKSStack-VM8YXAB3PSPQ-EKSWorker-JDRDZZHSQ4DG-NodeAutoScalingGroup-YMZKZ4PATVAN
I0409 12:18:25.101829 1 scale_up.go:529] Final scale-up plan: [{platform-12-EKSStack-VM8YXAB3PSPQ-EKSWorker-JDRDZZHSQ4DG-NodeAutoScalingGroup-YMZKZ4PATVAN 1->2 (max: 100)}]
I0409 12:18:25.101844 1 scale_up.go:694] Scale-up: setting group platform-12-EKSStack-VM8YXAB3PSPQ-EKSWorker-JDRDZZHSQ4DG-NodeAutoScalingGroup-YMZKZ4PATVAN size to 2
I0409 12:18:25.101877 1 auto_scaling_groups.go:211] Setting asg platform-12-EKSStack-VM8YXAB3PSPQ-EKSWorker-JDRDZZHSQ4DG-NodeAutoScalingGroup-YMZKZ4PATVAN size to 2
I0409 12:18:25.103025 1 factory.go:33] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"d36ebf64-7a3d-11ea-84a8-02afb12a2ce0", APIVersion:"v1", ResourceVersion:"36245", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: setting group platform-12-EKSStack-VM8YXAB3PSPQ-EKSWorker-JDRDZZHSQ4DG-NodeAutoScalingGroup-YMZKZ4PATVAN size to 2
I0409 12:18:25.242973 1 factory.go:33] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"d36ebf64-7a3d-11ea-84a8-02afb12a2ce0", APIVersion:"v1", ResourceVersion:"36245", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: group platform-12-EKSStack-VM8YXAB3PSPQ-EKSWorker-JDRDZZHSQ4DG-NodeAutoScalingGroup-YMZKZ4PATVAN size set to 2
I0409 12:18:25.242999 1 factory.go:33] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"nginx-deployment-ff6df446-6m5nw", UID:"330c9154-7a5c-11ea-84a8-02afb12a2ce0", APIVersion:"v1", ResourceVersion:"36267", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{platform-12-EKSStack-VM8YXAB3PSPQ-EKSWorker-JDRDZZHSQ4DG-NodeAutoScalingGroup-YMZKZ4PATVAN 1->2 (max: 100)}]
- Now wait for some time and you will see that a new node is launched (a way to watch the node join is shown after the log below).
I0409 12:19:55.902030 1 static_autoscaler.go:132] Starting main loop
I0409 12:19:55.908862 1 clusterstate.go:194] Scale up in group platform-12-EKSStack-VM8YXAB3PSPQ-EKSWorker-JDRDZZHSQ4DG-NodeAutoScalingGroup-YMZKZ4PATVAN finished successfully in 1m30.659303454s
I0409 12:19:55.908911 1 utils.go:564] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop.
...
I0409 12:19:55.909015 1 scale_down.go:407] Node ip-192-168-97-136.ap-south-1.compute.internal - utilization 0.805000
I0409 12:19:55.909028 1 scale_down.go:411] Node ip-192-168-97-136.ap-south-1.compute.internal is not suitable for removal - utilization too big (0.805000)
I0409 12:19:55.909038 1 scale_down.go:407] Node ip-192-168-81-240.ap-south-1.compute.internal - utilization 0.660492
I0409 12:19:55.909050 1 scale_down.go:411] Node ip-192-168-81-240.ap-south-1.compute.internal is not suitable for removal - utilization too big (0.660492)
I0409 12:19:55.909090 1 static_autoscaler.go:370] Scale down status: unneededOnly=true lastScaleUpTime=2020-04-09 12:18:25.015570456 +0000 UTC m=+12506.613065495 lastScaleDownDeleteTime=2020-04-09 09:50:48.908581166 +0000 UTC m=+3650.506076149 lastScaleDownFailTime=2020-04-09 08:50:18.045062814 +0000 UTC m=+19.642557723 scaleDownForbidden=false isDeleteInProgress=false
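- You can also watch the new node register with the cluster and move to the Ready state:

kubectl get nodes -w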
- Now change the replica count back to 1 and you will see the cluster scale down. If not, change the count to 0. The following are the logs for the scale-down.
1 static_autoscaler.go:132] Starting main loop
I0409 12:29:21.635394 1 utils.go:564] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop
...
I0409 12:29:21.635519 1 scale_down.go:407] Node ip-192-168-97-136.ap-south-1.compute.internal - utilization 0.805000
I0409 12:29:21.635531 1 scale_down.go:411] Node ip-192-168-97-136.ap-south-1.compute.internal is not suitable for removal - utilization too big (0.805000)
I0409 12:29:21.635541 1 scale_down.go:407] Node ip-192-168-81-240.ap-south-1.compute.internal - utilization 0.055000
I0409 12:29:21.635635 1 static_autoscaler.go:359] ip-192-168-81-240.ap-south-1.compute.internal is unneeded since 2020-04-09 12:29:21.628614023 +0000 UTC m=+13163.226108967 duration 0s
I0409 12:29:21.635650 1 static_autoscaler.go:370] Scale down status: unneededOnly=false lastScaleUpTime=2020-04-09 12:18:25.015570456 +0000 UTC m=+12506.613065495 lastScaleDownDeleteTime=2020-04-09 09:50:48.908581166 +0000 UTC m=+3650.506076149 lastScaleDownFailTime=2020-04-09 08:50:18.045062814 +0000 UTC m=+19.642557723 scaleDownForbidden=false isDeleteInProgress=false
I0409 12:29:21.635672 1 static_autoscaler.go:380] Starting scale down
I0409 12:29:21.635696 1 scale_down.go:600] ip-192-168-81-240.ap-south-1.compute.internal was unneeded for 0s
- You can see here that the utilization of the node decreases and the node becomes eligible for de-provisioning. There is a delay of 10m before the node is de-provisioned. Once it is removed, you will see the following logs.
I0409 12:39:26.865316 1 utils.go:564] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop
...
I0409 12:39:26.865437 1 scale_down.go:407] Node ip-192-168-81-240.ap-south-1.compute.internal - utilization 0.055000
I0409 12:39:26.865450 1 scale_down.go:407] Node ip-192-168-97-136.ap-south-1.compute.internal - utilization 0.805000
I0409 12:39:26.865457 1 scale_down.go:411] Node ip-192-168-97-136.ap-south-1.compute.internal is not suitable for removal - utilization too big (0.805000)
I0409 12:39:26.865528 1 static_autoscaler.go:359] ip-192-168-81-240.ap-south-1.compute.internal is unneeded since 2020-04-09 12:29:21.628614023 +0000 UTC m=+13163.226108967 duration 10m5.229828964s
...
I0409 12:39:26.865553 1 static_autoscaler.go:380] Starting scale down
I0409 12:39:26.865580 1 scale_down.go:600] ip-192-168-81-240.ap-south-1.compute.internal was unneeded for 10m5.229828964s
I0409 12:39:26.865639 1 scale_down.go:819] Scale-down: removing empty node ip-192-168-81-240.ap-south-1.compute.internal
I0409 12:39:26.865997 1 factory.go:33] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"d36ebf64-7a3d-11ea-84a8-02afb12a2ce0", APIVersion:"v1", ResourceVersion:"39753", FieldPath:""}): type: 'Normal' reason: 'ScaleDownEmpty' Scale-down: removing empty node ip-192-168-81-240.ap-south-1.compute.internal
I0409 12:39:26.873795 1 delete.go:64] Successfully added toBeDeletedTaint on node ip-192-168-81-240.ap-south-1.compute.internal
I0409 12:39:27.061899 1 auto_scaling_groups.go:269] Terminating EC2 instance: i-0edd1f4f6aa50b615
I0409 12:39:27.061921 1 aws_manager.go:180] Some ASG instances might have been deleted, forcing ASG list refresh
...
I0409 12:39:27.208857 1 factory.go:33] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-192-168-81-240.ap-south-1.compute.internal", UID:"5f017e4b-7a5c-11ea-84a8-02afb12a2ce0", APIVersion:"v1", ResourceVersion:"39777", FieldPath:""}): type: 'Normal' reason: 'ScaleDown' node removed by cluster autoscaler
I0409 12:39:27.208874 1 factory.go:33] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"d36ebf64-7a3d-11ea-84a8-02afb12a2ce0", APIVersion:"v1", ResourceVersion:"39753", FieldPath:""}): type: 'Normal' reason: 'ScaleDownEmpty' Scale-down: empty node ip-192-168-81-240.ap-south-1.compute.internal removed
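- Once you have verified both the scale-up and the scale-down behavior, you can remove the test deployment:

kubectl delete -f nginx.yaml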
Conclusion
As we saw in this use case, the Kubernetes cluster autoscaler brings together a solution that spans two different ecosystems: the infrastructure and the deployment platform. It connects scaling events in the deployment platform (k8s) to the controls (such as the ASG) in the hosting infrastructure (AWS).
Provisioning, management and capacity controls still lie in the infrastructure layer (AWS). Admins can stay focused on the infrastructure and specify scaling limits on infrastructure elements like the ASG. DevOps engineers, on the other hand, continue to automate deployments and specify resource requirements, which trigger scale-up or scale-down events from within the deployment platform ecosystem.
You can also read our blog on deploying a big data stack on Kubernetes here.
If you are interested in learning more, please reach out to us at solutions@cognitree.com or through the comments section below.