Please Don’t Evict My Pod: Priority and Budget Disruption
In this post, we cover the pod priority class, the pod disruption budget, and how these constructs relate to pod eviction. Let’s start with the pod priority class.
PriorityClass and Preemption
PriorityClass has been a stable Kubernetes object since version 1.14. It belongs to the scheduling API group and maps a priority class name to an integer priority value. PriorityClass is straightforward to understand: the higher the integer value, the higher the priority. For example, given one PriorityClass with a value of ten and another with a value of twenty, the latter holds the higher priority.
PriorityClass is a non-namespaced object with one optional boolean field named globalDefault. Only one PriorityClass object in a cluster can set globalDefault=true; its integer value then becomes the default priority of every pod in the K8s cluster that does not specify a priorityClassName in its pod definition. If no PriorityClass object has globalDefault=true, the default pod priority is zero.
If we later add an object with globalDefault=true, all new pods without a priorityClassName get a priority equal to the integer value of that PriorityClass object; however, the priority of existing pods remains zero. By default, a Kubernetes cluster ships with two PriorityClasses: system-cluster-critical and system-node-critical. system-node-critical is the highest available priority, even higher than system-cluster-critical.
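The defaulting rules described above can be written down as a small function. This is only a sketch of the rules as stated in this post, not Kubernetes source code: the dictionary of classes, the function name resolve_pod_priority, and the team-default class are all made up for illustration.

```python
from typing import Optional

# Hypothetical in-memory stand-in for the PriorityClass objects in a cluster:
# class name -> (value, globalDefault)
PriorityClasses = dict[str, tuple[int, bool]]


def resolve_pod_priority(priority_class_name: Optional[str],
                         classes: PriorityClasses) -> int:
    """Return the integer priority a newly created pod would receive."""
    if priority_class_name is not None:
        # An explicit priorityClassName must refer to an existing PriorityClass.
        if priority_class_name not in classes:
            raise KeyError(f"unknown PriorityClass: {priority_class_name}")
        return classes[priority_class_name][0]
    # No priorityClassName: fall back to the single globalDefault class, if any.
    for value, is_global_default in classes.values():
        if is_global_default:
            return value
    # No globalDefault class exists, so the default priority is zero.
    return 0


classes: PriorityClasses = {
    "high-priority": (1000000, False),
    "team-default": (1000, True),  # hypothetical globalDefault class
}

print(resolve_pod_priority("high-priority", classes))  # explicit class wins
print(resolve_pod_priority(None, classes))             # falls back to globalDefault
```

Note that the function only models pods created after the globalDefault class exists; as mentioned above, pods that already exist keep the priority they were given at creation time.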
Let’s see how the priority of a pod affects the behaviour of kube-scheduler and results in the eviction of other pods from a node. Kube-scheduler tries to schedule a newly created pod on the K8s cluster; however, if the resources the pod requires are not available on any node, the preemption logic comes into the picture. Based on the priority of the pending pod, kube-scheduler looks for a node where evicting lower-priority pods would free enough resources for the pod to run. The preemption process then evicts those low-priority pods from the node so that the high-priority pod can be scheduled there.
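To make the idea concrete, here is a toy model of victim selection on a single node. It is a heavy simplification and not the real kube-scheduler algorithm: it looks at CPU requests only, and the Pod class, the pick_victims function, and the numbers are assumptions made up for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pod:
    name: str
    priority: int
    cpu_millis: int  # CPU request in millicores


def pick_victims(node_capacity: int, running: list[Pod],
                 incoming: Pod) -> Optional[list[Pod]]:
    """Return the pods to evict so `incoming` fits, or None if preemption cannot help."""
    free = node_capacity - sum(p.cpu_millis for p in running)
    victims: list[Pod] = []
    # Only pods with a strictly lower priority than the incoming pod are
    # candidates; the lowest-priority ones are considered first.
    candidates = sorted((p for p in running if p.priority < incoming.priority),
                        key=lambda p: p.priority)
    for pod in candidates:
        if free >= incoming.cpu_millis:
            break  # enough room already, stop evicting
        victims.append(pod)
        free += pod.cpu_millis
    return victims if free >= incoming.cpu_millis else None


running = [Pod("batch", 10, 600), Pod("web", 100, 300)]
victims = pick_victims(1000, running, Pod("critical", 1000, 500))
print([p.name for p in victims])  # prints ['batch']
```

In this example the node has only 100 millicores free, so the lowest-priority pod is evicted and the higher-priority web pod survives. A pending pod whose priority is lower than everything on the node gets None back, which corresponds to the pod staying in the Pending state.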
A PriorityClass object has a field named preemptionPolicy, which defines how pods of that class behave with respect to preemption. Its default value is PreemptLowerPriority, which allows pods of that PriorityClass to preempt lower-priority pods. If preemptionPolicy is Never, pods of that PriorityClass do not preempt other pods. Let’s quickly look at an example of preempting and non-preempting classes:
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-preempting
value: 1000000
preemptionPolicy: PreemptLowerPriority
globalDefault: false
description: "This priority class will cause other lower priority pods to be preempted."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting
value: 1000000
preemptionPolicy: Never
globalDefault: false
description: "This priority class will not cause other pods to be preempted."
---
apiVersion: v1
kind: Pod
metadata:
  name: nginx-preempting
  labels:
    env: test
spec:
  containers:
  - name: nginx-preempting
    image: nginx
    imagePullPolicy: IfNotPresent
  priorityClassName: high-priority-preempting
---
apiVersion: v1
kind: Pod
metadata:
  name: nginx-nonpreempting
  labels:
    env: test
spec:
  containers:
  - name: nginx-nonpreempting
    image: nginx
    imagePullPolicy: IfNotPresent
  priorityClassName: high-priority-nonpreempting
Let’s pause on preemption here; we will revisit it after formalizing our understanding of the pod disruption budget.
Pod Disruption Budget
PodDisruptionBudget (PDB) is also a Kubernetes object, and it works at the application level. A PDB limits the number of pods of a replicated application that can be down simultaneously due to voluntary disruptions. It is an indicator of how much disruption an application can handle at a given time. One of the best use cases for a PDB is an application that requires quorum management, for example, ZooKeeper. Below is the definition of a PDB object that requires at least two pods to be available.
# policy/v1 is available since Kubernetes 1.21; policy/v1beta1 was removed in 1.25
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: zk-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: zookeeper
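The arithmetic behind a minAvailable budget like zk-pdb is simple enough to sketch. The function below is an illustration only; the name allowed_disruptions and the three-replica ZooKeeper ensemble in the example are assumptions, not part of the manifest above.

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Number of pods that may be voluntarily evicted right now."""
    # The budget never goes negative: if fewer pods than minAvailable are
    # healthy, no voluntary eviction is allowed at all.
    return max(0, healthy_pods - min_available)


# Assumed three-replica ZooKeeper ensemble protected by minAvailable: 2
print(allowed_disruptions(healthy_pods=3, min_available=2))  # prints 1
print(allowed_disruptions(healthy_pods=2, min_available=2))  # prints 0
```

With three healthy replicas, a node drain can evict one ZooKeeper pod at a time; once only two are healthy, further evictions are refused until a replacement becomes ready.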
Commands for PDB
kubectl get poddisruptionbudgets
kubectl get poddisruptionbudgets zk-pdb -o yaml
The PDB of an application is an important aspect that is taken into consideration when performing voluntary disruptions in a K8s cluster. A voluntary disruption is halted if it would violate the disruption budget of the app. PDBs are very helpful during cluster activities like node drain or rebalancing the K8s cluster with projects like Descheduler, but is a PDB useful in preemption too?
Preemption respects PDBs on a best-effort basis: the scheduler tries to find eviction victims whose removal does not violate the PDB of any application. Still, if no such option is available, preemption happens anyway and the PDB of an app is violated. For testing PDBs and eviction, you can try the kubectl-evict-pod plugin.
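The best-effort behaviour can be pictured as a preference rather than a hard rule. The sketch below is a simplified model, not the scheduler's actual code: the Candidate class and choose_node function are hypothetical, and breaking ties by the number of victims is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    node: str
    victims: list[str]   # pods that would be evicted on this node
    pdb_violations: int  # how many of those evictions would break a PDB


def choose_node(candidates: list[Candidate]) -> Optional[Candidate]:
    """Prefer the candidate with the fewest PDB violations, but never refuse."""
    if not candidates:
        return None
    # Fewest PDB violations first; the victim count is only a tie-breaker
    # (an assumption for this sketch).
    return min(candidates, key=lambda c: (c.pdb_violations, len(c.victims)))


options = [
    Candidate("node-a", ["zk-0"], pdb_violations=1),
    Candidate("node-b", ["batch-1", "batch-2"], pdb_violations=0),
]
print(choose_node(options).node)      # prints node-b
print(choose_node(options[:1]).node)  # prints node-a
```

The second call shows the "dishonor" case: when the only way to place the pending pod breaks a budget, the candidate is still chosen.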
Warning: In a cluster where not all users are trusted, a malicious user could create pods at the highest possible priorities, causing other pods to be evicted or left pending. Also, improper use of PriorityClass may lead to a cascading failure that eventually results in a production outage, like the one shared by the Grafana community.
How Pod PriorityClass, QoS Class, and Eviction Policy Are Linked
The PriorityClass and the QoS class of a pod are two independent and unrelated features. No rule ties the QoS class of a pod to its priority. Hence, when a high-priority pod needs to be scheduled, a pod of the Guaranteed QoS class can still be evicted simply because its priority is lower.
The only component that considers both QoS and Pod priority is kubelet out-of-resource eviction. The kubelet ranks Pods for eviction first by whether or not their usage of the starved resource exceeds requests, then by priority, and then by the consumption of the starved compute resource relative to the Pods’ scheduling requests.
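The ranking described above maps naturally onto a sort key. This is a sketch of the ordering as stated in this post, not kubelet code; the PodUsage class, the eviction_order function, and the sample numbers (memory in MiB) are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class PodUsage:
    name: str
    priority: int
    request: int  # requested amount of the starved resource, e.g. memory in MiB
    usage: int    # current usage of that resource


def eviction_order(pods: list[PodUsage]) -> list[PodUsage]:
    """Return pods sorted so that the first one is evicted first."""
    return sorted(pods, key=lambda p: (
        p.usage <= p.request,    # pods exceeding their request sort first (False < True)
        p.priority,              # then lower priority first
        -(p.usage - p.request),  # then largest usage above request first
    ))


pods = [
    PodUsage("guaranteed-db", priority=0, request=512, usage=400),
    PodUsage("burstable-api", priority=1000, request=256, usage=300),
    PodUsage("besteffort-job", priority=0, request=0, usage=200),
    PodUsage("burstable-cache", priority=0, request=100, usage=150),
]
print([p.name for p in eviction_order(pods)])
# prints ['besteffort-job', 'burstable-cache', 'burstable-api', 'guaranteed-db']
```

Notice that guaranteed-db has the lowest priority together with two other pods, yet it is ranked last because it stays within its request. This is where QoS-like behaviour and priority meet.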
Putting it All Together
Pod priority, QoS class, and eviction policy together form a balancing act in the K8s cluster. Adding new objects without considering their effect on the others can destabilize the cluster and lead to an outage. In another post, I will share some best practices that help manage the cluster state with fewer evictions.