Kubernetes Session 3
Automatic Cleanup for Finished Jobs
A time-to-live mechanism to clean up old Jobs that have finished execution.
ReplicaSet
A ReplicaSet's purpose is to maintain a stable set of replica Pods running at any given time. As such, it is often used to guarantee the availability of a specified number of identical Pods.
How a ReplicaSet works
A ReplicaSet is defined with fields, including a selector that specifies how to identify Pods it can acquire, a number of replicas indicating how many Pods it should be maintaining, and a pod template specifying the data of new Pods it should create to meet the number of replicas criteria. A ReplicaSet then fulfills its purpose by creating and deleting Pods as needed to reach the desired number. When a ReplicaSet needs to create new Pods, it uses its Pod template.
A ReplicaSet is linked to its Pods via the Pods' metadata.ownerReferences field, which specifies what resource the current object is owned by. All Pods acquired by a ReplicaSet have their owning ReplicaSet's identifying information within their ownerReferences field. It's through this link that the ReplicaSet knows of the state of the Pods it is maintaining and plans accordingly.
A ReplicaSet identifies new Pods to acquire by using its selector. If there is a Pod that has no OwnerReference or the OwnerReference is not a Controller and it matches a ReplicaSet's selector, it will be immediately acquired by said ReplicaSet.
When to use a ReplicaSet
A ReplicaSet ensures that a specified number of pod replicas are running at any given time. However, a Deployment is a higher-level concept that manages ReplicaSets and provides declarative updates to Pods along with a lot of other useful features. Therefore, we recommend using Deployments instead of directly using ReplicaSets, unless you require custom update orchestration or don't require updates at all.
This actually means that you may never need to manipulate ReplicaSet objects: use a Deployment instead, and define your application in the spec section.
Example
Saving this manifest into frontend.yaml
and submitting it to a Kubernetes cluster will create the defined ReplicaSet and the Pods that it manages.
You can then get the current ReplicaSets deployed:
And see the frontend one you created:
You can also check on the state of the ReplicaSet:
And you will see output similar to:
And lastly you can check for the Pods brought up:
You should see Pod information similar to:
You can also verify that the owner reference of these pods is set to the frontend ReplicaSet. To do this, get the yaml of one of the Pods running:
The output will look similar to this, with the frontend ReplicaSet's info set in the metadata's ownerReferences field:
Non-Template Pod acquisitions
While you can create bare Pods with no problems, it is strongly recommended to make sure that the bare Pods do not have labels which match the selector of one of your ReplicaSets. The reason for this is because a ReplicaSet is not limited to owning Pods specified by its template-- it can acquire other Pods in the manner specified in the previous sections.
Take the previous frontend ReplicaSet example, and the Pods specified in the following manifest:
As those Pods do not have a Controller (or any object) as their owner reference and match the selector of the frontend ReplicaSet, they will immediately be acquired by it.
Suppose you create the Pods after the frontend ReplicaSet has been deployed and has set up its initial Pod replicas to fulfill its replica count requirement:
The new Pods will be acquired by the ReplicaSet, and then immediately terminated as the ReplicaSet would be over its desired count.
Fetching the Pods:
The output shows that the new Pods are either already terminated, or in the process of being terminated:
If you create the Pods first:
And then create the ReplicaSet however:
You shall see that the ReplicaSet has acquired the Pods and has only created new ones according to its spec until the number of its new Pods and the original matches its desired count. As fetching the Pods:
Will reveal in its output:
In this manner, a ReplicaSet can own a non-homogenous set of Pods
Writing a ReplicaSet manifest
As with all other Kubernetes API objects, a ReplicaSet needs the apiVersion
, kind
, and metadata
fields. For ReplicaSets, the kind
is always a ReplicaSet.
When the control plane creates new Pods for a ReplicaSet, the .metadata.name
of the ReplicaSet is part of the basis for naming those Pods. The name of a ReplicaSet must be a valid DNS subdomain value, but this can produce unexpected results for the Pod hostnames. For best compatibility, the name should follow the more restrictive rules for a DNS label.
A ReplicaSet also needs a .spec
section.
Pod Template
The .spec.template
is a pod template which is also required to have labels in place. In our frontend.yaml
example we had one label: tier: frontend
. Be careful not to overlap with the selectors of other controllers, lest they try to adopt this Pod.
For the template's restart policy field, .spec.template.spec.restartPolicy
, the only allowed value is Always
, which is the default.
Pod Selector
The .spec.selector
field is a label selector. As discussed earlier these are the labels used to identify potential Pods to acquire. In our frontend.yaml
example, the selector was:
In the ReplicaSet, .spec.template.metadata.labels
must match spec.selector
, or it will be rejected by the API.
Note: For 2 ReplicaSets specifying the same .spec.selector
but different .spec.template.metadata.labels
and .spec.template.spec
fields, each ReplicaSet ignores the Pods created by the other ReplicaSet.
Replicas
You can specify how many Pods should run concurrently by setting .spec.replicas
. The ReplicaSet will create/delete its Pods to match this number.
If you do not specify .spec.replicas
, then it defaults to 1.
Working with ReplicaSets
Deleting a ReplicaSet and its Pods
To delete a ReplicaSet and all of its Pods, use kubectl delete
. The Garbage collector automatically deletes all of the dependent Pods by default.
When using the REST API or the client-go
library, you must set propagationPolicy
to Background
or Foreground
in the -d
option. For example:
Deleting just a ReplicaSet
You can delete a ReplicaSet without affecting any of its Pods using kubectl delete
with the --cascade=orphan
option. When using the REST API or the client-go
library, you must set propagationPolicy
to Orphan
. For example:
Once the original is deleted, you can create a new ReplicaSet to replace it. As long as the old and new .spec.selector
are the same, then the new one will adopt the old Pods. However, it will not make any effort to make existing Pods match a new, different pod template. To update Pods to a new spec in a controlled way, use a Deployment, as ReplicaSets do not support a rolling update directly.
Isolating Pods from a ReplicaSet
You can remove Pods from a ReplicaSet by changing their labels. This technique may be used to remove Pods from service for debugging, data recovery, etc. Pods that are removed in this way will be replaced automatically ( assuming that the number of replicas is not also changed).
Scaling a ReplicaSet
A ReplicaSet can be easily scaled up or down by simply updating the .spec.replicas
field. The ReplicaSet controller ensures that a desired number of Pods with a matching label selector are available and operational.
When scaling down, the ReplicaSet controller chooses which pods to delete by sorting the available pods to prioritize scaling down pods based on the following general algorithm:
Pending (and unschedulable) pods are scaled down first
If
controller.kubernetes.io/pod-deletion-cost
annotation is set, then the pod with the lower value will come first.Pods on nodes with more replicas come before pods on nodes with fewer replicas.
If the pods' creation times differ, the pod that was created more recently comes before the older pod (the creation times are bucketed on an integer log scale when the
LogarithmicScaleDown
feature gate is enabled)
If all of the above match, then selection is random.
Pod deletion cost
FEATURE STATE: Kubernetes v1.22 [beta]
Using the controller.kubernetes.io/pod-deletion-cost
annotation, users can set a preference regarding which pods to remove first when downscaling a ReplicaSet.
The annotation should be set on the pod, the range is [-2147483647, 2147483647]. It represents the cost of deleting a pod compared to other pods belonging to the same ReplicaSet. Pods with lower deletion cost are preferred to be deleted before pods with higher deletion cost.
The implicit value for this annotation for pods that don't set it is 0; negative values are permitted. Invalid values will be rejected by the API server.
This feature is beta and enabled by default. You can disable it using the feature gate PodDeletionCost
in both kube-apiserver and kube-controller-manager.
Note:
This is honored on a best-effort basis, so it does not offer any guarantees on pod deletion order.
Users should avoid updating the annotation frequently, such as updating it based on a metric value, because doing so will generate a significant number of pod updates on the apiserver.
Example Use Case
The different pods of an application could have different utilization levels. On scale down, the application may prefer to remove the pods with lower utilization. To avoid frequently updating the pods, the application should update controller.kubernetes.io/pod-deletion-cost
once before issuing a scale down (setting the annotation to a value proportional to pod utilization level). This works if the application itself controls the down scaling; for example, the driver pod of a Spark deployment.
ReplicaSet as a Horizontal Pod Autoscaler Target
A ReplicaSet can also be a target for Horizontal Pod Autoscalers (HPA). That is, a ReplicaSet can be auto-scaled by an HPA. Here is an example HPA targeting the ReplicaSet we created in the previous example.
Saving this manifest into hpa-rs.yaml
and submitting it to a Kubernetes cluster should create the defined HPA that autoscales the target ReplicaSet depending on the CPU usage of the replicated Pods.
Alternatively, you can use the kubectl autoscale
command to accomplish the same (and it's easier!)
ReplicationController
Note: A Deployment
that configures a ReplicaSet
is now the recommended way to set up replication.
A ReplicationController ensures that a specified number of pod replicas are running at any one time. In other words, a ReplicationController makes sure that a pod or a homogeneous set of pods is always up and available.
How a ReplicationController Works
If there are too many pods, the ReplicationController terminates the extra pods. If there are too few, the ReplicationController starts more pods. Unlike manually created pods, the pods maintained by a ReplicationController are automatically replaced if they fail, are deleted, or are terminated. For example, your pods are re-created on a node after disruptive maintenance such as a kernel upgrade. For this reason, you should use a ReplicationController even if your application requires only a single pod. A ReplicationController is similar to a process supervisor, but instead of supervising individual processes on a single node, the ReplicationController supervises multiple pods across multiple nodes.
ReplicationController is often abbreviated to "rc" in discussion, and as a shortcut in kubectl commands.
A simple case is to create one ReplicationController object to reliably run one instance of a Pod indefinitely. A more complex use case is to run several identical replicas of a replicated service, such as web servers.
Running an example ReplicationController
This example ReplicationController config runs three copies of the nginx web server.
Run the example job by downloading the example file and then running this command:
The output is similar to this:
Check on the status of the ReplicationController using this command:
The output is similar to this:
Here, three pods are created, but none is running yet, perhaps because the image is being pulled. A little later, the same command may show:
To list all the pods that belong to the ReplicationController in a machine readable form, you can use a command like this:
The output is similar to this:
Here, the selector is the same as the selector for the ReplicationController (seen in the kubectl describe
output), and in a different form in replication.yaml
. The --output=jsonpath
option specifies an expression with the name from each pod in the returned list.
Writing a ReplicationController Manifest
As with all other Kubernetes config, a ReplicationController needs apiVersion
, kind
, and metadata
fields.
When the control plane creates new Pods for a ReplicationController, the .metadata.name
of the ReplicationController is part of the basis for naming those Pods. The name of a ReplicationController must be a valid DNS subdomain value, but this can produce unexpected results for the Pod hostnames. For best compatibility, the name should follow the more restrictive rules for a DNS label.
For general information about working with configuration files, see object management.
A ReplicationController also needs a .spec
section.
Pod Template
The .spec.template
is the only required field of the .spec
.
The .spec.template
is a pod template. It has exactly the same schema as a Pod, except it is nested and does not have an apiVersion
or kind
.
In addition to required fields for a Pod, a pod template in a ReplicationController must specify appropriate labels and an appropriate restart policy. For labels, make sure not to overlap with other controllers. See pod selector.
Only a .spec.template.spec.restartPolicy
equal to Always
is allowed, which is the default if not specified.
For local container restarts, ReplicationControllers delegate to an agent on the node, for example the Kubelet.
Labels on the ReplicationController
The ReplicationController can itself have labels (.metadata.labels
). Typically, you would set these the same as the .spec.template.metadata.labels
; if .metadata.labels
is not specified then it defaults to .spec.template.metadata.labels
. However, they are allowed to be different, and the .metadata.labels
do not affect the behavior of the ReplicationController.
Pod Selector
The .spec.selector
field is a label selector. A ReplicationController manages all the pods with labels that match the selector. It does not distinguish between pods that it created or deleted and pods that another person or process created or deleted. This allows the ReplicationController to be replaced without affecting the running pods.
If specified, the .spec.template.metadata.labels
must be equal to the .spec.selector
, or it will be rejected by the API. If .spec.selector
is unspecified, it will be defaulted to .spec.template.metadata.labels
.
Also you should not normally create any pods whose labels match this selector, either directly, with another ReplicationController, or with another controller such as Job. If you do so, the ReplicationController thinks that it created the other pods. Kubernetes does not stop you from doing this.
If you do end up with multiple controllers that have overlapping selectors, you will have to manage the deletion yourself (see below).
Multiple Replicas
You can specify how many pods should run concurrently by setting .spec.replicas
to the number of pods you would like to have running concurrently. The number running at any time may be higher or lower, such as if the replicas were just increased or decreased, or if a pod is gracefully shutdown, and a replacement starts early.
If you do not specify .spec.replicas
, then it defaults to 1.
Working with ReplicationControllers
Deleting a ReplicationController and its Pods
To delete a ReplicationController and all its pods, use kubectl delete
. Kubectl will scale the ReplicationController to zero and wait for it to delete each pod before deleting the ReplicationController itself. If this kubectl command is interrupted, it can be restarted.
When using the REST API or client library, you need to do the steps explicitly (scale replicas to 0, wait for pod deletions, then delete the ReplicationController).
Deleting only a ReplicationController
You can delete a ReplicationController without affecting any of its pods.
Using kubectl, specify the --cascade=orphan
option to kubectl delete
.
When using the REST API or client library, you can delete the ReplicationController object.
Once the original is deleted, you can create a new ReplicationController to replace it. As long as the old and new .spec.selector
are the same, then the new one will adopt the old pods. However, it will not make any effort to make existing pods match a new, different pod template. To update pods to a new spec in a controlled way, use a rolling update.
Isolating pods from a ReplicationController
Pods may be removed from a ReplicationController's target set by changing their labels. This technique may be used to remove pods from service for debugging and data recovery. Pods that are removed in this way will be replaced automatically (assuming that the number of replicas is not also changed).
Common usage patterns
Rescheduling
As mentioned above, whether you have 1 pod you want to keep running, or 1000, a ReplicationController will ensure that the specified number of pods exists, even in the event of node failure or pod termination (for example, due to an action by another control agent).
Scaling
The ReplicationController enables scaling the number of replicas up or down, either manually or by an auto-scaling control agent, by updating the replicas
field.
Rolling updates
The ReplicationController is designed to facilitate rolling updates to a service by replacing pods one-by-one.
As explained in #1353, the recommended approach is to create a new ReplicationController with 1 replica, scale the new (+1) and old (-1) controllers one by one, and then delete the old controller after it reaches 0 replicas. This predictably updates the set of pods regardless of unexpected failures.
Ideally, the rolling update controller would take application readiness into account, and would ensure that a sufficient number of pods were productively serving at any given time.
The two ReplicationControllers would need to create pods with at least one differentiating label, such as the image tag of the primary container of the pod, since it is typically image updates that motivate rolling updates.
Multiple release tracks
In addition to running multiple releases of an application while a rolling update is in progress, it's common to run multiple releases for an extended period of time, or even continuously, using multiple release tracks. The tracks would be differentiated by labels.
For instance, a service might target all pods with tier in (frontend), environment in (prod)
. Now say you have 10 replicated pods that make up this tier. But you want to be able to 'canary' a new version of this component. You could set up a ReplicationController with replicas
set to 9 for the bulk of the replicas, with labels tier=frontend, environment=prod, track=stable
, and another ReplicationController with replicas
set to 1 for the canary, with labels tier=frontend, environment=prod, track=canary
. Now the service is covering both the canary and non-canary pods. But you can mess with the ReplicationControllers separately to test things out, monitor the results, etc.
Using ReplicationControllers with Services
Multiple ReplicationControllers can sit behind a single service, so that, for example, some traffic goes to the old version, and some goes to the new version.
A ReplicationController will never terminate on its own, but it isn't expected to be as long-lived as services. Services may be composed of pods controlled by multiple ReplicationControllers, and it is expected that many ReplicationControllers may be created and destroyed over the lifetime of a service (for instance, to perform an update of pods that run the service). Both services themselves and their clients should remain oblivious to the ReplicationControllers that maintain the pods of the services.
Writing programs for Replication
Pods created by a ReplicationController are intended to be fungible and semantically identical, though their configurations may become heterogeneous over time. This is an obvious fit for replicated stateless servers, but ReplicationControllers can also be used to maintain availability of master-elected, sharded, and worker-pool applications. Such applications should use dynamic work assignment mechanisms, such as the RabbitMQ work queues, as opposed to static/one-time customization of the configuration of each pod, which is considered an anti-pattern. Any pod customization performed, such as vertical auto-sizing of resources (for example, cpu or memory), should be performed by another online controller process, not unlike the ReplicationController itself.
Responsibilities of the ReplicationController
The ReplicationController ensures that the desired number of pods matches its label selector and are operational. Currently, only terminated pods are excluded from its count. In the future, readiness and other information available from the system may be taken into account, we may add more controls over the replacement policy, and we plan to emit events that could be used by external clients to implement arbitrarily sophisticated replacement and/or scale-down policies.
The ReplicationController is forever constrained to this narrow responsibility. It itself will not perform readiness nor liveness probes. Rather than performing auto-scaling, it is intended to be controlled by an external auto-scaler (as discussed in #492), which would change its replicas
field. We will not add scheduling policies (for example, spreading) to the ReplicationController. Nor should it verify that the pods controlled match the currently specified template, as that would obstruct auto-sizing and other automated processes. Similarly, completion deadlines, ordering dependencies, configuration expansion, and other features belong elsewhere. We even plan to factor out the mechanism for bulk pod creation (#170).
The ReplicationController is intended to be a composable building-block primitive. We expect higher-level APIs and/or tools to be built on top of it and other complementary primitives for user convenience in the future. The "macro" operations currently supported by kubectl (run, scale) are proof-of-concept examples of this. For instance, we could imagine something like Asgard managing ReplicationControllers, auto-scalers, services, scheduling policies, canaries, etc.
Deployments
A Deployment provides declarative updates for Pods and ReplicaSets.
You describe a desired state in a Deployment, and the Deployment Controller changes the actual state to the desired state at a controlled rate. You can define Deployments to create new ReplicaSets, or to remove existing Deployments and adopt all their resources with new Deployments.
Note: Do not manage ReplicaSets owned by a Deployment. Consider opening an issue in the main Kubernetes repository if your use case is not covered below.
Use Case
The following are typical use cases for Deployments:
Create a Deployment to rollout a ReplicaSet. The ReplicaSet creates Pods in the background. Check the status of the rollout to see if it succeeds or not.
Declare the new state of the Pods by updating the PodTemplateSpec of the Deployment. A new ReplicaSet is created and the Deployment manages moving the Pods from the old ReplicaSet to the new one at a controlled rate. Each new ReplicaSet updates the revision of the Deployment.
Rollback to an earlier Deployment revision if the current state of the Deployment is not stable. Each rollback updates the revision of the Deployment.
Pause the rollout of a Deployment to apply multiple fixes to its PodTemplateSpec and then resume it to start a new rollout.
Use the status of the Deployment as an indicator that a rollout has stuck.
Clean up older ReplicaSets that you don't need anymore.
Creating a Deployment
Before creating a Deployment define an environment variable for a container.
The following is an example of a Deployment. It creates a ReplicaSet to bring up three nginx
Pods:
controllers/nginx-deployment.yaml
In this example:
A Deployment named
nginx-deployment
is created, indicated by the.metadata.name
field. This name will become the basis for the ReplicaSets and Pods which are created later. See Writing a Deployment Spec for more details.The Deployment creates a ReplicaSet that creates three replicated Pods, indicated by the
.spec.replicas
field.The
.spec.selector
field defines how the created ReplicaSet finds which Pods to manage. In this case, you select a label that is defined in the Pod template (app: nginx
). However, more sophisticated selection rules are possible, as long as the Pod template itself satisfies the rule.Note: The
.spec.selector.matchLabels
field is a map of {key,value} pairs. A single {key,value} in thematchLabels
map is equivalent to an element ofmatchExpressions
, whosekey
field is "key", theoperator
is "In", and thevalues
array contains only "value". All of the requirements, from bothmatchLabels
andmatchExpressions
, must be satisfied in order to match.The
template
field contains the following sub-fields:The Pods are labeled
app: nginx
using the.metadata.labels
field.The Pod template's specification, or
.template.spec
field, indicates that the Pods run one container,nginx
, which runs thenginx
Docker Hub image at version 1.14.2.Create one container and name it
nginx
using the.spec.template.spec.containers[0].name
field.
Before you begin, make sure your Kubernetes cluster is up and running. Follow the steps given below to create the above Deployment:
Create the Deployment by running the following command:
Run
kubectl get deployments
to check if the Deployment was created.If the Deployment is still being created, the output is similar to the following:
When you inspect the Deployments in your cluster, the following fields are displayed:
NAME
lists the names of the Deployments in the namespace.READY
displays how many replicas of the application are available to your users. It follows the pattern ready/desired.UP-TO-DATE
displays the number of replicas that have been updated to achieve the desired state.AVAILABLE
displays how many replicas of the application are available to your users.AGE
displays the amount of time that the application has been running.
Notice how the number of desired replicas is 3 according to
.spec.replicas
field.To see the Deployment rollout status, run
kubectl rollout status deployment/nginx-deployment
.The output is similar to:
Run the
kubectl get deployments
again a few seconds later. The output is similar to this:Notice that the Deployment has created all three replicas, and all replicas are up-to-date (they contain the latest Pod template) and available.
To see the ReplicaSet (
rs
) created by the Deployment, runkubectl get rs
. The output is similar to this:ReplicaSet output shows the following fields:
NAME
lists the names of the ReplicaSets in the namespace.DESIRED
displays the desired number of replicas of the application, which you define when you create the Deployment. This is the desired state.CURRENT
displays how many replicas are currently running.READY
displays how many replicas of the application are available to your users.AGE
displays the amount of time that the application has been running.
Notice that the name of the ReplicaSet is always formatted as
[DEPLOYMENT-NAME]-[HASH]
. This name will become the basis for the Pods which are created.The
HASH
string is the same as thepod-template-hash
label on the ReplicaSet.To see the labels automatically generated for each Pod, run
kubectl get pods --show-labels
. The output is similar to:The created ReplicaSet ensures that there are three
nginx
Pods.
Note:
You must specify an appropriate selector and Pod template labels in a Deployment (in this case, app: nginx
).
Do not overlap labels or selectors with other controllers (including other Deployments and StatefulSets). Kubernetes doesn't stop you from overlapping, and if multiple controllers have overlapping selectors those controllers might conflict and behave unexpectedly.
Pod-template-hash label
Caution: Do not change this label.
The pod-template-hash
label is added by the Deployment controller to every ReplicaSet that a Deployment creates or adopts.
This label ensures that child ReplicaSets of a Deployment do not overlap. It is generated by hashing the PodTemplate
of the ReplicaSet and using the resulting hash as the label value that is added to the ReplicaSet selector, Pod template labels, and in any existing Pods that the ReplicaSet might have.
Updating a Deployment
Note: A Deployment's rollout is triggered if and only if the Deployment's Pod template (that is, .spec.template
) is changed, for example if the labels or container images of the template are updated. Other updates, such as scaling the Deployment, do not trigger a rollout.
Follow the steps given below to update your Deployment:
Let's update the nginx Pods to use the
nginx:1.16.1
image instead of thenginx:1.14.2
image.or use the following command:
where
deployment/nginx-deployment
indicates the Deployment,nginx
indicates the Container the update will take place andnginx:1.16.1
indicates the new image and its tag.The output is similar to:
Alternatively, you can
edit
the Deployment and change.spec.template.spec.containers[0].image
fromnginx:1.14.2
tonginx:1.16.1
:The output is similar to:
To see the rollout status, run:
The output is similar to this:
or
Get more details on your updated Deployment:
After the rollout succeeds, you can view the Deployment by running
kubectl get deployments
. The output is similar to this:Run
kubectl get rs
to see that the Deployment updated the Pods by creating a new ReplicaSet and scaling it up to 3 replicas, as well as scaling down the old ReplicaSet to 0 replicas.The output is similar to this:
Running
get pods
should now show only the new Pods:The output is similar to this:
Next time you want to update these Pods, you only need to update the Deployment's Pod template again.
Deployment ensures that only a certain number of Pods are down while they are being updated. By default, it ensures that at least 75% of the desired number of Pods are up (25% max unavailable).
Deployment also ensures that only a certain number of Pods are created above the desired number of Pods. By default, it ensures that at most 125% of the desired number of Pods are up (25% max surge).
For example, if you look at the above Deployment closely, you will see that it first creates a new Pod, then deletes an old Pod, and creates another new one. It does not kill old Pods until a sufficient number of new Pods have come up, and does not create new Pods until a sufficient number of old Pods have been killed. It makes sure that at least 3 Pods are available and that at max 4 Pods in total are available. In case of a Deployment with 4 replicas, the number of Pods would be between 3 and 5.
Get details of your Deployment:
The output is similar to this:
Here you see that when you first created the Deployment, it created a ReplicaSet (nginx-deployment-2035384211) and scaled it up to 3 replicas directly. When you updated the Deployment, it created a new ReplicaSet (nginx-deployment-1564180365) and scaled it up to 1 and waited for it to come up. Then it scaled down the old ReplicaSet to 2 and scaled up the new ReplicaSet to 2 so that at least 3 Pods were available and at most 4 Pods were created at all times. It then continued scaling up and down the new and the old ReplicaSet, with the same rolling update strategy. Finally, you'll have 3 available replicas in the new ReplicaSet, and the old ReplicaSet is scaled down to 0.
Note: Kubernetes doesn't count terminating Pods when calculating the number of availableReplicas
, which must be between replicas - maxUnavailable
and replicas + maxSurge
. As a result, you might notice that there are more Pods than expected during a rollout, and that the total resources consumed by the Deployment is more than replicas + maxSurge
until the terminationGracePeriodSeconds
of the terminating Pods expires.
Rollover (aka multiple updates in-flight)
Each time a new Deployment is observed by the Deployment controller, a ReplicaSet is created to bring up the desired Pods. If the Deployment is updated, the existing ReplicaSet that controls Pods whose labels match .spec.selector
but whose template does not match .spec.template
are scaled down. Eventually, the new ReplicaSet is scaled to .spec.replicas
and all old ReplicaSets is scaled to 0.
If you update a Deployment while an existing rollout is in progress, the Deployment creates a new ReplicaSet as per the update and start scaling that up, and rolls over the ReplicaSet that it was scaling up previously -- it will add it to its list of old ReplicaSets and start scaling it down.
For example, suppose you create a Deployment to create 5 replicas of nginx:1.14.2
, but then update the Deployment to create 5 replicas of nginx:1.16.1
, when only 3 replicas of nginx:1.14.2
had been created. In that case, the Deployment immediately starts killing the 3 nginx:1.14.2
Pods that it had created, and starts creating nginx:1.16.1
Pods. It does not wait for the 5 replicas of nginx:1.14.2
to be created before changing course.
Label selector updates
It is generally discouraged to make label selector updates and it is suggested to plan your selectors up front. In any case, if you need to perform a label selector update, exercise great caution and make sure you have grasped all of the implications.
Note: In API version apps/v1
, a Deployment's label selector is immutable after it gets created.
Selector additions require the Pod template labels in the Deployment spec to be updated with the new label too, otherwise a validation error is returned. This change is a non-overlapping one, meaning that the new selector does not select ReplicaSets and Pods created with the old selector, resulting in orphaning all old ReplicaSets and creating a new ReplicaSet.
Selector updates changes the existing value in a selector key -- result in the same behavior as additions.
Selector removals removes an existing key from the Deployment selector -- do not require any changes in the Pod template labels. Existing ReplicaSets are not orphaned, and a new ReplicaSet is not created, but note that the removed label still exists in any existing Pods and ReplicaSets.
Rolling Back a Deployment
Sometimes, you may want to rollback a Deployment; for example, when the Deployment is not stable, such as crash looping. By default, all of the Deployment's rollout history is kept in the system so that you can rollback anytime you want (you can change that by modifying revision history limit).
Note: A Deployment's revision is created when a Deployment's rollout is triggered. This means that the new revision is created if and only if the Deployment's Pod template (.spec.template
) is changed, for example if you update the labels or container images of the template. Other updates, such as scaling the Deployment, do not create a Deployment revision, so that you can facilitate simultaneous manual- or auto-scaling. This means that when you roll back to an earlier revision, only the Deployment's Pod template part is rolled back.
Suppose that you made a typo while updating the Deployment, by putting the image name as
nginx:1.161
instead ofnginx:1.16.1
:The output is similar to this:
The rollout gets stuck. You can verify it by checking the rollout status:
The output is similar to this:
Press Ctrl-C to stop the above rollout status watch. For more information on stuck rollouts, read more here.
You see that the number of old replicas (
nginx-deployment-1564180365
andnginx-deployment-2035384211
) is 2, and new replicas (nginx-deployment-3066724191) is 1.The output is similar to this:
Looking at the Pods created, you see that 1 Pod created by new ReplicaSet is stuck in an image pull loop.
The output is similar to this:
Note: The Deployment controller stops the bad rollout automatically, and stops scaling up the new ReplicaSet. This depends on the rollingUpdate parameters (
maxUnavailable
specifically) that you have specified. Kubernetes by default sets the value to 25%.Get the description of the Deployment:
The output is similar to this:
To fix this, you need to rollback to a previous revision of Deployment that is stable.
Checking Rollout History of a Deployment
Follow the steps given below to check the rollout history:
First, check the revisions of this Deployment:
The output is similar to this:
CHANGE-CAUSE
is copied from the Deployment annotationkubernetes.io/change-cause
to its revisions upon creation. You can specify theCHANGE-CAUSE
message by:Annotating the Deployment with
kubectl annotate deployment/nginx-deployment kubernetes.io/change-cause="image updated to 1.16.1"
Manually editing the manifest of the resource.
To see the details of each revision, run:
The output is similar to this:
Rolling Back to a Previous Revision
Follow the steps given below to rollback the Deployment from the current version to the previous version, which is version 2.
Now you've decided to undo the current rollout and rollback to the previous revision:
The output is similar to this:
Alternatively, you can rollback to a specific revision by specifying it with
--to-revision
:The output is similar to this:
For more details about rollout related commands, read
kubectl rollout
.The Deployment is now rolled back to a previous stable revision. As you can see, a
DeploymentRollback
event for rolling back to revision 2 is generated from Deployment controller.Check if the rollback was successful and the Deployment is running as expected, run:
The output is similar to this:
Get the description of the Deployment:
The output is similar to this:
Scaling a Deployment
You can scale a Deployment by using the following command:
The output is similar to this:
Assuming horizontal Pod autoscaling is enabled in your cluster, you can set up an autoscaler for your Deployment and choose the minimum and maximum number of Pods you want to run based on the CPU utilization of your existing Pods.
The output is similar to this:
Proportional scaling
RollingUpdate Deployments support running multiple versions of an application at the same time. When you or an autoscaler scales a RollingUpdate Deployment that is in the middle of a rollout (either in progress or paused), the Deployment controller balances the additional replicas in the existing active ReplicaSets (ReplicaSets with Pods) in order to mitigate risk. This is called proportional scaling.
For example, you are running a Deployment with 10 replicas, maxSurge=3, and maxUnavailable=2.
Ensure that the 10 replicas in your Deployment are running.
The output is similar to this:
You update to a new image which happens to be unresolvable from inside the cluster.
The output is similar to this:
The image update starts a new rollout with ReplicaSet nginx-deployment-1989198191, but it's blocked due to the
maxUnavailable
requirement that you mentioned above. Check out the rollout status:The output is similar to this:
Then a new scaling request for the Deployment comes along. The autoscaler increments the Deployment replicas to 15. The Deployment controller needs to decide where to add these new 5 replicas. If you weren't using proportional scaling, all 5 of them would be added in the new ReplicaSet. With proportional scaling, you spread the additional replicas across all ReplicaSets. Bigger proportions go to the ReplicaSets with the most replicas and lower proportions go to ReplicaSets with less replicas. Any leftovers are added to the ReplicaSet with the most replicas. ReplicaSets with zero replicas are not scaled up.
In our example above, 3 replicas are added to the old ReplicaSet and 2 replicas are added to the new ReplicaSet. The rollout process should eventually move all replicas to the new ReplicaSet, assuming the new replicas become healthy. To confirm this, run:
The output is similar to this:
The rollout status confirms how the replicas were added to each ReplicaSet.
The output is similar to this:
Pausing and Resuming a rollout of a Deployment
When you update a Deployment, or plan to, you can pause rollouts for that Deployment before you trigger one or more updates. When you're ready to apply those changes, you resume rollouts for the Deployment. This approach allows you to apply multiple fixes in between pausing and resuming without triggering unnecessary rollouts.
For example, with a Deployment that was created:
Get the Deployment details:
The output is similar to this:
Get the rollout status:
The output is similar to this:
Pause by running the following command:
The output is similar to this:
Then update the image of the Deployment:
The output is similar to this:
Notice that no new rollout started:
The output is similar to this:
Get the rollout status to verify that the existing ReplicaSet has not changed:
The output is similar to this:
You can make as many updates as you wish, for example, update the resources that will be used:
The output is similar to this:
The initial state of the Deployment prior to pausing its rollout will continue its function, but new updates to the Deployment will not have any effect as long as the Deployment rollout is paused.
Eventually, resume the Deployment rollout and observe a new ReplicaSet coming up with all the new updates:
The output is similar to this:
Watch the status of the rollout until it's done.
The output is similar to this:
Get the status of the latest rollout:
The output is similar to this:
Note: You cannot rollback a paused Deployment until you resume it.
Deployment status
A Deployment enters various states during its lifecycle. It can be progressing while rolling out a new ReplicaSet, it can be complete, or it can fail to progress.
Progressing Deployment
Kubernetes marks a Deployment as progressing when one of the following tasks is performed:
The Deployment creates a new ReplicaSet.
The Deployment is scaling up its newest ReplicaSet.
The Deployment is scaling down its older ReplicaSet(s).
New Pods become ready or available (ready for at least MinReadySeconds).
When the rollout becomes “progressing”, the Deployment controller adds a condition with the following attributes to the Deployment's .status.conditions
:
type: Progressing
status: "True"
reason: NewReplicaSetCreated
|reason: FoundNewReplicaSet
|reason: ReplicaSetUpdated
You can monitor the progress for a Deployment by using kubectl rollout status
.
Complete Deployment
Kubernetes marks a Deployment as complete when it has the following characteristics:
All of the replicas associated with the Deployment have been updated to the latest version you've specified, meaning any updates you've requested have been completed.
All of the replicas associated with the Deployment are available.
No old replicas for the Deployment are running.
When the rollout becomes “complete”, the Deployment controller sets a condition with the following attributes to the Deployment's .status.conditions
:
type: Progressing
status: "True"
reason: NewReplicaSetAvailable
This Progressing
condition will retain a status value of "True"
until a new rollout is initiated. The condition holds even when availability of replicas changes (which does instead affect the Available
condition).
You can check if a Deployment has completed by using kubectl rollout status
. If the rollout completed successfully, kubectl rollout status
returns a zero exit code.
The output is similar to this:
and the exit status from kubectl rollout
is 0 (success):
Failed Deployment
Your Deployment may get stuck trying to deploy its newest ReplicaSet without ever completing. This can occur due to some of the following factors:
Insufficient quota
Readiness probe failures
Image pull errors
Insufficient permissions
Limit ranges
Application runtime misconfiguration
One way you can detect this condition is to specify a deadline parameter in your Deployment spec: (.spec.progressDeadlineSeconds
). .spec.progressDeadlineSeconds
denotes the number of seconds the Deployment controller waits before indicating (in the Deployment status) that the Deployment progress has stalled.
The following kubectl
command sets the spec with progressDeadlineSeconds
to make the controller report lack of progress of a rollout for a Deployment after 10 minutes:
The output is similar to this:
Once the deadline has been exceeded, the Deployment controller adds a DeploymentCondition with the following attributes to the Deployment's .status.conditions
:
type: Progressing
status: "False"
reason: ProgressDeadlineExceeded
This condition can also fail early and is then set to status value of "False"
due to reasons as ReplicaSetCreateError
. Also, the deadline is not taken into account anymore once the Deployment rollout completes.
See the Kubernetes API conventions for more information on status conditions.
Note: Kubernetes takes no action on a stalled Deployment other than to report a status condition with reason: ProgressDeadlineExceeded
. Higher level orchestrators can take advantage of it and act accordingly, for example, rollback the Deployment to its previous version.Note: If you pause a Deployment rollout, Kubernetes does not check progress against your specified deadline. You can safely pause a Deployment rollout in the middle of a rollout and resume without triggering the condition for exceeding the deadline.
You may experience transient errors with your Deployments, either due to a low timeout that you have set or due to any other kind of error that can be treated as transient. For example, let's suppose you have insufficient quota. If you describe the Deployment you will notice the following section:
The output is similar to this:
If you run kubectl get deployment nginx-deployment -o yaml
, the Deployment status is similar to this:
Eventually, once the Deployment progress deadline is exceeded, Kubernetes updates the status and the reason for the Progressing condition:
You can address an issue of insufficient quota by scaling down your Deployment, by scaling down other controllers you may be running, or by increasing quota in your namespace. If you satisfy the quota conditions and the Deployment controller then completes the Deployment rollout, you'll see the Deployment's status update with a successful condition (status: "True"
and reason: NewReplicaSetAvailable
).
type: Available
with status: "True"
means that your Deployment has minimum availability. Minimum availability is dictated by the parameters specified in the deployment strategy. type: Progressing
with status: "True"
means that your Deployment is either in the middle of a rollout and it is progressing or that it has successfully completed its progress and the minimum required new replicas are available (see the Reason of the condition for the particulars - in our case reason: NewReplicaSetAvailable
means that the Deployment is complete).
You can check if a Deployment has failed to progress by using kubectl rollout status
. kubectl rollout status
returns a non-zero exit code if the Deployment has exceeded the progression deadline.
The output is similar to this:
and the exit status from kubectl rollout
is 1 (indicating an error):
Operating on a failed deployment
All actions that apply to a complete Deployment also apply to a failed Deployment. You can scale it up/down, roll back to a previous revision, or even pause it if you need to apply multiple tweaks in the Deployment Pod template.
Clean up Policy
You can set .spec.revisionHistoryLimit
field in a Deployment to specify how many old ReplicaSets for this Deployment you want to retain. The rest will be garbage-collected in the background. By default, it is 10.
Note: Explicitly setting this field to 0, will result in cleaning up all the history of your Deployment thus that Deployment will not be able to roll back.
Canary Deployment
If you want to roll out releases to a subset of users or servers using the Deployment, you can create multiple Deployments, one for each release, following the canary pattern described in managing resources.
Writing a Deployment Spec
As with all other Kubernetes configs, a Deployment needs .apiVersion
, .kind
, and .metadata
fields. For general information about working with config files, see deploying applications, configuring containers, and using kubectl to manage resources documents.
When the control plane creates new Pods for a Deployment, the .metadata.name
of the Deployment is part of the basis for naming those Pods. The name of a Deployment must be a valid DNS subdomain value, but this can produce unexpected results for the Pod hostnames. For best compatibility, the name should follow the more restrictive rules for a DNS label.
A Deployment also needs a .spec
section.
Pod Template
The .spec.template
and .spec.selector
are the only required fields of the .spec
.
The .spec.template
is a Pod template. It has exactly the same schema as a Pod, except it is nested and does not have an apiVersion
or kind
.
In addition to required fields for a Pod, a Pod template in a Deployment must specify appropriate labels and an appropriate restart policy. For labels, make sure not to overlap with other controllers. See selector.
Only a .spec.template.spec.restartPolicy
equal to Always
is allowed, which is the default if not specified.
Replicas
.spec.replicas
is an optional field that specifies the number of desired Pods. It defaults to 1.
Should you manually scale a Deployment, example via kubectl scale deployment deployment --replicas=X
, and then you update that Deployment based on a manifest (for example: by running kubectl apply -f deployment.yaml
), then applying that manifest overwrites the manual scaling that you previously did.
If a HorizontalPodAutoscaler (or any similar API for horizontal scaling) is managing scaling for a Deployment, don't set .spec.replicas
.
Instead, allow the Kubernetes control plane to manage the .spec.replicas
field automatically.
Selector
.spec.selector
is a required field that specifies a label selector for the Pods targeted by this Deployment.
.spec.selector
must match .spec.template.metadata.labels
, or it will be rejected by the API.
In API version apps/v1
, .spec.selector
and .metadata.labels
do not default to .spec.template.metadata.labels
if not set. So they must be set explicitly. Also note that .spec.selector
is immutable after creation of the Deployment in apps/v1
.
A Deployment may terminate Pods whose labels match the selector if their template is different from .spec.template
or if the total number of such Pods exceeds .spec.replicas
. It brings up new Pods with .spec.template
if the number of Pods is less than the desired number.
Note: You should not create other Pods whose labels match this selector, either directly, by creating another Deployment, or by creating another controller such as a ReplicaSet or a ReplicationController. If you do so, the first Deployment thinks that it created these other Pods. Kubernetes does not stop you from doing this.
If you have multiple controllers that have overlapping selectors, the controllers will fight with each other and won't behave correctly.
Strategy
.spec.strategy
specifies the strategy used to replace old Pods by new ones. .spec.strategy.type
can be "Recreate" or "RollingUpdate". "RollingUpdate" is the default value.
Recreate Deployment
All existing Pods are killed before new ones are created when .spec.strategy.type==Recreate
.
Note: This will only guarantee Pod termination previous to creation for upgrades. If you upgrade a Deployment, all Pods of the old revision will be terminated immediately. Successful removal is awaited before any Pod of the new revision is created. If you manually delete a Pod, the lifecycle is controlled by the ReplicaSet and the replacement will be created immediately (even if the old Pod is still in a Terminating state). If you need an "at most" guarantee for your Pods, you should consider using a StatefulSet.
Rolling Update Deployment
The Deployment updates Pods in a rolling update fashion when .spec.strategy.type==RollingUpdate
. You can specify maxUnavailable
and maxSurge
to control the rolling update process.
Max Unavailable
.spec.strategy.rollingUpdate.maxUnavailable
is an optional field that specifies the maximum number of Pods that can be unavailable during the update process. The value can be an absolute number (for example, 5) or a percentage of desired Pods (for example, 10%). The absolute number is calculated from percentage by rounding down. The value cannot be 0 if .spec.strategy.rollingUpdate.maxSurge
is 0. The default value is 25%.
For example, when this value is set to 30%, the old ReplicaSet can be scaled down to 70% of desired Pods immediately when the rolling update starts. Once new Pods are ready, old ReplicaSet can be scaled down further, followed by scaling up the new ReplicaSet, ensuring that the total number of Pods available at all times during the update is at least 70% of the desired Pods.
Max Surge
.spec.strategy.rollingUpdate.maxSurge
is an optional field that specifies the maximum number of Pods that can be created over the desired number of Pods. The value can be an absolute number (for example, 5) or a percentage of desired Pods (for example, 10%). The value cannot be 0 if MaxUnavailable
is 0. The absolute number is calculated from the percentage by rounding up. The default value is 25%.
For example, when this value is set to 30%, the new ReplicaSet can be scaled up immediately when the rolling update starts, such that the total number of old and new Pods does not exceed 130% of desired Pods. Once old Pods have been killed, the new ReplicaSet can be scaled up further, ensuring that the total number of Pods running at any time during the update is at most 130% of desired Pods.
Progress Deadline Seconds
.spec.progressDeadlineSeconds
is an optional field that specifies the number of seconds you want to wait for your Deployment to progress before the system reports back that the Deployment has failed progressing - surfaced as a condition with type: Progressing
, status: "False"
. and reason: ProgressDeadlineExceeded
in the status of the resource. The Deployment controller will keep retrying the Deployment. This defaults to 600. In the future, once automatic rollback will be implemented, the Deployment controller will roll back a Deployment as soon as it observes such a condition.
If specified, this field needs to be greater than .spec.minReadySeconds
.
Min Ready Seconds
.spec.minReadySeconds
is an optional field that specifies the minimum number of seconds for which a newly created Pod should be ready without any of its containers crashing, for it to be considered available. This defaults to 0 (the Pod will be considered available as soon as it is ready). To learn more about when a Pod is considered ready, see Container Probes.
Revision History Limit
A Deployment's revision history is stored in the ReplicaSets it controls.
.spec.revisionHistoryLimit
is an optional field that specifies the number of old ReplicaSets to retain to allow rollback. These old ReplicaSets consume resources in etcd
and crowd the output of kubectl get rs
. The configuration of each Deployment revision is stored in its ReplicaSets; therefore, once an old ReplicaSet is deleted, you lose the ability to rollback to that revision of Deployment. By default, 10 old ReplicaSets will be kept, however its ideal value depends on the frequency and stability of new Deployments.
More specifically, setting this field to zero means that all old ReplicaSets with 0 replicas will be cleaned up. In this case, a new Deployment rollout cannot be undone, since its revision history is cleaned up.
Paused
.spec.paused
is an optional boolean field for pausing and resuming a Deployment. The only difference between a paused Deployment and one that is not paused, is that any changes into the PodTemplateSpec of the paused Deployment will not trigger new rollouts as long as it is paused. A Deployment is not paused by default when it is created.
StatefulSets
StatefulSet is the workload API object used to manage stateful applications.
Manages the deployment and scaling of a set of Pods, and provides guarantees about the ordering and uniqueness of these Pods.
Like a Deployment, a StatefulSet manages Pods that are based on an identical container spec. Unlike a Deployment, a StatefulSet maintains a sticky identity for each of its Pods. These pods are created from the same spec, but are not interchangeable: each has a persistent identifier that it maintains across any rescheduling.
If you want to use storage volumes to provide persistence for your workload, you can use a StatefulSet as part of the solution. Although individual Pods in a StatefulSet are susceptible to failure, the persistent Pod identifiers make it easier to match existing volumes to the new Pods that replace any that have failed.
Using StatefulSets
StatefulSets are valuable for applications that require one or more of the following.
Stable, unique network identifiers.
Stable, persistent storage.
Ordered, graceful deployment and scaling.
Ordered, automated rolling updates.
In the above, stable is synonymous with persistence across Pod (re)scheduling. If an application doesn't require any stable identifiers or ordered deployment, deletion, or scaling, you should deploy your application using a workload object that provides a set of stateless replicas. Deployment or ReplicaSet may be better suited to your stateless needs.
Limitations
The storage for a given Pod must either be provisioned by a PersistentVolume Provisioner based on the requested
storage class
, or pre-provisioned by an admin.Deleting and/or scaling a StatefulSet down will not delete the volumes associated with the StatefulSet. This is done to ensure data safety, which is generally more valuable than an automatic purge of all related StatefulSet resources.
StatefulSets currently require a Headless Service to be responsible for the network identity of the Pods. You are responsible for creating this Service.
StatefulSets do not provide any guarantees on the termination of pods when a StatefulSet is deleted. To achieve ordered and graceful termination of the pods in the StatefulSet, it is possible to scale the StatefulSet down to 0 prior to deletion.
When using Rolling Updates with the default Pod Management Policy (
OrderedReady
), it's possible to get into a broken state that requires manual intervention to repair.
Components
The example below demonstrates the components of a StatefulSet.
In the above example:
A Headless Service, named
nginx
, is used to control the network domain.The StatefulSet, named
web
, has a Spec that indicates that 3 replicas of the nginx container will be launched in unique Pods.The
volumeClaimTemplates
will provide stable storage using PersistentVolumes provisioned by a PersistentVolume Provisioner.
The name of a StatefulSet object must be a valid DNS label.
Pod Selector
You must set the .spec.selector
field of a StatefulSet to match the labels of its .spec.template.metadata.labels
. Failing to specify a matching Pod Selector will result in a validation error during StatefulSet creation.
Volume Claim Templates
You can set the .spec.volumeClaimTemplates
which can provide stable storage using PersistentVolumes provisioned by a PersistentVolume Provisioner.
Minimum ready seconds
FEATURE STATE: Kubernetes v1.25 [stable]
.spec.minReadySeconds
is an optional field that specifies the minimum number of seconds for which a newly created Pod should be running and ready without any of its containers crashing, for it to be considered available. This is used to check progression of a rollout when using a Rolling Update strategy. This field defaults to 0 (the Pod will be considered available as soon as it is ready). To learn more about when a Pod is considered ready, see Container Probes.
Pod Identity
StatefulSet Pods have a unique identity that consists of an ordinal, a stable network identity, and stable storage. The identity sticks to the Pod, regardless of which node it's (re)scheduled on.
Ordinal Index
For a StatefulSet with N replicas, each Pod in the StatefulSet will be assigned an integer ordinal, that is unique over the Set. By default, pods will be assigned ordinals from 0 up through N-1.
Start ordinal
FEATURE STATE: Kubernetes v1.27 [beta]
.spec.ordinals
is an optional field that allows you to configure the integer ordinals assigned to each Pod. It defaults to nil. You must enable the StatefulSetStartOrdinal
feature gate to use this field. Once enabled, you can configure the following options:
.spec.ordinals.start
: If the.spec.ordinals.start
field is set, Pods will be assigned ordinals from.spec.ordinals.start
up through.spec.ordinals.start + .spec.replicas - 1
.
Stable Network ID
Each Pod in a StatefulSet derives its hostname from the name of the StatefulSet and the ordinal of the Pod. The pattern for the constructed hostname is $(statefulset name)-$(ordinal)
. The example above will create three Pods named web-0,web-1,web-2
. A StatefulSet can use a Headless Service to control the domain of its Pods. The domain managed by this Service takes the form: $(service name).$(namespace).svc.cluster.local
, where "cluster.local" is the cluster domain. As each Pod is created, it gets a matching DNS subdomain, taking the form: $(podname).$(governing service domain)
, where the governing service is defined by the serviceName
field on the StatefulSet.
Depending on how DNS is configured in your cluster, you may not be able to look up the DNS name for a newly-run Pod immediately. This behavior can occur when other clients in the cluster have already sent queries for the hostname of the Pod before it was created. Negative caching (normal in DNS) means that the results of previous failed lookups are remembered and reused, even after the Pod is running, for at least a few seconds.
If you need to discover Pods promptly after they are created, you have a few options:
Query the Kubernetes API directly (for example, using a watch) rather than relying on DNS lookups.
Decrease the time of caching in your Kubernetes DNS provider (typically this means editing the config map for CoreDNS, which currently caches for 30 seconds).
As mentioned in the limitations section, you are responsible for creating the Headless Service responsible for the network identity of the pods.
Here are some examples of choices for Cluster Domain, Service name, StatefulSet name, and how that affects the DNS names for the StatefulSet's Pods.
Cluster Domain | Service (ns/name) | StatefulSet (ns/name) | StatefulSet Domain | Pod DNS | Pod Hostname |
---|---|---|---|---|---|
cluster.local | default/nginx | default/web | nginx.default.svc.cluster.local | web-{0..N-1}.nginx.default.svc.cluster.local | web-{0..N-1} |
cluster.local | foo/nginx | foo/web | nginx.foo.svc.cluster.local | web-{0..N-1}.nginx.foo.svc.cluster.local | web-{0..N-1} |
kube.local | foo/nginx | foo/web | nginx.foo.svc.kube.local | web-{0..N-1}.nginx.foo.svc.kube.local | web-{0..N-1} |
Note: Cluster Domain will be set to cluster.local
unless otherwise configured.
Stable Storage
For each VolumeClaimTemplate entry defined in a StatefulSet, each Pod receives one PersistentVolumeClaim. In the nginx example above, each Pod receives a single PersistentVolume with a StorageClass of my-storage-class
and 1 Gib of provisioned storage. If no StorageClass is specified, then the default StorageClass will be used. When a Pod is (re)scheduled onto a node, its volumeMounts
mount the PersistentVolumes associated with its PersistentVolume Claims. Note that, the PersistentVolumes associated with the Pods' PersistentVolume Claims are not deleted when the Pods, or StatefulSet are deleted. This must be done manually.
Pod Name Label
When the StatefulSet controller creates a Pod, it adds a label, statefulset.kubernetes.io/pod-name
, that is set to the name of the Pod. This label allows you to attach a Service to a specific Pod in the StatefulSet.
Deployment and Scaling Guarantees
For a StatefulSet with N replicas, when Pods are being deployed, they are created sequentially, in order from {0..N-1}.
When Pods are being deleted, they are terminated in reverse order, from {N-1..0}.
Before a scaling operation is applied to a Pod, all of its predecessors must be Running and Ready.
Before a Pod is terminated, all of its successors must be completely shutdown.
The StatefulSet should not specify a pod.Spec.TerminationGracePeriodSeconds
of 0. This practice is unsafe and strongly discouraged. For further explanation, please refer to force deleting StatefulSet Pods.
When the nginx example above is created, three Pods will be deployed in the order web-0, web-1, web-2. web-1 will not be deployed before web-0 is Running and Ready, and web-2 will not be deployed until web-1 is Running and Ready. If web-0 should fail, after web-1 is Running and Ready, but before web-2 is launched, web-2 will not be launched until web-0 is successfully relaunched and becomes Running and Ready.
If a user were to scale the deployed example by patching the StatefulSet such that replicas=1
, web-2 would be terminated first. web-1 would not be terminated until web-2 is fully shutdown and deleted. If web-0 were to fail after web-2 has been terminated and is completely shutdown, but prior to web-1's termination, web-1 would not be terminated until web-0 is Running and Ready.
Pod Management Policies
StatefulSet allows you to relax its ordering guarantees while preserving its uniqueness and identity guarantees via its .spec.podManagementPolicy
field.
OrderedReady Pod Management
OrderedReady
pod management is the default for StatefulSets. It implements the behavior described above.
Parallel Pod Management
Parallel
pod management tells the StatefulSet controller to launch or terminate all Pods in parallel, and to not wait for Pods to become Running and Ready or completely terminated prior to launching or terminating another Pod. This option only affects the behavior for scaling operations. Updates are not affected.
Update strategies
A StatefulSet's .spec.updateStrategy
field allows you to configure and disable automated rolling updates for containers, labels, resource request/limits, and annotations for the Pods in a StatefulSet. There are two possible values:
OnDelete
When a StatefulSet's .spec.updateStrategy.type
is set to OnDelete
, the StatefulSet controller will not automatically update the Pods in a StatefulSet. Users must manually delete Pods to cause the controller to create new Pods that reflect modifications made to a StatefulSet's .spec.template
.RollingUpdate
The RollingUpdate
update strategy implements automated, rolling updates for the Pods in a StatefulSet. This is the default update strategy.
Rolling Updates
When a StatefulSet's .spec.updateStrategy.type
is set to RollingUpdate
, the StatefulSet controller will delete and recreate each Pod in the StatefulSet. It will proceed in the same order as Pod termination (from the largest ordinal to the smallest), updating each Pod one at a time.
The Kubernetes control plane waits until an updated Pod is Running and Ready prior to updating its predecessor. If you have set .spec.minReadySeconds
(see Minimum Ready Seconds), the control plane additionally waits that amount of time after the Pod turns ready, before moving on.
Partitioned rolling updates
The RollingUpdate
update strategy can be partitioned, by specifying a .spec.updateStrategy.rollingUpdate.partition
. If a partition is specified, all Pods with an ordinal that is greater than or equal to the partition will be updated when the StatefulSet's .spec.template
is updated. All Pods with an ordinal that is less than the partition will not be updated, and, even if they are deleted, they will be recreated at the previous version. If a StatefulSet's .spec.updateStrategy.rollingUpdate.partition
is greater than its .spec.replicas
, updates to its .spec.template
will not be propagated to its Pods. In most cases you will not need to use a partition, but they are useful if you want to stage an update, roll out a canary, or perform a phased roll out.
Maximum unavailable Pods
FEATURE STATE: Kubernetes v1.24 [alpha]
You can control the maximum number of Pods that can be unavailable during an update by specifying the .spec.updateStrategy.rollingUpdate.maxUnavailable
field. The value can be an absolute number (for example, 5
) or a percentage of desired Pods (for example, 10%
). Absolute number is calculated from the percentage value by rounding it up. This field cannot be 0. The default setting is 1.
This field applies to all Pods in the range 0
to replicas - 1
. If there is any unavailable Pod in the range 0
to replicas - 1
, it will be counted towards maxUnavailable
.
Note: The maxUnavailable
field is in Alpha stage and it is honored only by API servers that are running with the MaxUnavailableStatefulSet
feature gate enabled.
Forced rollback
When using Rolling Updates with the default Pod Management Policy (OrderedReady
), it's possible to get into a broken state that requires manual intervention to repair.
If you update the Pod template to a configuration that never becomes Running and Ready (for example, due to a bad binary or application-level configuration error), StatefulSet will stop the rollout and wait.
In this state, it's not enough to revert the Pod template to a good configuration. Due to a known issue, StatefulSet will continue to wait for the broken Pod to become Ready (which never happens) before it will attempt to revert it back to the working configuration.
After reverting the template, you must also delete any Pods that StatefulSet had already attempted to run with the bad configuration. StatefulSet will then begin to recreate the Pods using the reverted template.
PersistentVolumeClaim retention
FEATURE STATE: Kubernetes v1.27 [beta]
The optional .spec.persistentVolumeClaimRetentionPolicy
field controls if and how PVCs are deleted during the lifecycle of a StatefulSet. You must enable the StatefulSetAutoDeletePVC
feature gate on the API server and the controller manager to use this field. Once enabled, there are two policies you can configure for each StatefulSet:
whenDeleted
configures the volume retention behavior that applies when the StatefulSet is deletedwhenScaled
configures the volume retention behavior that applies when the replica count of the StatefulSet is reduced; for example, when scaling down the set.
For each policy that you can configure, you can set the value to either Delete
or Retain
.
Delete
The PVCs created from the StatefulSet volumeClaimTemplate
are deleted for each Pod affected by the policy. With the whenDeleted
policy all PVCs from the volumeClaimTemplate
are deleted after their Pods have been deleted. With the whenScaled
policy, only PVCs corresponding to Pod replicas being scaled down are deleted, after their Pods have been deleted.Retain
(default)PVCs from the volumeClaimTemplate
are not affected when their Pod is deleted. This is the behavior before this new feature.
Bear in mind that these policies only apply when Pods are being removed due to the StatefulSet being deleted or scaled down. For example, if a Pod associated with a StatefulSet fails due to node failure, and the control plane creates a replacement Pod, the StatefulSet retains the existing PVC. The existing volume is unaffected, and the cluster will attach it to the node where the new Pod is about to launch.
The default for policies is Retain
, matching the StatefulSet behavior before this new feature.
Here is an example policy.
The StatefulSet controller adds owner references to its PVCs, which are then deleted by the garbage collector after the Pod is terminated. This enables the Pod to cleanly unmount all volumes before the PVCs are deleted (and before the backing PV and volume are deleted, depending on the retain policy). When you set the whenDeleted
policy to Delete
, an owner reference to the StatefulSet instance is placed on all PVCs associated with that StatefulSet.
The whenScaled
policy must delete PVCs only when a Pod is scaled down, and not when a Pod is deleted for another reason. When reconciling, the StatefulSet controller compares its desired replica count to the actual Pods present on the cluster. Any StatefulSet Pod whose id greater than the replica count is condemned and marked for deletion. If the whenScaled
policy is Delete
, the condemned Pods are first set as owners to the associated StatefulSet template PVCs, before the Pod is deleted. This causes the PVCs to be garbage collected after only the condemned Pods have terminated.
This means that if the controller crashes and restarts, no Pod will be deleted before its owner reference has been updated appropriate to the policy. If a condemned Pod is force-deleted while the controller is down, the owner reference may or may not have been set up, depending on when the controller crashed. It may take several reconcile loops to update the owner references, so some condemned Pods may have set up owner references and others may not. For this reason we recommend waiting for the controller to come back up, which will verify owner references before terminating Pods. If that is not possible, the operator should verify the owner references on PVCs to ensure the expected objects are deleted when Pods are force-deleted.
Replicas
.spec.replicas
is an optional field that specifies the number of desired Pods. It defaults to 1.
Should you manually scale a deployment, example via kubectl scale statefulset statefulset --replicas=X
, and then you update that StatefulSet based on a manifest (for example: by running kubectl apply -f statefulset.yaml
), then applying that manifest overwrites the manual scaling that you previously did.
If a HorizontalPodAutoscaler (or any similar API for horizontal scaling) is managing scaling for a Statefulset, don't set .spec.replicas
. Instead, allow the Kubernetes control plane to manage the .spec.replicas
field automatically.
DaemonSet
A DaemonSet ensures that all (or some) Nodes run a copy of a Pod. As nodes are added to the cluster, Pods are added to them. As nodes are removed from the cluster, those Pods are garbage collected. Deleting a DaemonSet will clean up the Pods it created.
Some typical uses of a DaemonSet are:
running a cluster storage daemon on every node
running a logs collection daemon on every node
running a node monitoring daemon on every node
In a simple case, one DaemonSet, covering all nodes, would be used for each type of daemon. A more complex setup might use multiple DaemonSets for a single type of daemon, but with different flags and/or different memory and cpu requests for different hardware types.
Writing a DaemonSet Spec
Create a DaemonSet
You can describe a DaemonSet in a YAML file. For example, the daemonset.yaml
file below describes a DaemonSet that runs the fluentd-elasticsearch Docker image:
Create a DaemonSet based on the YAML file:
Required Fields
As with all other Kubernetes config, a DaemonSet needs apiVersion
, kind
, and metadata
fields. For general information about working with config files, see running stateless applications and object management using kubectl.
The name of a DaemonSet object must be a valid DNS subdomain name.
A DaemonSet also needs a .spec
section.
Pod Template
The .spec.template
is one of the required fields in .spec
.
The .spec.template
is a pod template. It has exactly the same schema as a Pod, except it is nested and does not have an apiVersion
or kind
.
In addition to required fields for a Pod, a Pod template in a DaemonSet has to specify appropriate labels (see pod selector).
A Pod Template in a DaemonSet must have a RestartPolicy
equal to Always
, or be unspecified, which defaults to Always
.
Pod Selector
The .spec.selector
field is a pod selector. It works the same as the .spec.selector
of a Job.
You must specify a pod selector that matches the labels of the .spec.template
. Also, once a DaemonSet is created, its .spec.selector
can not be mutated. Mutating the pod selector can lead to the unintentional orphaning of Pods, and it was found to be confusing to users.
The .spec.selector
is an object consisting of two fields:
matchLabels
- works the same as the.spec.selector
of a ReplicationController.matchExpressions
- allows to build more sophisticated selectors by specifying key, list of values and an operator that relates the key and values.
When the two are specified the result is ANDed.
The .spec.selector
must match the .spec.template.metadata.labels
. Config with these two not matching will be rejected by the API.
Running Pods on select Nodes
If you specify a .spec.template.spec.nodeSelector
, then the DaemonSet controller will create Pods on nodes which match that node selector. Likewise if you specify a .spec.template.spec.affinity
, then DaemonSet controller will create Pods on nodes which match that node affinity. If you do not specify either, then the DaemonSet controller will create Pods on all nodes.
How Daemon Pods are scheduled
A DaemonSet ensures that all eligible nodes run a copy of a Pod. The DaemonSet controller creates a Pod for each eligible node and adds the spec.affinity.nodeAffinity
field of the Pod to match the target host. After the Pod is created, the default scheduler typically takes over and then binds the Pod to the target host by setting the .spec.nodeName
field. If the new Pod cannot fit on the node, the default scheduler may preempt (evict) some of the existing Pods based on the priority of the new Pod.
The user can specify a different scheduler for the Pods of the DaemonSet, by setting the .spec.template.spec.schedulerName
field of the DaemonSet.
The original node affinity specified at the .spec.template.spec.affinity.nodeAffinity
field (if specified) is taken into consideration by the DaemonSet controller when evaluating the eligible nodes, but is replaced on the created Pod with the node affinity that matches the name of the eligible node.
Taints and tolerations
The DaemonSet controller automatically adds a set of tolerations to DaemonSet Pods:
Toleration key | Effect | Details |
---|---|---|
| DaemonSet Pods can be scheduled onto nodes that are not healthy or ready to accept Pods. Any DaemonSet Pods running on such nodes will not be evicted. | |
| DaemonSet Pods can be scheduled onto nodes that are unreachable from the node controller. Any DaemonSet Pods running on such nodes will not be evicted. | |
| DaemonSet Pods can be scheduled onto nodes with disk pressure issues. | |
| DaemonSet Pods can be scheduled onto nodes with memory pressure issues. | |
| DaemonSet Pods can be scheduled onto nodes with process pressure issues. | |
| DaemonSet Pods can be scheduled onto nodes that are unschedulable. | |
| Only added for DaemonSet Pods that request host networking, i.e., Pods having |
You can add your own tolerations to the Pods of a DaemonSet as well, by defining these in the Pod template of the DaemonSet.
Because the DaemonSet controller sets the node.kubernetes.io/unschedulable:NoSchedule
toleration automatically, Kubernetes can run DaemonSet Pods on nodes that are marked as unschedulable.
If you use a DaemonSet to provide an important node-level function, such as cluster networking, it is helpful that Kubernetes places DaemonSet Pods on nodes before they are ready. For example, without that special toleration, you could end up in a deadlock situation where the node is not marked as ready because the network plugin is not running there, and at the same time the network plugin is not running on that node because the node is not yet ready.
Communicating with Daemon Pods
Some possible patterns for communicating with Pods in a DaemonSet are:
Push: Pods in the DaemonSet are configured to send updates to another service, such as a stats database. They do not have clients.
NodeIP and Known Port: Pods in the DaemonSet can use a
hostPort
, so that the pods are reachable via the node IPs. Clients know the list of node IPs somehow, and know the port by convention.DNS: Create a headless service with the same pod selector, and then discover DaemonSets using the
endpoints
resource or retrieve multiple A records from DNS.Service: Create a service with the same Pod selector, and use the service to reach a daemon on a random node. (No way to reach specific node.)
Updating a DaemonSet
If node labels are changed, the DaemonSet will promptly add Pods to newly matching nodes and delete Pods from newly not-matching nodes.
You can modify the Pods that a DaemonSet creates. However, Pods do not allow all fields to be updated. Also, the DaemonSet controller will use the original template the next time a node (even with the same name) is created.
You can delete a DaemonSet. If you specify --cascade=orphan
with kubectl
, then the Pods will be left on the nodes. If you subsequently create a new DaemonSet with the same selector, the new DaemonSet adopts the existing Pods. If any Pods need replacing the DaemonSet replaces them according to its updateStrategy
.
You can perform a rolling update on a DaemonSet.
Jobs
A Job creates one or more Pods and will continue to retry execution of the Pods until a specified number of them successfully terminate. As pods successfully complete, the Job tracks the successful completions. When a specified number of successful completions is reached, the task (ie, Job) is complete. Deleting a Job will clean up the Pods it created. Suspending a Job will delete its active Pods until the Job is resumed again.
A simple case is to create one Job object in order to reliably run one Pod to completion. The Job object will start a new Pod if the first Pod fails or is deleted (for example due to a node hardware failure or a node reboot).
You can also use a Job to run multiple Pods in parallel.
If you want to run a Job (either a single task, or several in parallel) on a schedule, see CronJob.
Running an example Job
Here is an example Job config. It computes π to 2000 places and prints it out. It takes around 10s to complete.
You can run the example with this command:
The output is similar to this:
Check on the status of the Job with kubectl
:
To view completed Pods of a Job, use kubectl get pods
.
To list all the Pods that belong to a Job in a machine readable form, you can use a command like this:
The output is similar to this:
Here, the selector is the same as the selector for the Job. The --output=jsonpath
option specifies an expression with the name from each Pod in the returned list.
View the standard output of one of the pods:
Another way to view the logs of a Job:
The output is similar to this:
Writing a Job spec
As with all other Kubernetes config, a Job needs apiVersion
, kind
, and metadata
fields.
When the control plane creates new Pods for a Job, the .metadata.name
of the Job is part of the basis for naming those Pods. The name of a Job must be a valid DNS subdomain value, but this can produce unexpected results for the Pod hostnames. For best compatibility, the name should follow the more restrictive rules for a DNS label. Even when the name is a DNS subdomain, the name must be no longer than 63 characters.
A Job also needs a .spec
section.
Job Labels
Job labels will have batch.kubernetes.io/
prefix for job-name
and controller-uid
.
Pod Template
The .spec.template
is the only required field of the .spec
.
The .spec.template
is a pod template. It has exactly the same schema as a Pod, except it is nested and does not have an apiVersion
or kind
.
In addition to required fields for a Pod, a pod template in a Job must specify appropriate labels (see pod selector) and an appropriate restart policy.
Only a RestartPolicy
equal to Never
or OnFailure
is allowed.
Pod selector
The .spec.selector
field is optional. In almost all cases you should not specify it. See section specifying your own pod selector.
Parallel execution for Jobs
There are three main types of task suitable to run as a Job:
Non-parallel Jobs
normally, only one Pod is started, unless the Pod fails.
the Job is complete as soon as its Pod terminates successfully.
Parallel Jobs with a fixed completion count:
specify a non-zero positive value for
.spec.completions
.the Job represents the overall task, and is complete when there are
.spec.completions
successful Pods.when using
.spec.completionMode="Indexed"
, each Pod gets a different index in the range 0 to.spec.completions-1
.
Parallel Jobs with a work queue:
do not specify
.spec.completions
, default to.spec.parallelism
.the Pods must coordinate amongst themselves or an external service to determine what each should work on. For example, a Pod might fetch a batch of up to N items from the work queue.
each Pod is independently capable of determining whether or not all its peers are done, and thus that the entire Job is done.
when any Pod from the Job terminates with success, no new Pods are created.
once at least one Pod has terminated with success and all Pods are terminated, then the Job is completed with success.
once any Pod has exited with success, no other Pod should still be doing any work for this task or writing any output. They should all be in the process of exiting.
For a non-parallel Job, you can leave both .spec.completions
and .spec.parallelism
unset. When both are unset, both are defaulted to 1.
For a fixed completion count Job, you should set .spec.completions
to the number of completions needed. You can set .spec.parallelism
, or leave it unset and it will default to 1.
For a work queue Job, you must leave .spec.completions
unset, and set .spec.parallelism
to a non-negative integer.
For more information about how to make use of the different types of job, see the job patterns section.
Controlling parallelism
The requested parallelism (.spec.parallelism
) can be set to any non-negative value. If it is unspecified, it defaults to 1. If it is specified as 0, then the Job is effectively paused until it is increased.
Actual parallelism (number of pods running at any instant) may be more or less than requested parallelism, for a variety of reasons:
For fixed completion count Jobs, the actual number of pods running in parallel will not exceed the number of remaining completions. Higher values of
.spec.parallelism
are effectively ignored.For work queue Jobs, no new Pods are started after any Pod has succeeded -- remaining Pods are allowed to complete, however.
If the Job Controller has not had time to react.
If the Job controller failed to create Pods for any reason (lack of
ResourceQuota
, lack of permission, etc.), then there may be fewer pods than requested.The Job controller may throttle new Pod creation due to excessive previous pod failures in the same Job.
When a Pod is gracefully shut down, it takes time to stop.
Completion mode
FEATURE STATE: Kubernetes v1.24 [stable]
Jobs with fixed completion count - that is, jobs that have non null .spec.completions
- can have a completion mode that is specified in .spec.completionMode
:
NonIndexed
(default): the Job is considered complete when there have been.spec.completions
successfully completed Pods. In other words, each Pod completion is homologous to each other. Note that Jobs that have null.spec.completions
are implicitlyNonIndexed
.Indexed
: the Pods of a Job get an associated completion index from 0 to.spec.completions-1
. The index is available through three mechanisms:The Pod annotation
batch.kubernetes.io/job-completion-index
.As part of the Pod hostname, following the pattern
$(job-name)-$(index)
. When you use an Indexed Job in combination with a Service, Pods within the Job can use the deterministic hostnames to address each other via DNS. For more information about how to configure this, see Job with Pod-to-Pod Communication.From the containerized task, in the environment variable
JOB_COMPLETION_INDEX
.
The Job is considered complete when there is one successfully completed Pod for each index. For more information about how to use this mode, see Indexed Job for Parallel Processing with Static Work Assignment.
Note: Although rare, more than one Pod could be started for the same index (due to various reasons such as node failures, kubelet restarts, or Pod evictions). In this case, only the first Pod that completes successfully will count towards the completion count and update the status of the Job. The other Pods that are running or completed for the same index will be deleted by the Job controller once they are detected.
Handling Pod and container failures
A container in a Pod may fail for a number of reasons, such as because the process in it exited with a non-zero exit code, or the container was killed for exceeding a memory limit, etc. If this happens, and the .spec.template.spec.restartPolicy = "OnFailure"
, then the Pod stays on the node, but the container is re-run. Therefore, your program needs to handle the case when it is restarted locally, or else specify .spec.template.spec.restartPolicy = "Never"
. See pod lifecycle for more information on restartPolicy
.
An entire Pod can also fail, for a number of reasons, such as when the pod is kicked off the node (node is upgraded, rebooted, deleted, etc.), or if a container of the Pod fails and the .spec.template.spec.restartPolicy = "Never"
. When a Pod fails, then the Job controller starts a new Pod. This means that your application needs to handle the case when it is restarted in a new pod. In particular, it needs to handle temporary files, locks, incomplete output and the like caused by previous runs.
By default, each pod failure is counted towards the .spec.backoffLimit
limit, see pod backoff failure policy. However, you can customize handling of pod failures by setting the Job's pod failure policy.
Note that even if you specify .spec.parallelism = 1
and .spec.completions = 1
and .spec.template.spec.restartPolicy = "Never"
, the same program may sometimes be started twice.
If you do specify .spec.parallelism
and .spec.completions
both greater than 1, then there may be multiple pods running at once. Therefore, your pods must also be tolerant of concurrency.
When the feature gates PodDisruptionConditions
and JobPodFailurePolicy
are both enabled, and the .spec.podFailurePolicy
field is set, the Job controller does not consider a terminating Pod (a pod that has a .metadata.deletionTimestamp
field set) as a failure until that Pod is terminal (its .status.phase
is Failed
or Succeeded
). However, the Job controller creates a replacement Pod as soon as the termination becomes apparent. Once the pod terminates, the Job controller evaluates .backoffLimit
and .podFailurePolicy
for the relevant Job, taking this now-terminated Pod into consideration.
If either of these requirements is not satisfied, the Job controller counts a terminating Pod as an immediate failure, even if that Pod later terminates with phase: "Succeeded"
.
Pod backoff failure policy
There are situations where you want to fail a Job after some amount of retries due to a logical error in configuration etc. To do so, set .spec.backoffLimit
to specify the number of retries before considering a Job as failed. The back-off limit is set by default to 6. Failed Pods associated with the Job are recreated by the Job controller with an exponential back-off delay (10s, 20s, 40s ...) capped at six minutes.
The number of retries is calculated in two ways:
The number of Pods with
.status.phase = "Failed"
.When using
restartPolicy = "OnFailure"
, the number of retries in all the containers of Pods with.status.phase
equal toPending
orRunning
.
If either of the calculations reaches the .spec.backoffLimit
, the Job is considered failed.
Note: If your job has restartPolicy = "OnFailure"
, keep in mind that your Pod running the Job will be terminated once the job backoff limit has been reached. This can make debugging the Job's executable more difficult. We suggest setting restartPolicy = "Never"
when debugging the Job or using a logging system to ensure output from failed Jobs is not lost inadvertently.
Pod failure policy
FEATURE STATE: Kubernetes v1.26 [beta]
Note: You can only configure a Pod failure policy for a Job if you have the JobPodFailurePolicy
feature gate enabled in your cluster. Additionally, it is recommended to enable the PodDisruptionConditions
feature gate in order to be able to detect and handle Pod disruption conditions in the Pod failure policy (see also: Pod disruption conditions). Both feature gates are available in Kubernetes 1.27.
A Pod failure policy, defined with the .spec.podFailurePolicy
field, enables your cluster to handle Pod failures based on the container exit codes and the Pod conditions.
In some situations, you may want to have a better control when handling Pod failures than the control provided by the Pod backoff failure policy, which is based on the Job's .spec.backoffLimit
. These are some examples of use cases:
To optimize costs of running workloads by avoiding unnecessary Pod restarts, you can terminate a Job as soon as one of its Pods fails with an exit code indicating a software bug.
To guarantee that your Job finishes even if there are disruptions, you can ignore Pod failures caused by disruptions (such preemption, API-initiated eviction or taint-based eviction) so that they don't count towards the
.spec.backoffLimit
limit of retries.
You can configure a Pod failure policy, in the .spec.podFailurePolicy
field, to meet the above use cases. This policy can handle Pod failures based on the container exit codes and the Pod conditions.
Here is a manifest for a Job that defines a podFailurePolicy
:
/controllers/job-pod-failure-policy-example.yaml
In the example above, the first rule of the Pod failure policy specifies that the Job should be marked failed if the main
container fails with the 42 exit code. The following are the rules for the main
container specifically:
an exit code of 0 means that the container succeeded
an exit code of 42 means that the entire Job failed
any other exit code represents that the container failed, and hence the entire Pod. The Pod will be re-created if the total number of restarts is below
backoffLimit
. If thebackoffLimit
is reached the entire Job failed.
Note: Because the Pod template specifies a restartPolicy: Never
, the kubelet does not restart the main
container in that particular Pod.
The second rule of the Pod failure policy, specifying the Ignore
action for failed Pods with condition DisruptionTarget
excludes Pod disruptions from being counted towards the .spec.backoffLimit
limit of retries.
Note: If the Job failed, either by the Pod failure policy or Pod backoff failure policy, and the Job is running multiple Pods, Kubernetes terminates all the Pods in that Job that are still Pending or Running.
These are some requirements and semantics of the API:
if you want to use a
.spec.podFailurePolicy
field for a Job, you must also define that Job's pod template with.spec.restartPolicy
set toNever
.the Pod failure policy rules you specify under
spec.podFailurePolicy.rules
are evaluated in order. Once a rule matches a Pod failure, the remaining rules are ignored. When no rule matches the Pod failure, the default handling applies.you may want to restrict a rule to a specific container by specifying its name in
spec.podFailurePolicy.rules[*].containerName
. When not specified the rule applies to all containers. When specified, it should match one the container orinitContainer
names in the Pod template.you may specify the action taken when a Pod failure policy is matched by
spec.podFailurePolicy.rules[*].action
. Possible values are:FailJob
: use to indicate that the Pod's job should be marked as Failed and all running Pods should be terminated.Ignore
: use to indicate that the counter towards the.spec.backoffLimit
should not be incremented and a replacement Pod should be created.Count
: use to indicate that the Pod should be handled in the default way. The counter towards the.spec.backoffLimit
should be incremented.
Note: When you use a podFailurePolicy
, the job controller only matches Pods in the Failed
phase. Pods with a deletion timestamp that are not in a terminal phase (Failed
or Succeeded
) are considered still terminating. This implies that terminating pods retain a tracking finalizer until they reach a terminal phase. Since Kubernetes 1.27, Kubelet transitions deleted pods to a terminal phase (see: Pod Phase). This ensures that deleted pods have their finalizers removed by the Job controller.
Job termination and cleanup
When a Job completes, no more Pods are created, but the Pods are usually not deleted either. Keeping them around allows you to still view the logs of completed pods to check for errors, warnings, or other diagnostic output. The job object also remains after it is completed so that you can view its status. It is up to the user to delete old jobs after noting their status. Delete the job with kubectl
(e.g. kubectl delete jobs/pi
or kubectl delete -f ./job.yaml
). When you delete the job using kubectl
, all the pods it created are deleted too.
By default, a Job will run uninterrupted unless a Pod fails (restartPolicy=Never
) or a Container exits in error (restartPolicy=OnFailure
), at which point the Job defers to the .spec.backoffLimit
described above. Once .spec.backoffLimit
has been reached the Job will be marked as failed and any running Pods will be terminated.
Another way to terminate a Job is by setting an active deadline. Do this by setting the .spec.activeDeadlineSeconds
field of the Job to a number of seconds. The activeDeadlineSeconds
applies to the duration of the job, no matter how many Pods are created. Once a Job reaches activeDeadlineSeconds
, all of its running Pods are terminated and the Job status will become type: Failed
with reason: DeadlineExceeded
.
Note that a Job's .spec.activeDeadlineSeconds
takes precedence over its .spec.backoffLimit
. Therefore, a Job that is retrying one or more failed Pods will not deploy additional Pods once it reaches the time limit specified by activeDeadlineSeconds
, even if the backoffLimit
is not yet reached.
Example:
Note that both the Job spec and the Pod template spec within the Job have an activeDeadlineSeconds
field. Ensure that you set this field at the proper level.
Keep in mind that the restartPolicy
applies to the Pod, and not to the Job itself: there is no automatic Job restart once the Job status is type: Failed
. That is, the Job termination mechanisms activated with .spec.activeDeadlineSeconds
and .spec.backoffLimit
result in a permanent Job failure that requires manual intervention to resolve.
Clean up finished jobs automatically
Finished Jobs are usually no longer needed in the system. Keeping them around in the system will put pressure on the API server. If the Jobs are managed directly by a higher level controller, such as CronJobs, the Jobs can be cleaned up by CronJobs based on the specified capacity-based cleanup policy.
TTL mechanism for finished Jobs
FEATURE STATE: Kubernetes v1.23 [stable]
Another way to clean up finished Jobs (either Complete
or Failed
) automatically is to use a TTL mechanism provided by a TTL controller for finished resources, by specifying the .spec.ttlSecondsAfterFinished
field of the Job.
When the TTL controller cleans up the Job, it will delete the Job cascadingly, i.e. delete its dependent objects, such as Pods, together with the Job. Note that when the Job is deleted, its lifecycle guarantees, such as finalizers, will be honored.
For example:
The Job pi-with-ttl
will be eligible to be automatically deleted, 100
seconds after it finishes.
If the field is set to 0
, the Job will be eligible to be automatically deleted immediately after it finishes. If the field is unset, this Job won't be cleaned up by the TTL controller after it finishes.
Note:
It is recommended to set ttlSecondsAfterFinished
field because unmanaged jobs (Jobs that you created directly, and not indirectly through other workload APIs such as CronJob) have a default deletion policy of orphanDependents
causing Pods created by an unmanaged Job to be left around after that Job is fully deleted. Even though the control plane eventually garbage collects the Pods from a deleted Job after they either fail or complete, sometimes those lingering pods may cause cluster performance degradation or in worst case cause the cluster to go offline due to this degradation.
You can use LimitRanges and ResourceQuotas to place a cap on the amount of resources that a particular namespace can consume.
Job patterns
The Job object can be used to support reliable parallel execution of Pods. The Job object is not designed to support closely-communicating parallel processes, as commonly found in scientific computing. It does support parallel processing of a set of independent but related work items. These might be emails to be sent, frames to be rendered, files to be transcoded, ranges of keys in a NoSQL database to scan, and so on.
In a complex system, there may be multiple different sets of work items. Here we are just considering one set of work items that the user wants to manage together — a batch job.
There are several different patterns for parallel computation, each with strengths and weaknesses. The tradeoffs are:
One Job object for each work item, vs. a single Job object for all work items. The latter is better for large numbers of work items. The former creates some overhead for the user and for the system to manage large numbers of Job objects.
Number of pods created equals number of work items, vs. each Pod can process multiple work items. The former typically requires less modification to existing code and containers. The latter is better for large numbers of work items, for similar reasons to the previous bullet.
Several approaches use a work queue. This requires running a queue service, and modifications to the existing program or container to make it use the work queue. Other approaches are easier to adapt to an existing containerised application.
The tradeoffs are summarized here, with columns 2 to 4 corresponding to the above tradeoffs. The pattern names are also links to examples and more detailed description.
Pattern | Single Job object | Fewer pods than work items? | Use app unmodified? |
---|---|---|---|
✓ | sometimes | ||
✓ | ✓ | ||
✓ | ✓ | ||
✓ | |||
✓ | sometimes | sometimes |
When you specify completions with .spec.completions
, each Pod created by the Job controller has an identical spec
. This means that all pods for a task will have the same command line and the same image, the same volumes, and (almost) the same environment variables. These patterns are different ways to arrange for pods to work on different things.
This table shows the required settings for .spec.parallelism
and .spec.completions
for each of the patterns. Here, W
is the number of work items.
Pattern |
|
|
W | any | |
null | any | |
W | any | |
1 | should be 1 | |
W | W |
Advanced usage
Suspending a Job
FEATURE STATE: Kubernetes v1.24 [stable]
When a Job is created, the Job controller will immediately begin creating Pods to satisfy the Job's requirements and will continue to do so until the Job is complete. However, you may want to temporarily suspend a Job's execution and resume it later, or start Jobs in suspended state and have a custom controller decide later when to start them.
To suspend a Job, you can update the .spec.suspend
field of the Job to true; later, when you want to resume it again, update it to false. Creating a Job with .spec.suspend
set to true will create it in the suspended state.
When a Job is resumed from suspension, its .status.startTime
field will be reset to the current time. This means that the .spec.activeDeadlineSeconds
timer will be stopped and reset when a Job is suspended and resumed.
When you suspend a Job, any running Pods that don't have a status of Completed
will be terminated. with a SIGTERM signal. The Pod's graceful termination period will be honored and your Pod must handle this signal in this period. This may involve saving progress for later or undoing changes. Pods terminated this way will not count towards the Job's completions
count.
An example Job definition in the suspended state can be like so:
You can also toggle Job suspension by patching the Job using the command line.
Suspend an active Job:
Resume a suspended Job:
The Job's status can be used to determine if a Job is suspended or has been suspended in the past:
The Job condition of type "Suspended" with status "True" means the Job is suspended; the lastTransitionTime
field can be used to determine how long the Job has been suspended for. If the status of that condition is "False", then the Job was previously suspended and is now running. If such a condition does not exist in the Job's status, the Job has never been stopped.
Events are also created when the Job is suspended and resumed:
The last four events, particularly the "Suspended" and "Resumed" events, are directly a result of toggling the .spec.suspend
field. In the time between these two events, we see that no Pods were created, but Pod creation restarted as soon as the Job was resumed.
Mutable Scheduling Directives
FEATURE STATE: Kubernetes v1.27 [stable]
In most cases a parallel job will want the pods to run with constraints, like all in the same zone, or all either on GPU model x or y but not a mix of both.
The suspend field is the first step towards achieving those semantics. Suspend allows a custom queue controller to decide when a job should start; However, once a job is unsuspended, a custom queue controller has no influence on where the pods of a job will actually land.
This feature allows updating a Job's scheduling directives before it starts, which gives custom queue controllers the ability to influence pod placement while at the same time offloading actual pod-to-node assignment to kube-scheduler. This is allowed only for suspended Jobs that have never been unsuspended before.
The fields in a Job's pod template that can be updated are node affinity, node selector, tolerations, labels, annotations and scheduling gates.
Specifying your own Pod selector
Normally, when you create a Job object, you do not specify .spec.selector
. The system defaulting logic adds this field when the Job is created. It picks a selector value that will not overlap with any other jobs.
However, in some cases, you might need to override this automatically set selector. To do this, you can specify the .spec.selector
of the Job.
Be very careful when doing this. If you specify a label selector which is not unique to the pods of that Job, and which matches unrelated Pods, then pods of the unrelated job may be deleted, or this Job may count other Pods as completing it, or one or both Jobs may refuse to create Pods or run to completion. If a non-unique selector is chosen, then other controllers (e.g. ReplicationController) and their Pods may behave in unpredictable ways too. Kubernetes will not stop you from making a mistake when specifying .spec.selector
.
Here is an example of a case when you might want to use this feature.
Say Job old
is already running. You want existing Pods to keep running, but you want the rest of the Pods it creates to use a different pod template and for the Job to have a new name. You cannot update the Job because these fields are not updatable. Therefore, you delete Job old
but leave its pods running, using kubectl delete jobs/old --cascade=orphan
. Before deleting it, you make a note of what selector it uses:
The output is similar to this:
Then you create a new Job with name new
and you explicitly specify the same selector. Since the existing Pods have label batch.kubernetes.io/controller-uid=a8f3d00d-c6d2-11e5-9f87-42010af00002
, they are controlled by Job new
as well.
You need to specify manualSelector: true
in the new Job since you are not using the selector that the system normally generates for you automatically.
The new Job itself will have a different uid from a8f3d00d-c6d2-11e5-9f87-42010af00002
. Setting manualSelector: true
tells the system that you know what you are doing and to allow this mismatch.
Job tracking with finalizers
FEATURE STATE: Kubernetes v1.26 [stable]
Note: The control plane doesn't track Jobs using finalizers, if the Jobs were created when the feature gate JobTrackingWithFinalizers
was disabled, even after you upgrade the control plane to 1.26.
The control plane keeps track of the Pods that belong to any Job and notices if any such Pod is removed from the API server. To do that, the Job controller creates Pods with the finalizer batch.kubernetes.io/job-tracking
. The controller removes the finalizer only after the Pod has been accounted for in the Job status, allowing the Pod to be removed by other controllers or users.
Jobs created before upgrading to Kubernetes 1.26 or before the feature gate JobTrackingWithFinalizers
is enabled are tracked without the use of Pod finalizers. The Job controller updates the status counters for succeeded
and failed
Pods based only on the Pods that exist in the cluster. The contol plane can lose track of the progress of the Job if Pods are deleted from the cluster.
You can determine if the control plane is tracking a Job using Pod finalizers by checking if the Job has the annotation batch.kubernetes.io/job-tracking
. You should not manually add or remove this annotation from Jobs. Instead, you can recreate the Jobs to ensure they are tracked using Pod finalizers.
Elastic Indexed Jobs
FEATURE STATE: Kubernetes v1.27 [beta]
You can scale Indexed Jobs up or down by mutating both .spec.parallelism
and .spec.completions
together such that .spec.parallelism == .spec.completions
. When the ElasticIndexedJob
feature gate on the API server is disabled, .spec.completions
is immutable.
Use cases for elastic Indexed Jobs include batch workloads which require scaling an indexed Job, such as MPI, Horovord, Ray, and PyTorch training jobs.
Last updated