Kubernetes Session 2

What is a Kubernetes Namespace?

A Kubernetes namespace helps separate a cluster into logical units. It helps granularly organize, allocate, manage, and secure cluster resources. Here are two notable use cases for Kubernetes namespaces:

  • Apply policies to cluster segments—Kubernetes namespaces let you apply policies to different parts of a cluster. For example, you can define resource policies to limit resource consumption. You can also use container network interfaces (CNIs) to apply network policies that define how pods in each namespace are allowed to communicate.

  • Apply access controls—namespaces let you apply role-based access control (RBAC). You can define a Role that grants permissions within a namespace and a RoleBinding that grants those permissions to specific users, groups, or service accounts in that namespace (see the sketch below). Using this technique can help you improve the security of your cluster.
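
A minimal sketch of this pattern, assuming a namespace named mynamespace and a hypothetical user jane:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: mynamespace
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: mynamespace
subjects:
- kind: User
  name: jane
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io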

In a new cluster, Kubernetes automatically creates the following namespaces: default (for user workloads) and three namespaces for the Kubernetes control plane: kube-node-lease, kube-public, and kube-system. Kubernetes also allows admins to manually create custom namespaces.



Kubernetes Namespaces Concepts

There are two types of Kubernetes namespaces: Kubernetes system namespaces and custom namespaces.

Default Kubernetes namespaces

Here are the four default namespaces Kubernetes creates automatically:

  • default—a default space for objects that do not have a specified namespace.

  • kube-system—a default space for Kubernetes system objects, such as kube-dns and kube-proxy, and add-ons providing cluster-level features, such as web UI dashboards, ingresses, and cluster-level logging.

  • kube-public—a default space for resources available to all users without authentication.

  • kube-node-lease—a default space for the Lease objects associated with each node, which the kubelet uses to send heartbeats so the control plane can detect node failures.

Custom Kubernetes namespaces

Admins can create as many Kubernetes namespaces as necessary to isolate workloads or resources and limit access to specific users. Here is how to create a namespace using kubectl:

kubectl create ns mynamespace


The Hierarchical Namespace Controller (HNC)

Hierarchical namespaces are an extension to the Kubernetes namespaces mechanism, which allows you to organize groups of namespaces that have a common owner. For example, when a cluster is shared by multiple teams, each team can have a group of namespaces that belong to them.

With hierarchical namespaces, you can create a team namespace, and under it namespaces for specific workloads. You don’t need cluster-level permission to create a namespace within your team namespace, and you also have the flexibility to apply different RBAC rules and network security groups at each level of the namespace hierarchy.
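
As a rough sketch, assuming the HNC controller and its kubectl plugin are installed, a team could create a child namespace under its own namespace and inspect the hierarchy like this (the names are placeholders):

kubectl hns create team-a-app1 -n team-a
kubectl hns tree team-a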

When Should You Use Multiple Kubernetes Namespaces?

In small projects or teams, where there is no need to isolate workloads or users from each other, it can be reasonable to use the default Kubernetes namespace. Consider using multiple namespaces for the following reasons:

  • Isolation—if you have a large team, or several teams working on the same cluster, you can use namespaces to create separation between projects and microservices. Resources in one namespace are kept logically separate from other namespaces, although namespaces on their own do not provide hard isolation of compute or network resources.

  • Development stages—if you use the same Kubernetes cluster for multiple stages of the development lifecycle, it is a good idea to separate development, testing, and production environments. You do not want errors or instability in testing environments to affect production users. Ideally, you should use a separate cluster for each environment, but if this is not possible, namespaces can create this separation.

  • Permissions—it might be necessary to define separate permissions for different resources in your cluster. You can define separate RBAC rules for each namespace, ensuring that only authorized roles can access the resources in the namespace. This is especially important for mission critical applications, and to protect sensitive data in production deployments.

  • Resource control—you can define resource quotas and limits at the namespace level, ensuring each namespace has access to a certain amount of CPU and memory. This lets you divide cluster resources among several projects, ensure each project has the resources it needs, and leave sufficient resources for other projects (see the ResourceQuota sketch after this list).

  • Performance—the Kubernetes API can provide better performance when a cluster is divided into namespaces, because the API server has fewer objects to search through when you perform operations scoped to a single namespace.
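
A minimal sketch of the resource-control use case above: the following ResourceQuota caps the total CPU and memory that workloads in mynamespace can request (the limits shown are placeholders):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: mynamespace
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi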

Quick Tutorial: Working with Kubernetes Namespaces

Let’s see how to perform basic namespace operations—creating a namespace, viewing existing namespaces, and creating a pod in a namespace.

Creating Namespaces

You can create a namespace with a simple kubectl command, like this:

kubectl create namespace mynamespace

This will create a namespace called mynamespace, with default configuration. If you want more control over the namespace, you can create a YAML file and apply it. Here is an example of a namespace YAML file:

kind: Namespace
apiVersion: v1
metadata:
  name: mynamespace
  labels:
    name: mynamespace

Apply the file (saved here as test.yaml) to create the namespace:

kubectl apply -f test.yaml

Viewing Namespaces

To list all the namespaces currently active in your cluster, run this command:

kubectl get namespace

The output will look something like this:

NAME             STATUS   AGE
default          Active   3d
kube-node-lease  Active   3d
kube-public      Active   3d
kube-system      Active   3d
mynamespace      Active   1d

Creating Resources in the Namespace

When you create a resource in Kubernetes without specifying a namespace, it is automatically created in the current namespace.

For example, the following pod specification does not specify a namespace:

apiVersion: v1
kind: Pod
metadata:
  name: mypod
  labels:
    name: mypod
spec:
  containers:
  - name: mypod
    image: nginx

When you apply this pod specification, the following will happen:

  • If you did not create any namespaces in your cluster, the pod will be created in the default namespace.

  • If you created a namespace and have switched your kubectl context to it, the pod will be created in that namespace (see below for how to view and change the current namespace).
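
To check which namespace your current context points at, and to switch it, you can use kubectl config (mynamespace is a placeholder):

kubectl config view --minify --output 'jsonpath={..namespace}'
kubectl config set-context --current --namespace=mynamespace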

How can you explicitly create a resource in a specific namespace?

There are two ways to do this:

  1. Use the --namespace flag when creating the resource, like this:

kubectl apply -f pod.yaml --namespace=mynamespace

  2. Specify a namespace in the YAML specification of the resource. Here is what it looks like in a pod specification:

apiVersion: v1
kind: Pod
metadata:
  name: mypod
  namespace: mynamespace
  labels:
    name: mypod
...

Important notes:

  • Note that if your YAML specification specifies one namespace, but the apply command specifies another namespace, the command will fail.

  • If you try to work with a Kubernetes resource in a different namespace, kubectl will not find the resource. Use the --namespace flag to work with resources in other namespaces, like this:

kubectl get pods --namespace=mynamespace
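
To list pods across all namespaces at once, use the --all-namespaces flag (or its shorthand -A):

kubectl get pods --all-namespaces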

What Are Kubernetes Nodes?

A Kubernetes node is a worker machine that runs Kubernetes workloads. It can be a physical (bare metal) machine or a virtual machine (VM). Each node can host one or more pods. Kubernetes nodes are managed by a control plane, which automatically handles the deployment and scheduling of pods across nodes in a Kubernetes cluster. When scheduling pods, the control plane assesses the resources available on each node.

Each node runs several components, most notably a kubelet and a container runtime. The kubelet is in charge of facilitating communication between the control plane and the node. The container runtime is in charge of pulling the relevant container image from a registry, unpacking containers, running them on the node, and communicating with the operating system kernel.


Kubernetes Node Components

Here are three main Kubernetes node components:

kubelet

The Kubelet is responsible for managing the deployment of pods to Kubernetes nodes. It receives commands from the API server and instructs the container runtime to start or stop containers as needed.

kube-proxy

A network proxy running on each Kubernetes node. It is responsible for maintaining network rules on each node. Network rules enable network communication between nodes and pods. Kube-proxy can directly forward traffic or use the operating system packet filter layer.

Container runtime

The software layer responsible for running containers. Kubernetes supports several container runtimes, including containerd, CRI-O, and any other implementation of the Kubernetes Container Runtime Interface (CRI). (Docker Engine can be used through the cri-dockerd adapter.)
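
To see which runtime each node is using, check the CONTAINER-RUNTIME column of the wide node listing:

kubectl get nodes -o wide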

Working with Kubernetes Nodes: 4 Basic Operations

Here is how to perform common operations on a Kubernetes node.

1. Adding Node to a Cluster

You can manually add nodes to a Kubernetes cluster, or let the kubelet on that node self-register to the control plane. Once a node object is created manually or by the kubelet, the control plane validates the new node object.

Adding nodes automatically

The example below is a JSON manifest that creates a node object. After the object is created, Kubernetes checks whether a kubelet has registered with the API server that matches the node's metadata.name field. Only healthy nodes running all necessary services are eligible to run pods; if the check fails, the node is ignored until it becomes healthy.

{
  "kind": "Node",
  "apiVersion": "v1",
  "metadata": {
    "name": "<node-ip-address>",
    "labels": {
      "name": "<node-logical-name>"
    }
  }
}
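
Assuming the manifest above is saved to a file such as node.json (with the placeholders filled in), you can create the node object with:

kubectl apply -f node.json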

Defining node capacity

Nodes that self-register with the API server report their CPU and memory capacity after the node object is created. However, when creating a node manually, administrators need to define its capacity themselves. Once this information is defined, the Kubernetes scheduler assigns resources to all pods running on a node. The scheduler is responsible for ensuring that requests do not exceed node capacity.
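
You can inspect the capacity a node reports (and the allocatable portion the scheduler actually works with) using, for example:

kubectl get node my-node -o jsonpath='{.status.capacity}'
kubectl describe node my-node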

2. Modifying Node Objects

You can use kubectl to manually create or modify node objects, overriding the settings defined in --register-node. You can, for example:

  • Use labels on nodes and node selectors to control scheduling. For example, you can limit a pod to be eligible to run only on a subset of the available nodes (see the sketch after this list).

  • Mark a node as unschedulable to prevent the scheduler from adding new pods to the node. This action does not affect pods running on the node. You can use this option in preparation for maintenance tasks like node reboot. To mark the node as unschedulable, you can run: kubectl cordon $NODENAME.
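
A minimal sketch of label-based scheduling (the node name, label, and pod are placeholders). First label a node:

kubectl label nodes node1 disktype=ssd

Then reference the label from the pod specification:

apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  nodeSelector:
    disktype: ssd
  containers:
  - name: mypod
    image: nginx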

3. Checking Node Status

There are three primary commands you can use to determine the status of a node and the resources running on it.

kubectl describe nodes

Run the command kubectl describe nodes my-node to get node information including:

  • HostName—reported by the node operating system kernel. You can report a different value for HostName using the kubelet flag --hostname-override.

  • InternalIP—enables traffic to be routed to the node within the cluster.

  • ExternalIP—an IP address that can be used to access the node from outside the cluster.

  • Conditions—system resource issues including CPU and memory utilization. This section shows error conditions like OutOfDisk, MemoryPressure, and DiskPressure.

  • Events—this section shows issues occurring in the environment, such as eviction of pods.

kubectl describe pods

You can use this command to get information about pods running on a node:

  • Pod information—labels, resource requirements, and containers running in the pod

  • Pod ready state—if a pod appears as READY, it means it passed the last readiness check.

  • Container state—can be Waiting, Running, or Terminated.

  • Restart count—how often a container has been restarted.

  • Events—activity on the pod, indicating which component logged each event, the object it relates to, and a Reason and Message explaining what happened.

4. Understanding the Node Controller

The node controller is the control plane component responsible for managing several aspects of the node’s lifecycle. Here are the three main roles of the node controller:

Assigning CIDR addresses

When the node is registered, the node controller assigns it a Classless Inter-Domain Routing (CIDR) block (if CIDR assignment is enabled).

Updating internal node lists

The node controller maintains an internal list of nodes. This list must be kept up to date with the machines available from the cloud provider, and it enables the node controller to ensure that capacity is met.

When a node is unhealthy, the node controller checks if the host machine for that node is available. If the VM is not available, the node controller deletes the node from the internal list. If Kubernetes is running on a public or private cloud, the node controller can send a request to create a new node, to maintain cluster capacity.

Monitoring the health of nodes

Here are several tasks the node controller is responsible for:

  • Checking the state of all nodes periodically, with the period determined by the --node-monitor-period flag.

  • Updating the NodeReady condition to ConditionUnknown if the node becomes unreachable and the node controller no longer receives heartbeats.

  • Evicting all pods from a node that remains unreachable. The node controller uses graceful termination to evict the pods. By default it waits 40 seconds before reporting ConditionUnknown, and a further five minutes before it starts evicting pods.

Kubernetes Node Security

kubelet

This component is the main node agent for managing individual containers that run in a pod. Vulnerabilities associated with the kubelet are constantly discovered, meaning that you need to regularly upgrade the kubelet versions and apply the latest patches. Access to the kubelet is not authenticated by default, so you should implement strong authentication measures to restrict access.
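
For example, a hardened kubelet configuration might disable anonymous access, require webhook authentication and authorization against the API server, and close the read-only port. A partial KubeletConfiguration sketch (not a complete configuration):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
authentication:
  anonymous:
    enabled: false
  webhook:
    enabled: true
authorization:
  mode: Webhook
readOnlyPort: 0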

kube-proxy

This component handles request forwarding according to network rules. It is a network proxy that supports various protocols (e.g., TCP and UDP) and allows Kubernetes services to be exposed. There are two ways to secure kube-proxy:

  • If proxy configuration is maintained via the kubeconfig file, restrict file permissions to ensure unauthorized parties cannot tamper with proxy settings.

  • Ensure that communication with the API server is only done over a secured port, and always require authentication and authorization.

Hardened Node Security

You can harden your node security by following these steps:

  • Ensure the host is properly configured and secure—check your configuration to ensure it meets the CIS Benchmarks standards.

  • Control access to sensitive ports—ensure the network blocks access to ports that kubelet uses. Limit Kubernetes API server access to trusted networks.

  • Limit administrative access to nodes—ensure your Kubernetes nodes have restricted access. You can handle tasks like debugging and troubleshooting without direct access to the node.

Isolation of Sensitive Workloads

You should run any sensitive workload on dedicated machines to minimize the impact of a breach. Isolating workloads prevents an attacker from accessing sensitive applications through lower-priority applications on the same host or with the same container runtime. Attackers can only exploit the kubelet credentials of compromised nodes to access secrets that are mounted on those nodes. You can use controls such as node pools, namespaces, tolerations and taints to isolate your workloads.

Taints and Tolerations

Node affinity is a property of Pods that attracts them to a set of nodes (either as a preference or a hard requirement). Taints are the opposite -- they allow a node to repel a set of pods.

Tolerations are applied to pods. Tolerations allow the scheduler to schedule pods with matching taints. Tolerations allow scheduling but don't guarantee scheduling: the scheduler also evaluates other parameters as part of its function.

Taints and tolerations work together to ensure that pods are not scheduled onto inappropriate nodes. One or more taints are applied to a node; this marks that the node should not accept any pods that do not tolerate the taints.

Concepts

You add a taint to a node using kubectl taint. For example,

kubectl taint nodes node1 key1=value1:NoSchedule

places a taint on node node1. The taint has key key1, value value1, and taint effect NoSchedule. This means that no pod will be able to schedule onto node1 unless it has a matching toleration.

To remove the taint added by the command above, you can run:

kubectl taint nodes node1 key1=value1:NoSchedule-

You specify a toleration for a pod in the PodSpec. Both of the following tolerations "match" the taint created by the kubectl taint line above, and thus a pod with either toleration would be able to schedule onto node1:

tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"

tolerations:
- key: "key1"
  operator: "Exists"
  effect: "NoSchedule"

Here's an example of a pod that uses tolerations:

apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  tolerations:
  - key: "example-key"
    operator: "Exists"
    effect: "NoSchedule"

The default value for operator is Equal.

A toleration "matches" a taint if the keys are the same and the effects are the same, and:

  • the operator is Exists (in which case no value should be specified), or

  • the operator is Equal and the values are equal.

Note:

There are two special cases:

An empty key with operator Exists matches all keys, values and effects which means this will tolerate everything.

An empty effect matches all effects with key key1.

The above example used effect of NoSchedule. Alternatively, you can use effect of PreferNoSchedule. This is a "preference" or "soft" version of NoSchedule -- the system will try to avoid placing a pod that does not tolerate the taint on the node, but it is not required. The third kind of effect is NoExecute, described later.

You can put multiple taints on the same node and multiple tolerations on the same pod. The way Kubernetes processes multiple taints and tolerations is like a filter: start with all of a node's taints, then ignore the ones for which the pod has a matching toleration; the remaining un-ignored taints have the indicated effects on the pod. In particular,

  • if there is at least one un-ignored taint with effect NoSchedule then Kubernetes will not schedule the pod onto that node

  • if there is no un-ignored taint with effect NoSchedule but there is at least one un-ignored taint with effect PreferNoSchedule then Kubernetes will try to not schedule the pod onto the node

  • if there is at least one un-ignored taint with effect NoExecute then the pod will be evicted from the node (if it is already running on the node), and will not be scheduled onto the node (if it is not yet running on the node).

For example, imagine you taint a node like this

kubectl taint nodes node1 key1=value1:NoSchedule
kubectl taint nodes node1 key1=value1:NoExecute
kubectl taint nodes node1 key2=value2:NoSchedule

And a pod has two tolerations:

tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"

In this case, the pod will not be able to schedule onto the node, because there is no toleration matching the third taint. But it will be able to continue running if it is already running on the node when the taint is added, because the third taint is the only one of the three that is not tolerated by the pod.

Normally, if a taint with effect NoExecute is added to a node, then any pods that do not tolerate the taint will be evicted immediately, and pods that do tolerate the taint will never be evicted. However, a toleration with NoExecute effect can specify an optional tolerationSeconds field that dictates how long the pod will stay bound to the node after the taint is added. For example,

tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"
  tolerationSeconds: 3600

means that if this pod is running and a matching taint is added to the node, then the pod will stay bound to the node for 3600 seconds, and then be evicted. If the taint is removed before that time, the pod will not be evicted.

Example Use Cases

Taints and tolerations are a flexible way to steer pods away from nodes or evict pods that shouldn't be running. A few of the use cases are

  • Dedicated Nodes: If you want to dedicate a set of nodes for exclusive use by a particular set of users, you can add a taint to those nodes (say, kubectl taint nodes nodename dedicated=groupName:NoSchedule) and then add a corresponding toleration to their pods (this would be done most easily by writing a custom admission controller). The pods with the tolerations will then be allowed to use the tainted (dedicated) nodes as well as any other nodes in the cluster. If you want to dedicate the nodes to them and ensure they only use the dedicated nodes, then you should additionally add a label similar to the taint to the same set of nodes (e.g. dedicated=groupName), and the admission controller should additionally add a node affinity to require that the pods can only schedule onto nodes labeled with dedicated=groupName.

  • Nodes with Special Hardware: In a cluster where a small subset of nodes have specialized hardware (for example GPUs), it is desirable to keep pods that don't need the specialized hardware off of those nodes, thus leaving room for later-arriving pods that do need the specialized hardware. This can be done by tainting the nodes that have the specialized hardware (e.g. kubectl taint nodes nodename special=true:NoSchedule or kubectl taint nodes nodename special=true:PreferNoSchedule) and adding a corresponding toleration to pods that use the special hardware. As in the dedicated nodes use case, it is probably easiest to apply the tolerations using a custom admission controller. For example, it is recommended to use Extended Resources to represent the special hardware, taint your special hardware nodes with the extended resource name and run the ExtendedResourceToleration admission controller. Now, because the nodes are tainted, no pods without the toleration will schedule on them. But when you submit a pod that requests the extended resource, the ExtendedResourceToleration admission controller will automatically add the correct toleration to the pod and that pod will schedule on the special hardware nodes. This will make sure that these special hardware nodes are dedicated for pods requesting such hardware and you don't have to manually add tolerations to your pods.

  • Taint based Evictions: A per-pod-configurable eviction behavior when there are node problems, which is described in the next section.

Taint based Evictions

FEATURE STATE: Kubernetes v1.18 [stable]

The NoExecute taint effect, mentioned above, affects pods that are already running on the node as follows

  • pods that do not tolerate the taint are evicted immediately

  • pods that tolerate the taint without specifying tolerationSeconds in their toleration specification remain bound forever

  • pods that tolerate the taint with a specified tolerationSeconds remain bound for the specified amount of time

The node controller automatically taints a Node when certain conditions are true. The following taints are built in:

  • node.kubernetes.io/not-ready: Node is not ready. This corresponds to the NodeCondition Ready being "False".

  • node.kubernetes.io/unreachable: Node is unreachable from the node controller. This corresponds to the NodeCondition Ready being "Unknown".

  • node.kubernetes.io/memory-pressure: Node has memory pressure.

  • node.kubernetes.io/disk-pressure: Node has disk pressure.

  • node.kubernetes.io/pid-pressure: Node has PID pressure.

  • node.kubernetes.io/network-unavailable: Node's network is unavailable.

  • node.kubernetes.io/unschedulable: Node is unschedulable.

  • node.cloudprovider.kubernetes.io/uninitialized: When the kubelet is started with "external" cloud provider, this taint is set on a node to mark it as unusable. After a controller from the cloud-controller-manager initializes this node, the kubelet removes this taint.

In case a node is to be evicted, the node controller or the kubelet adds relevant taints with NoExecute effect. If the fault condition returns to normal the kubelet or node controller can remove the relevant taint(s).

In some cases when the node is unreachable, the API server is unable to communicate with the kubelet on the node. The decision to delete the pods cannot be communicated to the kubelet until communication with the API server is re-established. In the meantime, the pods that are scheduled for deletion may continue to run on the partitioned node.

Note: The control plane limits the rate of adding new taints to nodes. This rate limiting manages the number of evictions that are triggered when many nodes become unreachable at once (for example: if there is a network disruption).

You can specify tolerationSeconds for a Pod to define how long that Pod stays bound to a failing or unresponsive Node.

For example, you might want to keep an application with a lot of local state bound to the node for a long time in the event of a network partition, hoping that the partition will recover and thus the pod eviction can be avoided. The toleration you set for that Pod might look like:

tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 6000

Note:

Kubernetes automatically adds a toleration for node.kubernetes.io/not-ready and node.kubernetes.io/unreachable with tolerationSeconds=300, unless you, or a controller, set those tolerations explicitly.

These automatically-added tolerations mean that Pods remain bound to Nodes for 5 minutes after one of these problems is detected.

DaemonSet pods are created with NoExecute tolerations for the following taints with no tolerationSeconds:

  • node.kubernetes.io/unreachable

  • node.kubernetes.io/not-ready

This ensures that DaemonSet pods are never evicted due to these problems.

Taint Nodes by Condition

The control plane, using the node controller, automatically creates taints with a NoSchedule effect for node conditions.

The scheduler checks taints, not node conditions, when it makes scheduling decisions. This ensures that node conditions don't directly affect scheduling. For example, if the DiskPressure node condition is active, the control plane adds the node.kubernetes.io/disk-pressure taint and does not schedule new pods onto the affected node. If the MemoryPressure node condition is active, the control plane adds the node.kubernetes.io/memory-pressure taint.

You can ignore node conditions for newly created pods by adding the corresponding Pod tolerations. The control plane also adds the node.kubernetes.io/memory-pressure toleration on pods that have a QoS class other than BestEffort. This is because Kubernetes treats pods in the Guaranteed or Burstable QoS classes (even pods with no memory request set) as if they are able to cope with memory pressure, while new BestEffort pods are not scheduled onto the affected node.

The DaemonSet controller automatically adds the following NoSchedule tolerations to all daemons, to prevent DaemonSets from breaking.

  • node.kubernetes.io/memory-pressure

  • node.kubernetes.io/disk-pressure

  • node.kubernetes.io/pid-pressure (1.14 or later)

  • node.kubernetes.io/unschedulable (1.10 or later)

  • node.kubernetes.io/network-unavailable (host network only)

Adding these tolerations ensures backward compatibility. You can also add arbitrary tolerations to DaemonSets.

Node Maintenance: Cordons and Drains

Kubernetes Nodes need occasional maintenance. You could be updating the Node’s kernel, resizing its compute resource in your cloud account, or replacing physical hardware components in a self-hosted installation.

Kubernetes cordons and drains are two mechanisms you can use to safely prepare for Node downtime. They allow workloads running on a target Node to be rescheduled onto other ones. You can then shutdown the Node or remove it from your cluster without impacting service availability.

Applying a Node Cordon

Cordoning a Node marks it as unavailable to the Kubernetes scheduler. The Node will be ineligible to host any new Pods subsequently added to your cluster.

Use the kubectl cordon command to place a cordon around a named Node:

$ kubectl cordon node-1
node/node-1 cordoned

Existing Pods already running on the Node won’t be affected by the cordon. They’ll remain accessible and will still be hosted by the cordoned Node.

You can check which of your Nodes are currently cordoned with the get nodes command:

$ kubectl get nodes
NAME       STATUS                     ROLES                  AGE   VERSION
node-1     Ready,SchedulingDisabled   control-plane,master   26m   v1.23.3

Cordoned nodes appear with the SchedulingDisabled status.

Draining a Node

The next step is to drain remaining Pods out of the Node. This procedure will evict the Pods so they’re rescheduled onto other Nodes in your cluster. Pods are allowed to gracefully terminate before they’re forcefully removed from the target Node.

Run kubectl drain to initiate a drain procedure. Specify the name of the Node you’re taking out for maintenance:

$ kubectl drain node-1
node/node-1 already cordoned
evicting pod kube-system/storage-provisioner
evicting pod default/nginx-7c658794b9-zszdd
evicting pod kube-system/coredns-64897985d-dp6lx
pod/storage-provisioner evicted
pod/nginx-7c658794b9-zszdd evicted
pod/coredns-64897985d-dp6lx evicted
node/node-1 evicted

The drain procedure first cordons the Node if you’ve not already placed one manually. It will then evict running Kubernetes workloads by safely rescheduling them to other Nodes in your cluster.

You can shutdown or destroy the Node once the drain’s completed. You’ve freed the Node from its responsibilities to your cluster. The cordon provides an assurance that no new workloads have been scheduled since the drain completed.
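
When the maintenance is finished and the Node is ready to rejoin the cluster, remove the cordon so the scheduler can place new Pods on it again:

$ kubectl uncordon node-1
node/node-1 uncordoned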

Ignoring Pod Grace Periods

Drains can sometimes take a while to complete if your Pods have long grace periods. This might not be ideal when you need to urgently take a Node offline. Use the --grace-period flag to override Pod termination grace periods and force an immediate eviction:

$ kubectl drain node-1 --grace-period 0

This should be used with care – some workloads might not respond well if they’re stopped without being offered a chance to clean up.
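
Another common situation: kubectl drain refuses to proceed if it encounters DaemonSet-managed Pods or Pods using emptyDir volumes. The flags below tell it to skip DaemonSet Pods and delete emptyDir data (use the latter with care, as that data is lost):

$ kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data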

Pod lifecycle topics:

  • Understanding the pod phase

  • The pod’s conditions during its lifecycle

  • Understanding the container state

  • Configuring the pod’s restart policy

What are the Three Types of Kubernetes Probes?

Kubernetes provides the following types of probes. For each of these types, if the container does not implement the probe handler, the result is always Success.

  • Liveness Probe—indicates if the container is operating. If so, no action is taken. If not, the kubelet kills and restarts the container.

  • Readiness Probe—indicates whether the application running in the container is ready to accept requests. If so, Services matching the pod are allowed to send traffic to it. If not, the endpoints controller removes the pod from all matching Kubernetes Services.

  • Startup Probe—indicates whether the application running in the container has started. If so, other probes start functioning. If not, the kubelet kills and restarts the container.

When to Use Readiness Probes

Readiness probes are most useful when an application is temporarily malfunctioning and unable to serve traffic. If the application is running but not fully available, Kubernetes may not be able to scale it up and new deployments could fail. A readiness probe allows Kubernetes to wait until the service is active before sending it traffic.

When you use a readiness probe, keep in mind that Kubernetes will only send traffic to the pod if the probe succeeds.

There is no need to use a readiness probe on deletion of a pod. When a pod is deleted, it automatically puts itself into an unready state, regardless of whether readiness probes are used. It remains in this status until all containers in the pod have stopped.

How Readiness Probes Work in Kubernetes

A readiness probe can be deployed as part of several Kubernetes objects. For example, here is how to define a readiness probe in a Deployment:
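
A minimal sketch (the application name, image, and health endpoint are placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 2
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: myapp:latest
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10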

Once the above Deployment object is applied to the cluster, the readiness probe runs continuously throughout the lifecycle of the application.

A readiness probe has the following configuration options:
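
The standard tuning fields, shown here on an illustrative httpGet probe with their default values (the comments describe what each field controls):

readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 0   # seconds to wait after the container starts before probing
  periodSeconds: 10        # how often to run the probe
  timeoutSeconds: 1        # how long to wait for a response before the probe fails
  successThreshold: 1      # consecutive successes required to mark the pod ready
  failureThreshold: 3      # consecutive failures required to mark the pod not ready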

Why Do Readiness Probes Fail? Common Error Scenarios

Readiness probes run continuously throughout the container lifecycle, so if a probe’s response is interrupted or delayed, traffic to the pod may be interrupted. Keep in mind that if a readiness probe returns Failure status, Kubernetes will remove the pod from all matching service endpoints. Here are two examples of conditions that can cause an application to incorrectly fail the readiness probe.

Delayed Response

In some circumstances, readiness probes may be late to respond—for example, if the application needs to read large amounts of data with low latency or perform heavy computations. Consider this behavior when configuring readiness probes, and always test your application thoroughly before running it in production with a readiness probe.

Cascading Failures

A readiness probe response can be conditional on components that are outside the direct control of the application. For example, you could configure a readiness probe using HTTPGet, in such a way that the application first checks the availability of a cache service or database before responding to the probe. This means that if the database is down or late to respond, the entire application will become unavailable.

This may or may not make sense, depending on your application setup. If the application cannot function at all without the third-party component, maybe this behavior is warranted. If it can continue functioning, for example, by falling back to a local cache, the database or external cache should not be connected to probe responses.

In general, if the pod is technically ready, even if it cannot function perfectly, it should not fail the readiness probe. A good compromise is to implement a “degraded mode,” for example, if there is no access to the database, answer read requests that can be addressed by local cache and return 503 (service unavailable) on write requests. Ensure that downstream services are resilient to a failure in the upstream service.

Methods of Checking Application Health

Startup, readiness, and liveness probes can check the health of applications in three ways: HTTP checks, container execution checks, and TCP socket checks.

HTTP Checks

An HTTP check is ideal for applications that return HTTP status codes, such as REST APIs.

An HTTP probe uses GET requests to check the health of an application. The check is successful if the HTTP response code is in the range 200-399.

The following example demonstrates how to implement a readiness probe with the HTTP check method:

...contents omitted...
readinessProbe:
  httpGet:
    path: /health (1)
    port: 8080
  initialDelaySeconds: 15 (2)
  timeoutSeconds: 1 (3)
...contents omitted...
  1. The readiness probe endpoint.

  2. How long to wait after the container starts before checking its health.

  3. How long to wait for the probe to finish.

Container Execution Checks

Container execution checks are ideal in scenarios where you must determine the status of the container based on the exit code of a process or shell script running in the container.

When using container execution checks Kubernetes executes a command inside the container. Exiting the check with a status of 0 is considered a success. All other status codes are considered a failure.

The following example demonstrates how to implement a container execution check:

...contents omitted...
livenessProbe:
  exec:
    command:(1)
    - cat
    - /tmp/health
  initialDelaySeconds: 15
  timeoutSeconds: 1
...contents omitted...
  1. The command to run and its arguments, as a YAML array.

TCP Socket Checks

A TCP socket check is ideal for applications that run as daemons, and open TCP ports, such as database servers, file servers, web servers, and application servers.

When using TCP socket checks Kubernetes attempts to open a socket to the container. The container is considered healthy if the check can establish a successful connection.

The following example demonstrates how to implement a liveness probe by using the TCP socket check method:

...contents omitted...
livenessProbe:
  tcpSocket:
    port: 8080 (1)
  initialDelaySeconds: 15
  timeoutSeconds: 1
...contents omitted...
  1. The TCP port to check.

Creating Probes

To configure probes on a deployment, edit the deployment’s resource definition. To do this, you can use the kubectl edit or kubectl patch commands. Alternatively, if you already have a deployment YAML definition, you can modify it to include the probes and then apply it with kubectl apply.

The following example demonstrates using the kubectl patch command to add an HTTP readiness probe to a deployment (this assumes the container in the pod template is also named myapp):

[user@host ~]$ kubectl patch deployment myapp \
-p '{"spec":{"template":{"spec":{"containers":[{"name":"myapp","readinessProbe":{"httpGet":{"path":"/healthz","port":8080},"periodSeconds":20}}]}}}}'

The following examples demonstrate the set probe command with a variety of options. Note that set probe is provided by the OpenShift CLI (oc); plain kubectl does not include this subcommand:

[user@host ~]$ oc set probe deployment myapp --readiness \
--get-url=http://:8080/healthz --period=20
[user@host ~]$ oc set probe deployment myapp --liveness \
--open-tcp=3306 --period=20 \
--timeout-seconds=1
[user@host ~]$ oc set probe deployment myapp --liveness \
--get-url=http://:8080/healthz --initial-delay-seconds=30 \
--success-threshold=1 --failure-threshold=3

Assign Pods to Nodes using Node Affinity

This page shows how to assign a Kubernetes Pod to a particular node using Node Affinity in a Kubernetes cluster.

Before you begin

You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube, or you can use one of the online Kubernetes playgrounds.

Your Kubernetes server must be at or later than version v1.10. To check the version, enter kubectl version.

Add a label to a node

  1. List the nodes in your cluster, along with their labels:

    kubectl get nodes --show-labels

    The output is similar to this:

    NAME      STATUS    ROLES    AGE     VERSION        LABELS
    worker0   Ready     <none>   1d      v1.13.0        ...,kubernetes.io/hostname=worker0
    worker1   Ready     <none>   1d      v1.13.0        ...,kubernetes.io/hostname=worker1
    worker2   Ready     <none>   1d      v1.13.0        ...,kubernetes.io/hostname=worker2
  2. Choose one of your nodes, and add a label to it:

    kubectl label nodes <your-node-name> disktype=ssd

    where <your-node-name> is the name of your chosen node.

  3. Verify that your chosen node has a disktype=ssd label:

    kubectl get nodes --show-labels

    The output is similar to this:

    NAME      STATUS    ROLES    AGE     VERSION        LABELS
    worker0   Ready     <none>   1d      v1.13.0        ...,disktype=ssd,kubernetes.io/hostname=worker0
    worker1   Ready     <none>   1d      v1.13.0        ...,kubernetes.io/hostname=worker1
    worker2   Ready     <none>   1d      v1.13.0        ...,kubernetes.io/hostname=worker2

    In the preceding output, you can see that the worker0 node has a disktype=ssd label.

Schedule a Pod using required node affinity

This manifest describes a Pod that has a requiredDuringSchedulingIgnoredDuringExecution node affinity, disktype: ssd. This means that the pod will get scheduled only on a node that has a disktype=ssd label.

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd            
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  1. Apply the manifest to create a Pod that is scheduled onto your chosen node:

    kubectl apply -f https://k8s.io/examples/pods/pod-nginx-required-affinity.yaml
  2. Verify that the pod is running on your chosen node:

    kubectl get pods --output=wide

    The output is similar to this:

    NAME     READY     STATUS    RESTARTS   AGE    IP           NODE
    nginx    1/1       Running   0          13s    10.200.0.4   worker0

Schedule a Pod using preferred node affinity

This manifest describes a Pod that has a preferredDuringSchedulingIgnoredDuringExecution node affinity, disktype: ssd. This means that the pod will prefer a node that has a disktype=ssd label.

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd          
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  1. Apply the manifest to create a Pod that is scheduled onto your chosen node:

    kubectl apply -f https://k8s.io/examples/pods/pod-nginx-preferred-affinity.yaml
  2. Verify that the pod is running on your chosen node:

    kubectl get pods --output=wide

    The output is similar to this:

    NAME     READY     STATUS    RESTARTS   AGE    IP           NODE
    nginx    1/1       Running   0          13s    10.200.0.4   worker0
