Kubernetes High Availability Best Practices: Mastering Pod Distribution


When running applications in Kubernetes, ensuring high availability isn’t just about having multiple replicas – it’s about intelligently distributing those replicas across your infrastructure. In this blog post, we’ll dive deep into advanced pod distribution strategies that help maintain application resilience and optimal performance. Assuming that Kubernetes will always keep these pods separated automatically is a big mistake: if pod distribution is not explicitly configured, the scheduler can place several replicas of the same application on the same node.

Understanding Pod Distribution Challenges

Before we explore solutions, let’s understand the challenges:

Pod Anti-Affinity: Keeping Pods Apart

Pod anti-affinity is one of the most powerful tools for ensuring high availability. It allows you to define rules that prevent pods from being scheduled together based on specified criteria. The diagram below shows what soft and hard anti-affinity do.

Here’s an example of how to implement hard pod anti-affinity:

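apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server
spec:
  replicas: 3
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          # Hard rule: never co-locate two app=web pods on one node
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web
            topologyKey: kubernetes.io/hostname
      containers:
      - name: web      # placeholder container, not from the original example
        image: nginx   # placeholder image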

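This configuration ensures that pods with the label app: web will never be scheduled on the same node. Because the rule is hard, a replica that cannot find a node without an app: web pod stays Pending, so the cluster needs at least as many schedulable nodes as replicas. For more flexibility, you can use preferredDuringSchedulingIgnoredDuringExecution to implement soft anti-affinity rules.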
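
The soft variant mentioned above can be sketched as follows. This is a minimal pod template fragment, assuming the same app: web label and node-level topology key as the hard example. The weight of 100 is an illustrative choice (valid values are 1 to 100), not a value taken from this post.

```yaml
# Fragment of spec.template.spec in the Deployment; not a complete manifest.
affinity:
  podAntiAffinity:
    # Soft rule: a preference the scheduler weighs, not a requirement
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100            # illustrative; valid range is 1-100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: web
        topologyKey: kubernetes.io/hostname
```

With this rule the scheduler prefers nodes that do not already run an app: web pod, but it still places the replica on an occupied node when no better node fits, so pods do not stay Pending because of the rule.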

Topology Spread Constraints: Even Distribution Across Your Cluster

While anti-affinity helps keep pods apart, topology spread constraints ensure even distribution across your infrastructure. This is particularly important in multi-zone clusters. The diagram below shows how pod topology spread helps keep a service resilient to node failure:

Here’s an example of implementing topology spread constraints:

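apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server
spec:
  replicas: 6
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web       # must match the constraint's labelSelector below
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: web
      containers:
      - name: web      # placeholder container, not from the original example
        image: nginx   # placeholder image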

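This configuration ensures that, at scheduling time, the number of app: web pods in any two zones differs by at most one (maxSkew: 1), and that a pod which would violate this limit stays Pending instead of being placed in an overloaded zone (whenUnsatisfiable: DoNotSchedule).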
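
Spread constraints can also be stacked. As a sketch, assuming the same app: web label, the fragment below keeps the hard zone-level constraint and adds a soft node-level one. Using ScheduleAnyway for the node level is an assumption made here for illustration, not a setting from this post.

```yaml
# Fragment of spec.template.spec in the Deployment; not a complete manifest.
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule     # hard: violating pods stay Pending
  labelSelector:
    matchLabels:
      app: web
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: ScheduleAnyway    # soft: scheduler prefers nodes that reduce skew
  labelSelector:
    matchLabels:
      app: web
```

With 6 replicas and 3 zones that all have eligible nodes, the hard zone constraint leads to 2 pods per zone at scheduling time. The soft node constraint then nudges the pods within each zone onto different nodes without blocking scheduling when a zone has only one node available.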

Combining Strategies for Maximum Resilience

For optimal high availability, combine both approaches:

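apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server
spec:
  replicas: 6
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          # Node level: at most one app=web pod per node
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: web
            topologyKey: kubernetes.io/hostname
      # Zone level: per-zone pod counts differ by at most one
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: web
      containers:
      - name: web      # placeholder container, not from the original example
        image: nginx   # placeholder image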

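This configuration provides node-level isolation, because the hard anti-affinity rule allows at most one app: web pod per node, and zone-level balance, because the spread constraint keeps the per-zone pod counts within one of each other. With 6 replicas it therefore needs at least 6 schedulable nodes.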

Best Practices and Recommendations

  1. Start with Soft Constraints
    Begin with soft anti-affinity rules and topology spread constraints during initial deployment. This provides flexibility while you evaluate the impact on your cluster.

  2. Monitor Pod Distribution
    Regular monitoring of pod distribution is crucial. Use tools like:

   kubectl get pods -o wide
   kubectl describe nodes | grep -A5 "Non-terminated Pods"

  3. Consider Resource Requirements
    When implementing distribution strategies, account for:

  4. Plan for Failure Scenarios
    Test your configuration under different failure scenarios:

Summary

Implementing proper pod distribution strategies is crucial for maintaining high availability in Kubernetes clusters, especially for mission-critical services. By combining pod anti-affinity with topology spread constraints, you can create resilient applications that can withstand various types of infrastructure failures.

Remember that these configurations should be tailored to your specific needs, considering factors like:

Regular testing and monitoring will help ensure your chosen strategies effectively maintain the desired level of availability for your applications.

Such an approach also helps us plan maintenance that requires no downtime, for example cluster patching as in this post.