First Steps in Troubleshooting

Check Status of Cluster:

  • Use kubectl get nodes to make sure that all nodes are in the ‘Ready’ state.
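
On a cluster with many nodes, it can help to print only the nodes that need attention. The sketch below assumes the default column layout of kubectl get nodes (NAME in column 1, STATUS in column 2); not_ready_nodes is a hypothetical helper name, not part of kubectl.

```shell
# Hypothetical helper: reads `kubectl get nodes` output on stdin and prints
# the name of every node whose STATUS is not exactly "Ready".
# Cordoned nodes ("Ready,SchedulingDisabled") are reported as well.
not_ready_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Usage (requires access to a cluster):
#   kubectl get nodes | not_ready_nodes
```

An empty result means every node reports plain Ready.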

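Review Recent Events:

  • Use kubectl get events -n <namespace> to see what has happened recently in a namespace. Events are namespace-scoped, so add --all-namespaces to cover the whole cluster, and --sort-by=.lastTimestamp to list them in chronological order.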

Access Logs:

  • Use kubectl logs <pod-name> to look at recent logs and learn about possible problems.

Troubleshooting at the Node Level

Node Conditions:
  • Use kubectl describe node <node-name> to see the node’s conditions. Check for DiskPressure, MemoryPressure, or PIDPressure.
Resource Usage:
  • Run top or free directly on the node to see how CPU and memory are being used.
Kubelet Status:
  • Make sure that the node’s kubelet service is running (systemctl status kubelet on systemd-based nodes). Use journalctl -u kubelet to look at its logs.
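
The node conditions can also be read in a compact, script-friendly form. The sketch below assumes the conditions are printed one per line as Type=Status, which the jsonpath expression in the usage comment produces; unhealthy_conditions is a hypothetical helper name.

```shell
# Hypothetical helper: reads "Type=Status" lines on stdin and prints the
# conditions that indicate trouble: Ready when it is not True, and any
# other condition (MemoryPressure, DiskPressure, PIDPressure, ...) when
# it is True.
unhealthy_conditions() {
  awk -F= '($1 == "Ready" && $2 != "True") || ($1 != "Ready" && $2 == "True") { print $1 }'
}

# Usage (requires access to a cluster):
#   kubectl get node <node-name> \
#     -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' \
#     | unhealthy_conditions
```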

Troubleshooting at the Pod Level

Pod Status:
  • Run kubectl describe pod <pod-name> to find out more about the state and events of a pod.
Container Inspections:
  • Use kubectl logs <pod-name> -c <container-name> to look at the logs for a particular container.
Pod Restart Problems:
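  • For a pod stuck in CrashLoopBackOff, use kubectl logs <pod-name> --previous to read the logs of the last crashed container, and kubectl describe pod <pod-name> to see its exit code and restart count.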
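
To spot restart problems across a whole namespace, the pod list can be filtered by restart count. This is a minimal sketch that assumes the default kubectl get pods columns (NAME, READY, STATUS, RESTARTS, AGE) without --all-namespaces; restarting_pods and its default threshold of 3 are illustrative choices, not kubectl features.

```shell
# Hypothetical helper: reads `kubectl get pods` output on stdin and prints
# "<pod> <restart-count>" for every pod restarted at least $1 times
# (default: 3). A RESTARTS value such as "7 (2m ago)" is handled because
# the count is still the fourth whitespace-separated field.
restarting_pods() {
  awk -v min="${1:-3}" 'NR > 1 && $4 + 0 >= min { print $1, $4 }'
}

# Usage (requires access to a cluster):
#   kubectl get pods -n <namespace> | restarting_pods 5
```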

Network Troubleshooting

Service Discovery:
  • Use kubectl get endpoints to make sure services are properly pointing to pods.
Pod-to-Pod Communication:
  • Use tools like ping or curl from inside a pod (via kubectl exec) to test whether it can reach other pods and services.
Network Policies:
  • Use kubectl get networkpolicy to look over the policies and make sure that the desired network traffic paths are allowed.
DNS Problems:
  • Make sure the CoreDNS or kube-dns pods are running in the kube-system namespace and that service names resolve correctly from inside a pod.
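
A DNS check needs the full name a Service is expected to resolve under. The sketch below builds that name, assuming the default cluster domain cluster.local (many clusters use it, but it is configurable); service_fqdn is a hypothetical helper name.

```shell
# Hypothetical helper: prints the fully qualified DNS name of a Service.
#   $1 = service name, $2 = namespace, $3 = cluster domain (optional,
#   assumed to be "cluster.local" when omitted)
service_fqdn() {
  printf '%s.%s.svc.%s\n' "$1" "$2" "${3:-cluster.local}"
}

# Usage (requires access to a cluster and a pod whose image ships nslookup,
# such as busybox):
#   kubectl exec <pod-name> -- nslookup "$(service_fqdn kubernetes default)"
#   kubectl get pods -n kube-system -l k8s-app=kube-dns
```

If the lookup fails while the DNS pods are running, the network policies and the pod’s /etc/resolv.conf are reasonable next places to look.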

Storage Troubleshooting

PV & PVC Status:
  • Use kubectl get pv,pvc to check the binding status of persistent volumes and claims.
Access Modes:
  • Make sure the access modes requested by the PVC are supported by the provisioned PV.
Storage Class Problems:
  • Make sure the PVC names the right storage class and that the class’s provisioner is working.
Mount Problems:
  • Use kubectl describe pod <pod-name> and look in the Events section for volume attach or mount failures.

Advanced Troubleshooting in Kubernetes

Kubernetes has many interacting components, and a fault in any of them can surface in unexpected ways. The scenarios below reproduce common failures on purpose and then walk through the commands used to diagnose them.

Preliminary Troubleshooting Steps
1. Examining Cluster Health
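  • Description: In this scenario, we’ll intentionally taint a node with the NoSchedule effect, which keeps new pods without a matching toleration off the node, and then inspect its state. The node still reports Ready in kubectl get nodes; the taint appears in the Taints field of kubectl describe node.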

  • Scenario Creation:

    kubectl taint nodes test-node key=value:NoSchedule
  • Troubleshooting:

    kubectl get nodes
    kubectl describe node test-node
2. Log Analysis
  • Description: A simulated faulty application will be deployed, which will exit immediately after logging an error.

  • Scenario Creation:

    kubectl run faulty-app --image=busybox --command -- /bin/sh -c "echo 'Error: Something went wrong!' && exit 1"
  • Troubleshooting:

    kubectl logs faulty-app
Node-level Troubleshooting
1. Node Resource Exhaustion
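  • Description: A pod requesting a large amount of memory will be scheduled. The request reserves that memory on the node whether or not the pod uses it, so it can exhaust the node’s allocatable capacity; if no node can satisfy the request, the pod stays Pending.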

  • Scenario Creation:

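    # The memory request is set through --overrides, since current kubectl releases have no --requests flag on kubectl run.
    kubectl run resource-hog --image=busybox --overrides='{"spec":{"containers":[{"name":"resource-hog","image":"busybox","command":["/bin/sh","-c","while true; do sleep 1; done"],"resources":{"requests":{"memory":"800Mi"}}}]}}'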
  • Troubleshooting:

    kubectl describe node test-node
Pod-level Troubleshooting
1. Crashing Pods
  • Description: Investigate the reasons behind a crashing pod (this scenario has been previously set up in the log analysis example).

  • Troubleshooting:

    kubectl describe pod faulty-app
2. Pod Access
  • Description: Deploy a simple pod and access its shell, ensuring that there are no access-related issues.

  • Scenario Creation:

    kubectl run simple-pod --image=busybox --command -- /bin/sh -c "sleep 3600"
  • Troubleshooting:

    kubectl exec -it simple-pod -- /bin/sh
Network Troubleshooting
1. Networking Issues
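  • Description: Create two nginx pods in different namespaces, expose the second one through a Service, and attempt to reach it from the first.

  • Scenario Creation:

    kubectl create namespace ns1
    kubectl create namespace ns2
    kubectl run nginx1 --image=nginx --namespace=ns1
    # --expose creates a ClusterIP Service named nginx2, which gives the pod a stable DNS name
    kubectl run nginx2 --image=nginx --namespace=ns2 --port=80 --expose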
  • Troubleshooting:

    kubectl exec -it -n ns1 nginx1 -- curl nginx2.ns2.svc.cluster.local
2. Network Policies
  • Description: Establish a network policy that blocks incoming traffic and diagnose its impact.

  • Scenario Creation:

    kubectl apply -f- <<EOF
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: block-all
    spec:
      podSelector: {}
      policyTypes:
      - Ingress
    EOF
  • Troubleshooting:

    kubectl get networkpolicies
    kubectl describe networkpolicy block-all
Storage Troubleshooting
1. PV and PVC Binding
  • Description: Simulate a mismatch between a PersistentVolume and a PersistentVolumeClaim: the volume offers only ReadWriteOnce while the claim requests ReadWriteMany, so the claim cannot bind and stays Pending.

  • Scenario Creation:

    1. Create a PersistentVolume that offers only ReadWriteOnce:

    kubectl apply -f- <<EOF
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: example-pv
    spec:
      storageClassName: manual
      capacity:
        storage: 1Gi
      accessModes:
      - ReadWriteOnce
      hostPath:
        path: "/tmp"
    EOF

    2. Create a PersistentVolumeClaim that requests ReadWriteMany:

    kubectl apply -f- <<EOF
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: example-pvc
    spec:
      storageClassName: manual
      accessModes:
      - ReadWriteMany
      resources:
        requests:
          storage: 1Gi
    EOF
  • Troubleshooting:

    kubectl describe pvc example-pvc
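
The claim in this scenario stays Pending because a volume can only satisfy a claim if it offers every access mode the claim requests. The sketch below expresses that rule as a small check; modes_satisfied is a hypothetical helper that takes comma-separated mode lists, and it models only the access-mode part of binding (storage class and capacity must match as well).

```shell
# Hypothetical helper: succeeds (exit 0) when every access mode requested by
# a claim is offered by a volume, and fails (exit 1) otherwise.
#   $1 = modes requested by the PVC, comma-separated
#   $2 = modes offered by the PV, comma-separated
modes_satisfied() {
  requested=$1
  offered=$2
  old_ifs=$IFS
  IFS=,
  for mode in $requested; do
    case ",$offered," in
      *",$mode,"*) ;;                      # this mode is offered, keep checking
      *) IFS=$old_ifs; return 1 ;;         # a requested mode is missing
    esac
  done
  IFS=$old_ifs
  return 0
}

# The mismatch from this scenario: the claim asks for ReadWriteMany,
# the volume offers only ReadWriteOnce.
if ! modes_satisfied ReadWriteMany ReadWriteOnce; then
  echo "access modes do not match: claim stays Pending"
fi
```

Changing the claim to request ReadWriteOnce, or recreating the volume with ReadWriteMany, removes the mismatch.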