Kubernetes API server unexpectedly stops responding

I have a managed Kubernetes cluster in Azure (AKS). There is one pod running a simple web service that responds to REST API calls from outside and calls the Kubernetes API server. These calls list and create some jobs.

For AKS, I have the ‘advanced’ (Azure CNI) networking with a custom routing table that redirects traffic to a virtual appliance – this is my company’s setup.

I’m using the official Python client for Kubernetes. The calls look like:

k8s_batch_api_client = client.BatchV1Api()
jobs = k8s_batch_api_client.list_namespaced_job(namespace = 'default')

So nothing special.

Most of the time, everything is working fine. However, from time to time, the Kubernetes API server just doesn’t respond to the requests, so my pod’s web service gets restarted after a timeout (it runs a gunicorn-based web server).

I installed tcpdump on my pod and sniffed the TCP traffic. I’m not a networking nerd, so bear with me.

The Python client keeps a TCP connection pool (using the urllib3 library). And it seems that the Kubernetes API server just silently ‘loses’ a TCP connection, just doesn’t react anymore without closing the connection.

In Wireshark, I see this for a working request-response:

2438   09:41:50,796695     TLSv1.3   1614   Application Data
2439   09:41:50,798552   TCP       66     443 → 56480 [ACK]
2440   09:41:50,804064   TLSv1.3   2196   Application Data is my pod, is the Kubernetes API server. We see a request and a response here.

But then:

2469   09:48:48,853533      TLSv1.3   1580   Application Data
2470   09:48:48,853604      TLSv1.3   1279   Application Data
2471   09:48:48,868222      TCP       1279   [TCP Retransmission] 56480 → 443 [PSH, ACK]
2472   09:48:49,076276      TCP       1452   [TCP Retransmission] 56480 → 443 [ACK]
... lots of retransmissions...

I see no FIN TCP packet from the Kubernetes API server (which would mean, the server wants to close the connection).

After restarting (2 minutes of retransmissions -> reboot), my pod can establish a connection to the API server right away – so the API server itself isn’t overloaded.

The same app runs without any issues on my local Minikube cluster (but there’s of course only one node, so not really representative).

How can I investigate the issue further? Can it be caused by the client side (by my pod or by the Python client)? Is there any special setting I must change on AKS or on my client side to avoid this? Does it look like a ‘server bug’ or a ‘network issue’?

Go to Source
Author: dymanoid