I have a managed Kubernetes cluster in Azure (AKS). One pod runs a simple web service that responds to REST API calls from outside and, in turn, calls the Kubernetes API server to list and create jobs.
For AKS, I have the ‘advanced’ (Azure CNI) networking with a custom route table that redirects traffic through a virtual appliance; this is my company’s setup.
I’m using the official Python client for Kubernetes. The calls look like this:

from kubernetes import client, config

# Use the service account credentials mounted into the pod
config.load_incluster_config()

k8s_batch_api_client = client.BatchV1Api()
jobs = k8s_batch_api_client.list_namespaced_job(namespace='default')
So nothing special.
Most of the time, everything works fine. From time to time, however, the Kubernetes API server simply stops responding to requests, so my pod’s web service gets restarted after a timeout (it runs a gunicorn-based web server).
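As a stopgap, I could at least make a stuck call fail fast instead of hanging until gunicorn kills the worker, by passing an explicit request timeout. A minimal sketch; the timeout values are arbitrary guesses:

# _request_timeout takes a single float or a (connect, read) tuple in seconds;
# with it set, a call on a dead connection raises instead of blocking forever
jobs = k8s_batch_api_client.list_namespaced_job(
    namespace='default',
    _request_timeout=(3, 30),  # 3 s to connect, 30 s to read
)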
I installed tcpdump on my pod and sniffed the TCP traffic. I’m not a networking nerd, so bear with me.
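For reference, the capture was done with something like this (the interface name and output path are just what I used in my pod):

# capture everything to/from the API server endpoint into a file for Wireshark
tcpdump -i eth0 -w /tmp/apiserver.pcap host 192.168.0.1 and port 443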
The Python client keeps a pool of TCP connections (via the urllib3 library). And it seems that the Kubernetes API server just silently ‘loses’ a TCP connection: it stops reacting without ever closing it.
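If some intermediate hop (a load balancer or our virtual appliance) silently drops idle flows, enabling TCP keepalives on the pooled sockets might let the client detect a dead connection instead of reusing it. A sketch of what I have in mind, assuming Linux socket option names; it would have to run before the client opens any connections:

import socket
from urllib3.connection import HTTPConnection

# Send keepalive probes on idle connections so a silently dropped flow
# is detected and the pooled connection is discarded rather than reused
HTTPConnection.default_socket_options = HTTPConnection.default_socket_options + [
    (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),     # enable keepalives
    (socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60),   # first probe after 60 s idle
    (socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30),  # then probe every 30 s
    (socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3),     # drop after 3 failed probes
]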
In Wireshark, I see this for a working request-response:
2438 09:41:50,796695 10.214.140.39 192.168.0.1 TLSv1.3 1614 Application Data
2439 09:41:50,798552 192.168.0.1 10.214.140.39 TCP 66 443 → 56480 [ACK]
2440 09:41:50,804064 192.168.0.1 10.214.140.39 TLSv1.3 2196 Application Data
10.214.140.39 is my pod, 192.168.0.1 is the Kubernetes API server. We see a request and a response here.
But then:
2469 09:48:48,853533 10.214.140.39 192.168.0.1 TLSv1.3 1580 Application Data
2470 09:48:48,853604 10.214.140.39 192.168.0.1 TLSv1.3 1279 Application Data
2471 09:48:48,868222 10.214.140.39 192.168.0.1 TCP 1279 [TCP Retransmission] 56480 → 443 [PSH, ACK]
2472 09:48:49,076276 10.214.140.39 192.168.0.1 TCP 1452 [TCP Retransmission] 56480 → 443 [ACK]
... lots of retransmissions...
I see no TCP FIN packet from the Kubernetes API server (which would mean the server wants to close the connection). Note that, judging by the timestamps, the connection had been idle for about seven minutes between the last successful exchange and the failed request.
After the restart (about two minutes of retransmissions, then the reboot), my pod can establish a connection to the API server right away, so the API server itself doesn’t seem to be overloaded.
The same app runs without any issues on my local Minikube cluster (though that has only one node, so it’s not really representative).
How can I investigate the issue further? Can it be caused by the client side (by my pod or by the Python client)? Is there any special setting I should change on AKS or on my client side to avoid this? Does it look like a ‘server bug’ or a ‘network issue’?