ERROR Unexpected exception causing shutdown while sock still open (org.apache.zookeeper.server.quorum.LearnerHandler) – ZooKeeper cluster failure

Our ZooKeeper members fail often enough to hurt the business, each time with “ERROR Unexpected exception causing shutdown while sock still open (org.apache.zookeeper.server.quorum.LearnerHandler)”. I think it starts on one member, which then breaks the whole ZooKeeper cluster and, after that, the brokers. https://issues.apache.org/jira/browse/ZOOKEEPER-3036 looks related. Why is a ZooKeeper version with this issue included in the release? Will upcoming ZooKeeper releases fix it, and is there anything I can do about it?

Logs:

ZK-0 - [2020-07-03 03:40:28,552] WARN Exception when following the leader (org.apache.zookeeper.server.quorum.Learner)
java.net.SocketTimeoutException: Read timed out


ZK-1 - [2020-07-02 10:54:17,681] WARN Unable to read additional data from client sessionid 0x200887c07d10004, likely client has closed socket (org.apache.zookeeper.server.NIOServerCnxn)
[2020-07-03 03:40:45,745] INFO Expiring session 0x200887c07d10004, timeout of 6000ms exceeded (org.apache.zookeeper.server.ZooKeeperServer)
[2020-07-03 03:40:45,745] INFO Submitting global closeSession request for session 0x200887c07d10004 (org.apache.zookeeper.server.ZooKeeperServer)
[2020-07-03 03:40:45,745] INFO Expiring session 0x300887c2cee0002, timeout of 6000ms exceeded (org.apache.zookeeper.server.ZooKeeperServer)
[2020-07-03 03:40:45,745] INFO Submitting global closeSession request for session 0x300887c2cee0002 (org.apache.zookeeper.server.ZooKeeperServer)
[2020-07-03 03:40:45,745] INFO Expiring session 0x300887c2cee0001, timeout of 6000ms exceeded (org.apache.zookeeper.server.ZooKeeperServer)
[2020-07-03 03:40:45,745] INFO Submitting global closeSession request for session 0x300887c2cee0001 (org.apache.zookeeper.server.ZooKeeperServer)
### I think this is where everything starts
[2020-07-03 03:40:45,752] ERROR Unexpected exception causing shutdown while sock still open (org.apache.zookeeper.server.quorum.LearnerHandler)
java.net.SocketTimeoutException: Read timed out
[2020-07-03 03:40:45,765] WARN ******* GOODBYE /10.233.106.50:54428 ******** (org.apache.zookeeper.server.quorum.LearnerHandler)
[2020-07-03 03:40:45,765] WARN Unexpected exception at LearnerHandler Socket[addr=/10.233.113.68,port=36912,localport=2888] tickOfNextAckDeadline:422901 synced?:true queuedPacketLength:2 (org.apache.zookeeper.server.quorum.LearnerHandler)
java.net.SocketException: Broken pipe (Write failed)
[2020-07-03 03:40:45,762] ERROR Unexpected exception causing shutdown while sock still open (org.apache.zookeeper.server.quorum.LearnerHandler)
java.net.SocketException: Connection reset


ZK-2 - [2020-07-03 03:41:15,203] ERROR Unexpected exception causing shutdown while sock still open (org.apache.zookeeper.server.quorum.LearnerHandler)
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)


BR-0 - [2020-07-03 03:40:37,481] ERROR Uncaught exception in scheduled task 'isr-expiration' (kafka.utils.KafkaScheduler)


BR-1 - [2020-07-03 03:40:37,481] ERROR Uncaught exception in scheduled task 'isr-expiration' (kafka.utils.KafkaScheduler)


BR-2 - [2020-07-03 03:40:43,411] ERROR Uncaught exception in scheduled task 'isr-expiration' (kafka.utils.KafkaScheduler)
kafka.zookeeper.ZooKeeperClientExpiredException: Session expired either before or while waiting for connection
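
To check where things actually start, the excerpts above can be merged into a single timeline ordered by timestamp. This is only a minimal sketch: it assumes the node clocks are roughly in sync and that every log entry begins with the `[yyyy-MM-dd HH:mm:ss,SSS] LEVEL` prefix seen above. `parse_line`, `build_timeline` and `SAMPLE_LOGS` are names made up for this illustration, and the sample holds just one quoted line per node.

```python
import re
from datetime import datetime

# Matches the "[2020-07-03 03:40:45,752] LEVEL message" prefix used in the logs above.
LINE_RE = re.compile(
    r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\]\s+(\w+)\s+(.*)"
)


def parse_line(node, line):
    """Return (timestamp, node, level, message), or None if there is no timestamp."""
    match = LINE_RE.search(line)
    if match is None:
        return None  # e.g. exception names and stack-trace continuation lines
    ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
    return ts, node, match.group(2), match.group(3)


def build_timeline(logs_by_node, levels=("WARN", "ERROR")):
    """Merge per-node log lines into one list sorted by timestamp."""
    events = []
    for node, lines in logs_by_node.items():
        for line in lines:
            event = parse_line(node, line)
            if event is not None and event[2] in levels:
                events.append(event)
    return sorted(events, key=lambda e: e[0])


# Hypothetical sample: one line per node, copied from the excerpts above.
SAMPLE_LOGS = {
    "ZK-1": [
        "[2020-07-03 03:40:45,752] ERROR Unexpected exception causing shutdown "
        "while sock still open (org.apache.zookeeper.server.quorum.LearnerHandler)",
        "java.net.SocketTimeoutException: Read timed out",
    ],
    "ZK-0": [
        "[2020-07-03 03:40:28,552] WARN Exception when following the leader "
        "(org.apache.zookeeper.server.quorum.Learner)",
    ],
    "BR-0": [
        "[2020-07-03 03:40:37,481] ERROR Uncaught exception in scheduled task "
        "'isr-expiration' (kafka.utils.KafkaScheduler)",
    ],
}

for ts, node, level, message in build_timeline(SAMPLE_LOGS):
    print(ts.strftime("%H:%M:%S.%f")[:-3], node, level, message[:60])
```

On the lines quoted above, this sorts the ZK-0 warning (03:40:28) and the broker errors (03:40:37) ahead of the first ZK-1 LearnerHandler error (03:40:45), so the trouble may begin earlier than the spot marked in the ZK-1 log.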

Kafka and ZooKeeper image versions:

kubectl describe pods -n kafka-prod | grep -i image |  grep -i zookeeper | tail -1
  Normal  Pulled     30m   kubelet, prod-k8s-w1  Container image "confluentinc/cp-zookeeper:5.4.1" already present on machine
kubectl describe pods -n kafka-prod | grep -i image |  grep -i cp-enterprise-kafka | tail -1
  Normal  Pulled     29m   kubelet, prod-k8s-w1  Container image "confluentinc/cp-enterprise-kafka:5.4.1" already present on machine

Helm chart version:

helm list confluent-prod
NAME            REVISION        UPDATED                         STATUS          CHART               APP VERSION     NAMESPACE
confluent-prod  1               Tue May  5 16:53:20 2020        DEPLOYED        cp-helm-charts-0.4.11.0             kafka-prod

Kubernetes version:

kubectl version
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:11:31Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.3", GitCommit:"5e53fd6bc17c0dec8434817e69b04a25d8ae0ff0", GitTreeState:"clean", BuildDate:"2019-06-06T01:36:19Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}

Author: anVzdGFub3RoZXJodW1hbg