From: Etienne Dechamps Date: Sun, 17 May 2015 17:50:11 +0000 (+0100) Subject: Proactively restart the SPTPS tunnel if we get receive errors. X-Git-Tag: release-1.1pre12~154^2 X-Git-Url: https://www.tinc-vpn.org/git/browse?a=commitdiff_plain;h=1a7a9078c093f77950192c32be009bbe463fe372;hp=1a7a9078c093f77950192c32be009bbe463fe372;p=tinc Proactively restart the SPTPS tunnel if we get receive errors. There are a number of ways a SPTPS tunnel can get into a corrupt state. For example, during key regeneration, the KEX and SIG messages from other nodes might arrive out of order, which confuses the hell out of the SPTPS code. Another possible scenario is not noticing another node crashed and restarted because there was no point in time where the node was seen completely disconnected from *all* nodes; this could result in using the wrong (old) key. There are probably other scenarios which have not even been considered yet. Distributed systems are hard. When SPTPS got confused by a packet, it used to crash the entire process; fortunately that was fixed by commit 2e7f68ad2b51648b89c4b5c61aeb4cec67c2fbbb. However, the error handling (or lack thereof) leaves a lot to be desired. Currently, when SPTPS encounters an error when receiving a packet, it just shrugs it off and continues as if nothing happened. The problem is, sometimes getting receive errors mean the tunnel is completely stuck and will not recover on its own. In that case, the node will become unreachable - possibly indefinitely. The goal of this commit is to improve SPTPS error handling by taking proactive action when an incoming packet triggers a failure, which is often an indicator that the tunnel is stuck in some way. When that happens, we simply restart SPTPS entirely, which should make the tunnel recover quickly. To prevent "storms" where two buggy nodes flood each other with invalid packets and therefore spend all their time negotiating new tunnels, we limit the frequency at which tunnel restarts happen to ten seconds. It is likely this commit will solve the "Invalid KEX record length during key regeneration" issue that has been seen in the wild. It is difficult to be sure though because we do not have a full understanding of all the possible conditions that can trigger this problem. ---