"Invalid KEX record length" during SPTPS key regeneration and related issues

Sat May 16 17:53:33 CEST 2015

Hi,

I'm currently trying to troubleshoot what appears to be a very subtle
bug (most likely a race condition) in SPTPS that causes state to
become corrupted during SPTPS key regeneration.

The tinc version currently deployed to my production nodes is git
7ac5263, which is somewhat old (2014-09-06), but I think this is still
relevant because the affected code paths haven't really changed since.
The only difference is that since 2e7f68a this will not make the
entire tinc process crash, but it will most likely still make nodes
unreachable. The reason why I'm not testing with a newer version is
because it happens quite rarely (say, once per week and not on all
nodes) and is very difficult to reproduce, therefore I don't have a
lot of flexibility here. I have to jump in with what I got.

Anyway, the main symptom is that tinc *sometimes* (rarely) crashes
with the error message "Invalid KEX record length" during its hourly
SPTPS key regeneration. I managed to catch it in a long-running debug
tincd running under gdb. Here's the complete log, with some gdb
investigation at the end: http://pastebin.com/H3qCCAxy

Here's where I got so far in my investigation. Apparently, the reason
why tinc is complaining about the invalid length seems to be that tinc
is expecting to receive a KEX message (65 bytes) but is actually
receiving a SIG message (64 bytes). Looking at the log, I managed to
reconstruct the following sequence of events:

* -> KEY_CHANGED
ake_kobol -> KEX
sharuruzure_izure -> KEX
sharuruzure_sandy -> KEX
ake_kobol <- KEX
ake_kobol -> SIG
ake_kobol <- SIG
ake_kobol -> ACK
sharuruzure_sandy <- KEX
sharuruzure_sandy -> SIG
sharuruzure_sandy <- SIG
sharuruzure_sandy -> ACK
sharuruzure_izure <- SIG = CRASH (expected KEX)

Unfortunately, due to unrelated issues, I don't have any logs from
sharuruzure_izure that I can study.

I believe there is a design flaw in the way SPTPS key regeneration
works, because upon reception of the KEX message the other nodes will
send both KEX and SIG messages at the same time. However, the node
expects SIG to arrive after KEX. Therefore, there is an implicit
assumption that messages won't arrive out of order. tinc makes no such
guarantee, even over TCP metaconnections, because there is no
guarantee the two messages will travel along the same path (consider
the case where there is a change in the graph while the KEX and SIG
messages are traveling). In fact, messages can even be lost if a node
responsible for forwarding them crashed before being able to do so.

This is not so much of an issue for initial SPTPS negotiation because
the handshake is restarted after a 10-second timeout, but there is no
such timeout for key regeneration, which means that if there is some
kind of problem (lost or out-of-order messages), things will get stuck
and the node will become unreachable until the next key regeneration
(1 hour by default!). The legacy protocol doesn't have that problem
because KEY_CHANGED is a broadcast message - meaning it can't really
get lost.

It is not clear to me whether the issue shown in the log above is
actually caused by a KEX message from sharuruzure_izure being lost or
reordered. I don't see why that would have happened, because there was
no change in the graph while that was happening.

Another possibility is that maybe sharuruzure_izure's state was
corrupted all along and it was actually stuck in SPTPS_KEX state
(instead of SPTPS_SECONDARY_KEX which is the normal state for an
established tunnel). If it was indeed stuck in that state, then
instead of sending back KEX and SIG messages, it would only have sent
SIG, resulting in the crash. However it is not clear to me which
sequence of events could result in a node getting stuck in SPTPS_KEX
for a long period of time.

I believe there is yet another, more benign issue with key
regeneration as well: during the short window of time where it
happens, the tunnel is unusable, and packets get lost. Key
regeneration takes 2 round trips over the network, which can easily
result in 300+ ms outages on high latency links.

With these issues in mind, I wonder if it's really worth trying to
patch the current key regeneration protocol - maybe we should simply
come up with a new one. How about simply terminating the current SPTPS
channel and creating a new one? That would remove the need for a key
regeneration protocol altogether, since it's just creating and
terminating SPTPS channels. In fact, req_key_ext_h(REQ_KEY) is already
smart enough to restart SPTPS if there's already a channel.
Furthermore, if we allow the old and new channels to overlap for a
short period of time, we can prevent packet loss during regeneration.