"Invalid KEX record length" during SPTPS key regeneration and related issues

Sun May 17 20:46:45 CEST 2015

I sent you a pull request that addresses the general issue, at least
for the short term: https://github.com/gsliepen/tinc/pull/83

On 16 May 2015 at 19:36, Guus Sliepen <guus at tinc-vpn.org> wrote:
> On Sat, May 16, 2015 at 04:53:33PM +0100, Etienne Dechamps wrote:
>
>> I believe there is a design flaw in the way SPTPS key regeneration
>> works, because upon reception of the KEX message the other nodes will
>> send both KEX and SIG messages at the same time. However, the node
>> expects SIG to arrive after KEX. Therefore, there is an implicit
>> assumption that messages won't arrive out of order. tinc makes no such
>> guarantee, even over TCP metaconnections, because there is no
>> guarantee the two messages will travel along the same path (consider
>> the case where there is a change in the graph while the KEX and SIG
>> messages are traveling). In fact, messages can even be lost if a node
>> responsible for forwarding them crashed before being able to do so.
>
> You are right. The main issue with the SPTPS datagram protocol is that
> it actually doesn't handle any packet loss or reordering during
> authentication and key regeneration. I will add this, so it will be able
> to run completely over UDP.

Well, actually... in the pull request above I "solved" this problem
simply by doing "if anything weird comes in, just restart the whole
thing". I wonder if this solution could simply be used as the final
long term solution. Indeed, SPTPS already knows how to handle
lost/reordered *data* packets, and handshake packets are only
exchanged rarely (hourly be default), therefore it seems like
restarting the whole thing on a bad handshake is not that expensive.

In fact, I suspect that with this patch, we might be able to let
handshake packets go over UDP and it would run just fine, although I
don't really see why that would be useful in practice.

It's also more reliable since it can potentially address *any* issue
with the SPTPS tunnel, possibly including ones we don't even know of
yet. And the code is much simpler than trying to implement full-blown
packet loss/reorder detection in SPTPS code, I think.

>> This is not so much of an issue for initial SPTPS negotiation because
>> the handshake is restarted after a 10-second timeout, but there is no
>> such timeout for key regeneration,
>
> Indeed, such a timeout should be added.

I don't think the timeout is needed with my patch applied, because if
key regeneration "silently" fails (i.e. the initial KEX packet gets
dropped) then the other node will continue sending packets with the
wrong key, and therefore trigger a SPTPS failure followed by a clean
restart. It would actually recover faster than using a timeout :)

>> The legacy protocol doesn't have that problem because KEY_CHANGED is a
>> broadcast message - meaning it can't really get lost.
>
> Actually, it can just as well, although it is very unlikely to happen
> that a broadcast message can get lost, and even less likely that this
> happens right when a KEY_CHANGED message gets sent.

That's interesting, can you explain how you see a broadcast message
getting lost? The only way I can see that happening is if a complete
network split happens, but then the nodes will notice that and mark
the "other side" as unreachable, and at that point it doesn't really
matter that the message got lost.