Packet loss with LocalDiscovery

Etienne Dechamps etienne at edechamps.fr
Sat Jul 20 16:48:15 CEST 2013


Good news: I have found the root cause of the bug and have come up
with a fix.

Surprisingly, this is caused by a bug in hash_function(). Because of a
small mistake in the inner loop, the function only ever uses the first
4 bytes of its input and never looks at the remaining bytes.
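
Here is a minimal sketch of the failure mode (illustrative only, not
tinc's actual code): a word-at-a-time hash whose inner loop keeps
indexing from the start of the buffer instead of the current offset,
so only bytes 0 through 3 ever influence the result.

    #include <stdint.h>
    #include <stdio.h>

    static uint32_t broken_hash(const void *p, size_t len) {
        const uint8_t *q = p;
        uint32_t hash = 0;
        for(size_t off = 0; off < len; off += 4) {
            for(size_t i = 0; i < 4 && i < len; i++)
                hash += (uint32_t)q[i] << (8 * i); /* BUG: should be q[off + i] */
            hash *= 0x9E3779B1u; /* mixing constant chosen arbitrarily for this sketch */
        }
        return hash;
    }

    int main(void) {
        /* Same first 4 bytes, different tails: identical hashes. */
        printf("%08x\n", (unsigned)broken_hash("abcdWXYZ", 8));
        printf("%08x\n", (unsigned)broken_hash("abcd1234", 8));
        return 0;
    }

Both calls print the same value even though the inputs differ after
the fourth byte.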

Incidentally, the node UDP address cache uses this function with a key
of type sockaddr_in. That structure holds the address family and port
number in its first 4 bytes, and the IP address in the bytes that
follow. With the broken hash function, this means that all addresses
with the same address family and port will return the same hash.
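
For reference, here is how those fields are laid out on a typical
Linux system (BSD-derived systems prepend a sin_len byte, but the
conclusion is the same; Windows, where I observed the bug, uses the
same leading fields as Linux):

    #include <netinet/in.h>
    #include <stddef.h>
    #include <stdio.h>

    int main(void) {
        /* On Linux this prints 0, 2 and 4: the family and port occupy
           the first 4 bytes, and the IPv4 address only starts at
           offset 4, which is exactly the bytes the broken hash
           function never reads. */
        printf("sin_family offset: %zu\n", offsetof(struct sockaddr_in, sin_family));
        printf("sin_port   offset: %zu\n", offsetof(struct sockaddr_in, sin_port));
        printf("sin_addr   offset: %zu\n", offsetof(struct sockaddr_in, sin_addr));
        return 0;
    }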

This is demonstrated by the following test code:
https://github.com/dechamps/tinc/compare/1828908...hashtest

That code outputs the following:
     Hash test: 10.172.1.1:656 (10.172.1.1 port 656) -> 144
     Hash test: 10.172.1.5:656 (10.172.1.5 port 656) -> 144
     Hash test: 94.23.24.223:656 (94.23.24.223 port 656) -> 144
     Hash test: 127.0.0.1:656 (127.0.0.1 port 656) -> 144
     Hash test: 1.2.3.4:656 (1.2.3.4 port 656) -> 144
     Hash test: 10.172.1.1:655 (10.172.1.1 port 655) -> 104
     Hash test: 10.172.1.5:655 (10.172.1.5 port 655) -> 104
     Hash test: 94.23.24.223:655 (94.23.24.223 port 655) -> 104
     Hash test: 127.0.0.1:655 (127.0.0.1 port 655) -> 104
     Hash test: 1.2.3.4:655 (1.2.3.4 port 655) -> 104
     Hash test: 10.172.1.1:42 (10.172.1.1 port 42) -> 80
     Hash test: 10.172.1.5:42 (10.172.1.5 port 42) -> 80
     Hash test: 94.23.24.223:42 (94.23.24.223 port 42) -> 80
     Hash test: 127.0.0.1:42 (127.0.0.1 port 42) -> 80
     Hash test: 1.2.3.4:42 (1.2.3.4 port 42) -> 80

To make things worse, tinc's hash table does not handle collisions at
all: two keys with the same hash value simply overwrite each other.
This means that if two nodes happen to use the same port number, they
cannot both appear in the node UDP address cache at the same time.
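
The problem has roughly this shape (a minimal sketch of a
direct-mapped, one-entry-per-bucket cache; the names are illustrative
and the bucket mapping is simplified to a plain modulo, so this is
not tinc's actual code):

    #include <assert.h>
    #include <stdint.h>
    #include <string.h>

    #define N_BUCKETS 256

    struct bucket {
        uint8_t key[32]; /* e.g. a serialized socket address */
        size_t keylen;   /* 0 means the bucket is empty */
        void *value;     /* e.g. a pointer to a node structure */
    };

    static struct bucket cache[N_BUCKETS];

    static void cache_insert(uint32_t hash, const void *key, size_t keylen, void *value) {
        struct bucket *b = &cache[hash % N_BUCKETS];
        assert(keylen > 0 && keylen <= sizeof(b->key));
        /* No chaining, no probing: whatever was stored here before is
           silently evicted. With the broken hash, every address that
           shares a family and port lands in the same bucket. */
        memcpy(b->key, key, keylen);
        b->keylen = keylen;
        b->value = value;
    }

    static void *cache_lookup(uint32_t hash, const void *key, size_t keylen) {
        struct bucket *b = &cache[hash % N_BUCKETS];
        if(b->keylen == keylen && !memcmp(b->key, key, keylen))
            return b->value;
        return NULL; /* miss: the caller falls back to a slow path */
    }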

Unfortunately, that is exactly the situation I had put myself in:
- Node "kobol" sits somewhere on the Internet, IP address 94.23.24.223
- Node "zyklos" is on my LAN, local IP address 10.172.1.1
- Node "firefly" is on my LAN, local IP address 10.172.1.5

All these nodes use port 656. Now, here's what happens when 
LocalDiscovery comes into play:

1. LocalDiscovery succeeds, zyklos and firefly are communicating
locally, and their UDP address caches contain each other's local
addresses:
     [bucket 144 (hash 2148532224)] 10.172.1.5 port 656 -> firefly

2. At some point, kobol will attempt to communicate with one of the two 
nodes. This could happen because they need to negotiate PMTU again, 
which happens every PingInterval seconds.

3. If communication succeeds over UDP, kobol's address will overwrite
the existing contents of the node's UDP address cache:
     [bucket 144 (hash 2148532224)] 94.23.24.223 port 656 -> kobol

4. As a result, when the node receives data from the other local node, 
it will not find its address in the UDP address cache and will have to 
fall back to try_harder() to find out from which node it just received data.

5. If one of the nodes starts talking simultaneously with kobol and the
other local node (e.g. in my test, the two local nodes are pinging each
other while kobol is renegotiating PMTU at the same time), then the two
addresses will constantly overwrite each other in the cache, and
try_harder() will be called several times per second.

6. Unfortunately, try_harder() contains what seems to be a "safety"
(oracle protection) measure that makes it refuse to "try harder" more
than once per second, dropping packets instead; a sketch of that rate
limit follows this list.

In summary: when a node is communicating with more than one other node
simultaneously, the UDP address cache is constantly being overwritten,
try_harder() is called many times per second, and as a result
throughput is limited to at most one packet per second, which is of
course completely impractical.

Here's the fix: https://github.com/gsliepen/tinc/pull/5

On 15/07/2013 19:45, Etienne Dechamps wrote:
> Hi,
>
> I believe I have found a bug with regard to the LocalDiscovery feature.
> This is on tinc-1.1pre7 between two Windows nodes.
>
> Steps to reproduce:
> - Get two nodes talking using LocalDiscovery (e.g. put them on the same
> LAN behind a NAT with no metaconnection to each other)
> - Make one ping the other.
>
> Expected result:
> - The two nodes should ping each other without any packet loss,
> hopefully at LAN latencies.
>
> Actual result:
> - I'm experiencing packet loss every (PingInterval) seconds. Each packet
> loss "episode" lasts roughly 1 second and during that time all packets
> are lost. Apparently it happens each time the nodes are exchanging PMTU
> probes. Packet loss correlates with "Sending/Got MTU probe" messages. It
> also materializes in the form of "Received UDP packet from unknown
> source <LAN Address>" messages.
> - There seems to be some "flapping" with regard to the local host
> discovery itself, meaning that it sometimes reverts to the "normal" mode
> of communication for a brief time for no reason. This can be seen as
> elevated latencies.
>
> For aggressive PingInterval values (e.g. 3 seconds) this makes the link
> between the two nodes basically unusable with 30%+ packet loss.
>
> My hypothesis is that during PMTU discovery tinc "forgets" about the
> other node's locally discovered address, which results in packet loss
> because it doesn't recognize packets coming from the local address
> anymore and makes it revert to "classic mode" for a brief time. Then
> after a moment local discovery kicks in again and fixes the situation,
> until PMTU discovery happens again, and so on.
>
> I will continue investigating and try to come up with a fix.


-- 
Etienne Dechamps

