Self-DoS

Guus Sliepen guus at tinc-vpn.org
Thu Dec 31 16:01:18 CET 2015


On Wed, Dec 30, 2015 at 05:26:38PM +0000, Pierre Beck wrote:

> I have successfully connected a network of about 60 nodes (many of which are virtual machines) with tinc 1.0 but encounter a severe bug when physical connectivity between two major locations is lost and then restored. From what I gathered, many nodes attempt to connect to many other nodes, causing 100% CPU load on all nodes and taking down the whole network, with no node succeeding in connecting to any other node. It seems unable to recover from this state. Luckily I can shut down and restart most daemons with a few keystrokes, but I have to shut down all of them, then start them sequentially with a delay, or this "perfect storm" starts all over again.

60 nodes is not a large number for tinc; it should normally handle this
without problems (even on underpowered hardware, like routers running
OpenWRT).

> The overall configuration is switch mode, with mixed IPv4 and IPv6 host addressing. Otherwise config is empty with these tweaks added to attempt mitigating the issue (with no success):
> 
> PingTimeout=15
> UDPRcvBuf=8388608
> UDPSndBuf=8388608
> ProcessPriority=high

Of these options, only PingTimeout is likely to help. Increasing
PingTimeout may indeed prevent tinc from prematurely closing
connections in case of congestion.

> Also, I have tried firewalling the incoming UDP traffic on most nodes, forcing TCP for those connections, to narrow down the problem, but it doesn't seem to change anything.

If anything, that is counterproductive.

> At event time, the logs have these:
> tincd[1093]: Flushing meta data to server1084 (x.x.x.x port y) failed: Connection reset by peer
> tincd[1093]: Flushing meta data to server1070 (x.x.x.x port y) failed: Connection reset by peer
> tincd[1093]: Flushing meta data to server1052 (x.x.x.x port y) failed: Connection reset by peer
> tincd[1093]: Flushing meta data to server1071 (x.x.x.x port y) failed: Connection reset by peer
> 
> And these:
> tincd[1093]: Metadata socket read error for server1076 (x:x:x:x:x:x:x:x port y): Connection reset by peer
> tincd[1093]: Metadata socket read error for <unknown> (x.x.x.x port y): Connection reset by peer

That means the peer decided to close the connection, not the local node,
so the logs on those peers might provide more information about why they
closed the connection.

In any case, what is the topology of your meta connections (i.e., the ones
you specify using ConnectTo)? If, on each node, you ConnectTo all other
nodes, that will cause tinc to generate a lot of metadata. However, you
don't need to do that; a few ConnectTo statements are usually enough.
If you have a few central nodes to which all other nodes ConnectTo, that
should work fine as well.
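
As a rough sketch of the central-node layout, the tinc.conf on an ordinary
(non-central) node might look something like the fragment below. The names
hub1 and hub2 are hypothetical placeholders for two central nodes, not
names taken from your network, and the PingTimeout value is simply the one
you quoted.

```
# tinc.conf on an ordinary node (sketch; hub1 and hub2 are hypothetical names)
Mode = switch

# Meta connections to a few central nodes only, not to every other node.
ConnectTo = hub1
ConnectTo = hub2

# Value carried over from the configuration quoted above.
PingTimeout = 15
```

Each name given to ConnectTo needs a matching host configuration file in the
hosts/ directory, with an Address line so that tinc knows where to connect.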

-- 
Met vriendelijke groet / with kind regards,
     Guus Sliepen <guus at tinc-vpn.org>