subnet flooded with lots of ADD_EDGE requests

Tue Dec 11 07:36:18 CET 2018

Hello,
  We're suffering from sporadic network blockage (read: unable to ping
other nodes) with 1.1-pre17.  Before we upgraded to the 1.1-pre release,
the same network blockage also manifested itself in a pure 1.0.33
network.

  The log shows a lot of "Got ADD_EDGE from nodeX
(192.168.0.1 port 655) which does not match existing entry" messages,
and it turns out that the mismatches were caused by different weights
received by add_edge_h().
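
  For what it's worth, here is a minimal sketch of how these messages
could be tallied per peer.  It assumes the log lines contain the message
text exactly as quoted above (any timestamp or process prefix is
ignored); count_mismatches is only an illustrative helper name, not
anything provided by tinc.  Note that, as far as I understand, the name
in the message is the peer the request arrived from, so a high count may
point at a forwarding hub rather than at the node that originated the
edge.

```python
import re
from collections import Counter

# Pattern built from the message text quoted in this mail. Anything that
# precedes it on the line (timestamp, "tinc[pid]:" prefix) is an assumption
# about the log format and is simply skipped by using search().
MISMATCH_RE = re.compile(
    r"Got ADD_EDGE from (?P<node>\S+) "
    r"\((?P<addr>\S+) port (?P<port>\d+)\) "
    r"which does not match existing entry"
)


def count_mismatches(lines):
    """Count mismatch messages per peer name.

    `lines` is any iterable of log lines (a list, or an open file object).
    Returns a Counter mapping peer name -> number of matching lines.
    """
    counts = Counter()
    for line in lines:
        match = MISMATCH_RE.search(line)
        if match:
            counts[match.group("node")] += 1
    return counts
```

  Feeding it an open log file and printing counts.most_common() would
list the peers that relay the most mismatching ADD_EDGE requests first.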

  This network consists of ~4 hub nodes and 50+ leaf nodes.  Sample
hub config:
  Name = hub1
  ConnectTo = hub2
  ConnectTo = hub3
  ConnectTo = hub4

  A leaf config looks like:
  Name = node1
  ConnectTo = hub1
  ConnectTo = hub2
  ConnectTo = hub3
  ConnectTo = hub4

  Back in the days of pure 1.0.33 nodes, if the network suddenly
failed (users would see tincd CPU usage go to 50%+ and get no ping
response from the other nodes), we could simply shut down the hub nodes,
wait for a few minutes and then restart them to get the network back to
normal.  However, the 1.1-pre release seems to autoconnect to non-hub
hosts based on the information found in /etc/tinc/hosts, which means
that the hub-restarting trick won't work.  Additionally, apart from
high CPU usage, 1.1-pre tincd also starts hogging memory until the
Linux OOM killer kills the process (a memory leak, perhaps?).

  Many of our leaf nodes are behind NAT, so there's no direct
connection to them except through the tinc tunnel.  Given that, I'm
wondering if there's any way to bring the network back to work without
shutting down all nodes.  Moreover, is there any better way to pinpoint
the offending nodes that introduced this symptom?

Thanks,
A.