subnet flooded with lots of ADD_EDGE requests

Amit Lianson lthmod at gmail.com
Tue Jan 8 05:00:19 CET 2019


On Wed, Dec 19, 2018 at 12:05 AM Guus Sliepen <guus at tinc-vpn.org> wrote:
>
> On Tue, Dec 11, 2018 at 02:36:18PM +0800, Amit Lianson wrote:
>
> >   We're suffering from sporadic network blockage (read: being unable
> > to ping other nodes) with 1.1-pre17.  Before upgrading to the 1.1-pre
> > release, the same blockage also manifested itself in a pure 1.0.33
> > network.
> >
> >   The log shows a lot of "Got ADD_EDGE from nodeX (192.168.0.1 port
> > 655) which does not match existing entry" messages, and it turns out
> > that the mismatches were caused by different edge weights received by
> > add_edge_h().
> >
> >   This network consists of ~4 hub nodes and 50+ leaf nodes.  Sample
> > hub config:
> [...]
>
> Could you send me the output of "tincctl -n <netname> dump graph"? That
> would help me try to reproduce the problem. Also, when the issue
> occurs, could you run "tincctl -n <netname> log 5 >logfile" on the node
> that gives those "Got ADD_EDGE which does not match existing entry"
> messages, let it run for a few seconds before stopping the logging,
> and send me the resulting logfile?

  Ouch, I've already downgraded most of our nodes to the 1.0.35 release,
since it 'only' hogs the CPU when something goes wrong.

  Is there any 'tincctl dump graph' alternative in 1.0.35?  Would a
command like 'killall -INT tincd; killall -USR2 tincd' provide enough
debugging information?
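
  Concretely, I had something like this in mind, based on the signal
handling described in the tinc 1.0 manual (the syslog path below is an
assumption; adjust it for your distribution):

    killall -INT tincd     # temporarily raise the debug level to 5
    killall -USR2 tincd    # dump known nodes, edges and subnets to syslog
    sleep 10               # give the ADD_EDGE flood time to reach the log
    killall -INT tincd     # send INT again to restore the old debug level
    grep tincd /var/log/syslog > tinc-debug.log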

> >   Back in the days of pure 1.0.33 nodes, if the network suddenly
> > failed (users would see tincd CPU usage go above 50% and get no ping
> > responses from the other nodes), we could simply shut down the hub
> > nodes, wait a few minutes and then restart them to get the network
> > back to normal; however, the 1.1-pre release seems to autoconnect
> > to non-hub hosts based on the information found in /etc/tinc/hosts,
> > which means that the hub-restarting trick won't work.  Additionally,
> > apart from the high CPU usage, 1.1-pre tincd also starts hogging
> > memory until the Linux OOM killer kills the process (a memory leak,
> > perhaps?).
>
> You can disable the autoconnect feature by adding "AutoConnect = no" to
> tinc.conf; unfortunately, you'd have to do that on all nodes. And it

  Thanks for the hints.  BTW, it turns out that the 1.1-pre release
automatically keeps a copy of the received Ed25519 key in
/etc/tinc/$NET/hosts.  Would disabling AutoConnect also disable this
key-caching feature?

  I'm curious about this because some of our deployments have a
read-only /etc/tinc/$NET/hosts, which could be a problem if key
write-back is mandatory in the 1.1-pre release.
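
  For reference, this is roughly what I expect a leaf node's tinc.conf
to look like with your suggestion applied (the Name and ConnectTo
values are placeholders for illustration; only the AutoConnect line
comes from your suggestion):

    Name = leaf23          # placeholder node name
    ConnectTo = hub1       # keep explicit meta-connections to the hubs
    ConnectTo = hub2
    AutoConnect = no       # disable automatic meta-connections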

> doesn't solve the actual problem. If it's hogging memory, that
> definitely points to a memory leak.
>
> >   Given that many of our leaf nodes are behind NAT, with no direct
> > connection to them except the tinc tunnel, I'm wondering whether
> > there's any way to bring the network back up without shutting down
> > all nodes?  Moreover, is there any better way to pinpoint the
> > offending nodes that introduced this symptom?
>
> I hope the output from the "log 5" command will shed some more light on
> the issue, as it will show which nodes the offending ADD_EDGE messages
> belong to.
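
  Once I get a chance to run 1.1-pre again, I'll capture the log and
tally the offenders with something like this (a sketch; it assumes the
log lines carry no timestamp prefix, otherwise strip the prefix before
counting):

    tincctl -n $NET log 5 > logfile &
    LOGPID=$!
    sleep 15                     # let the ADD_EDGE storm reach the log
    kill $LOGPID                 # stop capturing
    grep 'Got ADD_EDGE from' logfile | sort | uniq -c | sort -rn | head
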
>
> --
> Met vriendelijke groet / with kind regards,
>      Guus Sliepen <guus at tinc-vpn.org>

