Segfaults on connection loss

Tuomas Silen tuomas at silen.fi
Mon Jul 14 11:20:45 CEST 2014


Hi all,

Just wanted to update my status on this. It has recently happened a few times (with around 20 nodes)
that a network issue affecting one or two nodes has taken the entire VPN down: the nodes keep
reconnecting rapidly, then segfault (and restart), and end up in a state where all of the tincds
segfault and restart continuously and the VPN never comes back up. Essentially, a segfault on one
node triggers a segfault on another. At that point the only way to recover has been to prevent enough
tincds from restarting after the segfault and then slowly bring them back up once the rest of the VPN
has settled.

When that happens, the logs are full of "Flushing x bytes to y would block" messages. Unfortunately the
logging level was only 2, so there's not much additional information available.

We also have both IPv4 and IPv6 addresses for the nodes, in case that's relevant.

Note that this is still on tinc 1.0.23; I'm not sure if there are any relevant changes in 1.0.24, but
I'll try upgrading to it anyway.

This is what's in the logs:
2014-07-14T04:56:54.79824 Flushing 585820 bytes to be06 (10.9.10.60 port 655) would block
2014-07-14T04:56:54.84673 Flushing 598125 bytes to be06 (10.9.10.60 port 655) would block
2014-07-14T04:56:54.91778 Flushing 636047 bytes to be06 (10.9.10.60 port 655) would block
2014-07-14T04:56:55.02065 be06 (10.9.10.60 port 655) could not flush for 3 seconds (732141 bytes 
remaining)
2014-07-14T04:56:55.02066 Closing connection with be06 (10.9.10.60 port 655)
2014-07-14T04:56:55.02094 Trying to connect to be06 (fc00:9:10::60 port 655)
2014-07-14T04:56:55.03630 Connected to be06 (fc00:9:10::60 port 655)
...
2014-07-14T04:56:56.14377 Trying to connect to fe03 (fc00:9:11::2 port 655)
2014-07-14T04:56:56.16306 Flushing 138968 bytes to fe01 (10.9.11.146 port 52052) would block
2014-07-14T04:56:56.16708 Flushing 177911 bytes to be03 (fc00:9:10::30 port 655) would block
2014-07-14T04:56:56.18829 Flushing 476742 bytes to fe06 (10.9.12.203 port 44588) would block
2014-07-14T04:56:56.20198 Flushing 188423 bytes to be03 (fc00:9:10::30 port 655) would block
2014-07-14T04:56:56.20940 Flushing 153377 bytes to fe01 (10.9.11.146 port 52052) would block
2014-07-14T04:56:56.22510 Metadata socket read error for be06 (fc00:9:10::60 port 655): Connection 
reset by peer <- *Segfault on be06*
2014-07-14T04:56:56.22512 Closing connection with be06 (fc00:9:10::60 port 655)
*Segfault on this node*

The gdb output for that one looks like this:

Program terminated with signal 11, Segmentation fault.
#0  edge_del (e=0xa2ef90) at edge.c:93
93                      e->reverse->reverse = NULL;
(gdb) bt
#0  edge_del (e=0xa2ef90) at edge.c:93
#1  0x0000000000408a65 in terminate_connection (report=false, c=0x921030) at net.c:188
#2  terminate_connection (c=0x921030, report=false) at net.c:168
#3  0x00000000004090d3 in check_network_activity (writeset=0x7fff443e1f00, readset=0x7fff443e1e80) 
at net.c:342
#4  main_loop () at net.c:434
#5  0x000000000040478a in main (argc=<optimized out>, argv=<optimized out>) at tincd.c:656
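
To make the backtrace easier to follow, here is a rough sketch of the pattern that seems to be failing.
This is my own simplified reconstruction, not copied from tinc's edge.c; only edge_del and the reverse
pointer come from the trace, the rest is made up for illustration. The struct dump below shows that
e->reverse is non-NULL but bogus, so the store at edge.c:93 faults:

/* Simplified reconstruction (assumption, not tinc source): every edge has a
 * paired "reverse" edge for the opposite direction, and edge_del() unlinks
 * the pair before freeing. In the dump e->reverse is non-NULL but points at
 * garbage, so writing to e->reverse->reverse faults - as if the edge (or its
 * reverse) had already been freed and its memory reused. */
struct edge {
    struct edge *reverse;   /* matching edge in the opposite direction */
    /* ... from/to nodes, address, weight, connection, ... */
};

static void edge_del_sketch(struct edge *e) {
    if (e->reverse)                     /* passes: the pointer is garbage, not NULL */
        e->reverse->reverse = NULL;     /* <- SIGSEGV here (edge.c:93 in 1.0.23) */
    /* ... remove e from the edge trees and free it ... */
}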

*e = {from = 0xa72f80, to = 0x53c36317, address = {...}, in = {...}, in6 = {...}, options = 
1646294117, weight = 2036688745, connection = 0x36372e3434312065, reverse = 0x3636312e3635312e}
e.connection = (struct connection_t *) 0x36372e3434312065
(gdb) p *e.connection
Cannot access memory at address 0x36372e3434312065

e.reverse = (struct edge_t *) 0x3636312e3635312e
(gdb) p *e.reverse
Cannot access memory at address 0x3636312e3635316e

c.name = "be06"
c.status = {pinged = 0, active = 0, connecting = 0, unused_termreq = 0, remove = 1, timeout = 0, 
encryptout = 0, decryptin = 0, mst = 0, unused = 0}
c.last_ping_time = 1405313815
c.last_flushed_time = 1405313815
c.node.name = "be06"

(gdb) p *c.edge.to
Cannot access memory at address 0x53c36317

c.edge.from = {name = 0x3665363836203231 <Address 0x3665363836203231 out of bounds>, options = 
543373622, sock = 1700750694, address = {...}, hostname = 0x20323a3a33613435 <Address 
0x20323a3a33613435 out of bounds>, status = {unused_active = 0, validkey = 1, unused_waitingforkey = 
1, visited = 0, reachable = 1, indirect = 1, unused = 8443092},  nexthop = 0x3939206320353536, 
prevedge = 0x37, via = 0xe8aa4b5274392b04, subnet_tree = 0x21, edge_tree = 0xc51670, connection = 
0x53c36317, sent_seqno = 9796560, received_seqno = 0, farfuture = 33, mtu = 25367, minmtu = 21443, 
maxmtu = 0, mtuprobes = 9796560, mtuevent = 0x41 }

(gdb) p *c.edge.from.nexthop
Cannot access memory at address 0x3939206320353536

(gdb) p *c.edge.from.connection
Cannot access memory at address 0x53c36317
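
Side note: those bogus pointer values decode to ASCII text if you read them as little-endian bytes,
which makes me suspect the freed edge_t memory got reused for string data (fragments of addresses or
log text). A quick throwaway check, with the values hard-coded from the dump above:

#include <stdio.h>
#include <stdint.h>

/* Print a 64-bit pointer value from the core dump as 8 little-endian bytes,
 * showing printable ASCII characters and '.' for everything else. */
static void as_text(const char *label, uint64_t v) {
    printf("%-16s ", label);
    for (int i = 0; i < 8; i++) {
        unsigned char b = (v >> (8 * i)) & 0xff;
        putchar(b >= 0x20 && b < 0x7f ? b : '.');
    }
    putchar('\n');
}

int main(void) {
    as_text("e->reverse",     0x3636312e3635312eULL);  /* ".156.166" */
    as_text("e->connection",  0x36372e3434312065ULL);  /* "e 144.76" */
    as_text("from->hostname", 0x20323a3a33613435ULL);  /* "54a3::2 " */
    as_text("from->nexthop",  0x3939206320353536ULL);  /* "655 c 99" */
    return 0;
}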


By the way, in tinc.conf the ping interval and timeout are pretty low:
PingInterval = 3
PingTimeout = 2

Best regards,

-Tuomas


On 27/06/14 08:32, zorun wrote:
> On Thu, Jun 26, 2014 at 08:33:01PM +0200, Guus Sliepen wrote:
>> On Wed, Jun 25, 2014 at 08:40:08AM +0200, zorun wrote:
>>
>>> I have been able to trigger this segfault reliably, with Tinc 1.0.24:
>>>
>>> 0/ set up your local Tinc node to connect to a remote dual-stacked Tinc
>>>     node (that is, the remote node has both an A and an AAAA record)
>>>
>>> 1/ run Tinc in debug mode, "tincd -n NETNAME -D -d 3"
>>>
>>> 2/ wait until Tinc establishes a connection with the remote server
>>>     (lots of "Got PACKET from REMOTE (XX.XX.XX.XX port 656)" messages)
>>>
>>> 3/ apply an insane amount of delay on your physical interface: "tc
>>>     qdisc add dev eth0 root netem delay 10s"
>>>
>>> 4/ wait for Tinc to timeout on a ping.  It will then try to reconnect,
>>>     and crash:
>>
>> I tried to reproduce it, but I do not get a crash. Running it in
>> valgrind doesn't produce any errors either. One difference I observe
>> with your log:
>>
>>> REMOTE (XX.XX.XX.XX port 656) didn't respond to PING in 5 seconds
>>> Closing connection with REMOTE (XX.XX.XX.XX port 656)
>>> Sending DEL_EDGE to everyone (BROADCAST)
>>> UDP address of REMOTE cleared
>>> UDP address of OTHER_SERVER1 cleared
>>> UDP address of OTHER_SERVER2 cleared
>>> UDP address of OTHER_SERVER3 cleared
>>> UDP address of OTHER_SERVER4 cleared
>>> Sending DEL_EDGE to everyone (BROADCAST)
>>> Trying to connect to REMOTE (2001:db8::1 port 656)
>>> Connected to REMOTE (2001:db8::1 port 656)
>>> Sending ID to REMOTE (2001:db8::1 port 656)
>>> Timeout from REMOTE (2001:db8::1 port 656) during authentication
>>
>> I wonder how you get to this point with the tc qdisc command: if I do
>> that, it never gets to the point where it makes a connection to the other
>> side, because the SYN packets are also delayed by 10 seconds. So I get:
>>
>> Trying to connect to REMOTE (2606:2800:220:6d:26bf:1447:1097:aa7 port 655)
>> Connected to REMOTE (2606:2800:220:6d:26bf:1447:1097:aa7 port 655)
>> Sending ID to REMOTE (2606:2800:220:6d:26bf:1447:1097:aa7 port 655)
>> Timeout from REMOTE (2606:2800:220:6d:26bf:1447:1097:aa7 port 655) during authentication
>
> Isn't that the same output?  Didn't you mean that you *don't* see the
> "Connected to REMOTE" part?
>
> But you are right, I forgot to mention I was using a SOCKS5 proxy on
> localhost, sorry (I have even removed that from the log, it seems :/).
> It makes sense: the TCP connection is established immediately with the
> proxy on localhost, since it is not affected by the delaying qdisc.
>
> You can try this:
>
> - ssh -D 4242 -N remote-host
> - "Proxy = socks5 localhost 4242" in tinc.conf
>
> Or even a simple SSH tunnel, the result is the same.  Assuming your
> ssh client will listen on ::1 and your OS doesn't use IPV6_V6ONLY by
> default:
>
> - "ssh -N -L 6565:localhost:655 remote-host"
> - hardcode the remote address to both "::1 6565" and "127.0.0.1 6565"
>    in the host file
>
>> Another difference is that you have 4 other nodes connected to the VPN;
>> what happens if those are not running when you do this test?
>
> I can test that, yes.

