Tweaks for high-bandwidth tinc

Brandon Black blblack at gmail.com
Sat Oct 23 18:19:30 CEST 2010


I've been using tinc to do some high-bandwidth VPNs between nodes in
Amazon's EC2 environment (to work around some limitations there for
effectively load-balancing raw TCP connections while preserving the
source addresses, using Linux IPVS in NAT mode).  The raw amount of
traffic involved probably makes this a bit of a corner case for
tinc.  Overall it has held up remarkably well under this
scenario, and I really like the ease of configuration and deployment
versus some of the other UDP tunneling options out there.  Most of the
issues I've run into I've been able to tune or configure my way around
effectively.  To recap the most important of those for posterity
though (or perhaps as FAQ/doc material):

1) In tinc-up, users will want to adjust the txqueuelen of the tunnel
device as appropriate with ifconfig.  The default is 500, which
resulted in tons of overruns for me.  After trying 2000 for a while
and still seeing overruns, I went with a value of 10000, which seems
to be working well.
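
For reference, that ifconfig call just issues a SIOCSIFTXQLEN ioctl;
a rough standalone C equivalent (with "tun0" and 10000 standing in
for whatever your own setup uses) would look something like this:

    /* sketch: set txqueuelen on the tun device, i.e. what
     * "ifconfig tun0 txqueuelen 10000" does under the hood */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <net/if.h>
    #include <linux/sockios.h>

    int main(void) {
        struct ifreq ifr;
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0) { perror("socket"); return 1; }

        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, "tun0", IFNAMSIZ - 1);  /* placeholder name */
        ifr.ifr_qlen = 10000;                         /* new txqueuelen */

        if (ioctl(fd, SIOCSIFTXQLEN, &ifr) < 0) {
            perror("SIOCSIFTXQLEN");
            close(fd);
            return 1;
        }
        close(fd);
        return 0;
    }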

2) A single, central tinc daemon serving several high-bandwidth client
hosts can easily max out a CPU core on a reasonable machine, even with
authentication, encryption, and compression disabled.  The way around
this (assuming most of your traffic flows form a star rather than a
mesh anyway) is to spawn a separate point-to-point VPN daemon for each
client host, instead of building one giant VPN with a single daemon at
the central host.  This allows the CPU load to be spread over multiple
CPU cores.

There are a few other issues which may warrant some source changes though:

3) When looking at txqueuelen issues, I stumbled on a tun device flag
IFF_ONE_QUEUE (and the corresponding TUN_ONE_QUEUE) in the Linux tun
source.  Some other VPN software out there (openvpn for example) seems
to set this on all tun devices.  Whether it helps or hurts in any
given situation on a modern Linux host is unclear to me just from
reading the source.  It's possible this might warrant an experimental
config setting for Linux, or at least some more research and testing.
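
For anyone who wants to experiment before such a setting exists, the
flag is just OR'd into ifr_flags when the device is attached; a
minimal sketch (not how tinc actually opens its device, and "tun0" is
only a placeholder) looks like this:

    /* sketch: open a Linux tun device with IFF_ONE_QUEUE set */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <net/if.h>
    #include <linux/if_tun.h>

    int open_tun_one_queue(void) {
        struct ifreq ifr;
        int fd = open("/dev/net/tun", O_RDWR);
        if (fd < 0) { perror("/dev/net/tun"); return -1; }

        memset(&ifr, 0, sizeof(ifr));
        ifr.ifr_flags = IFF_TUN | IFF_NO_PI | IFF_ONE_QUEUE;
        strncpy(ifr.ifr_name, "tun0", IFNAMSIZ - 1);  /* placeholder name */

        if (ioctl(fd, TUNSETIFF, &ifr) < 0) {
            perror("TUNSETIFF");
            close(fd);
            return -1;
        }
        return fd;
    }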

4) tinc.conf needs some settings for tuning SO_SNDBUF and SO_RCVBUF on
the tunnel traffic sockets.  I was getting tons of "Lost XXX packets
from YYY" syslog messages initially before I worked around this by
upping the default buffer sizes via
/proc/sys/net/core/[rw]mem_default.  That raises the default for all
sockets opened on the host, though, which is pretty heavy-handed.
Adding per-socket settings to tinc.conf seems like a no-brainer to me.
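
What I have in mind is nothing more exotic than a setsockopt() on the
UDP socket right after it's created, with the sizes coming from the
new tinc.conf options (the option names here are hypothetical):

    /* sketch: per-socket buffer sizing, e.g. from hypothetical
     * UDPSndBuf/UDPRcvBuf settings in tinc.conf */
    #include <stdio.h>
    #include <sys/socket.h>

    static int set_udp_buffers(int sock, int sndbuf, int rcvbuf) {
        /* the kernel still caps these at net.core.[rw]mem_max */
        if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF,
                       &sndbuf, sizeof(sndbuf)) < 0) {
            perror("SO_SNDBUF");
            return -1;
        }
        if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF,
                       &rcvbuf, sizeof(rcvbuf)) < 0) {
            perror("SO_RCVBUF");
            return -1;
        }
        return 0;
    }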

5) Even with all of the above in place, I'm still getting some spam
like this on the app servers:

Lost 133 packets from lb
Got late or replayed packet from lb, seqno 10173333, last received 10173466
Got late or replayed packet from lb, seqno 10173334, last received 10173466
[... every number between 10173334 and 10173464, in correct order ...]
Got late or replayed packet from lb, seqno 10173464, last received 10173466
Got late or replayed packet from lb, seqno 10173465, last received 10173466

This seems to indicate that occasionally a single packet is jumping
way ahead in the queue at some layer, sometimes by a few hundred
packets, and there may be nothing I can do about that (it may be
something happening down beneath my virtual Linux machines, or out on
Amazon's network).

I think maybe the late-packet bitmap could be tuned to handle it
better.  If I'm reading the code correctly, it only allows for a
128-packet window for tracking late/replay, and packets outside of
that window are simply dropped, correct?  I'd rather expand this
window (via a config parameter) and let the upper layers (the TCP
sessions riding above the VPN) deal with the ordering issues than have
the VPN drop the packets for relative tardiness.
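
To illustrate what I mean (this is not tinc's actual code, just a
sketch of a sliding replay window whose size comes from a config
parameter instead of being fixed at 128):

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define REPLAY_WINDOW_BITS 1024   /* imagine this set from tinc.conf */

    static uint32_t last_seqno;                   /* highest seqno seen */
    static uint8_t  seen[REPLAY_WINDOW_BITS / 8]; /* circular bitmap */

    /* returns true if the packet should be accepted */
    static bool check_seqno(uint32_t seq) {
        if (seq > last_seqno) {
            /* new highest: clear the stale bits we slide past */
            uint32_t advance = seq - last_seqno;
            if (advance >= REPLAY_WINDOW_BITS) {
                memset(seen, 0, sizeof(seen));
            } else {
                for (uint32_t i = 1; i <= advance; i++) {
                    uint32_t s = last_seqno + i;
                    seen[(s / 8) % sizeof(seen)] &= ~(1 << (s % 8));
                }
            }
            last_seqno = seq;
        } else {
            if (last_seqno - seq >= REPLAY_WINDOW_BITS)
                return false;    /* older than the whole window */
            if (seen[(seq / 8) % sizeof(seen)] & (1 << (seq % 8)))
                return false;    /* genuine replay */
        }
        seen[(seq / 8) % sizeof(seen)] |= 1 << (seq % 8);
        return true;
    }

With something like that, the window size (and hence how much
reordering the VPN tolerates) becomes a deployment decision rather
than a compile-time constant.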

6) While looking at the late-packet bitmap, I also noticed the packet
sequence numbers are 32-bit, and when they roll over they force a
re-key, which in turn restarts the sequence at zero.  It's not
completely clear to me yet whether this is an issue or not, but I'm
probably rolling these counters over as often as once a day in my
worst cases (lots of small packets).  It seems like this has the
potential to cause a traffic hiccup at each rollover for packets that
are already queued up with overflowed sequence numbers when the sudden
key change and sequence reset happens.  Could we handle the rollover
more gracefully (e.g. re-key well ahead of sequence exhaustion and let
the remaining sequence numbers under the old keying flow through the
buffers before the switch), and/or extend the sequence to a 64-bit
counter?
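
The early-re-key half of that could be as simple as a threshold check
before each packet goes out; a sketch, with a made-up margin and
function name:

    #include <stdbool.h>
    #include <stdint.h>

    /* leave plenty of headroom so packets already queued under the
     * old key can drain before the counter would actually wrap */
    #define SEQNO_REKEY_THRESHOLD (UINT32_MAX - 1000000u)

    static bool seqno_needs_rekey(uint32_t next_seqno) {
        return next_seqno >= SEQNO_REKEY_THRESHOLD;
    }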

I'll write up some patches for 1.0.X for some of these issues if
nobody beats me to it, but I wanted to see if anyone had more insight
(or objections) first.

Thanks,
-- Brandon

