tinc memory usage very high for some instances

Pallinger Péter pallinger at dsd.sztaki.hu
Wed Apr 8 22:41:37 CEST 2020


------- TL;DR -------

Many of the tinc daemons in a large (~50 node) tinc network tend to hog
a lot of memory (>100MB). None of our smaller tinc networks do this.

As to which nodes use a lot of memory, I could not find a
distinguishing factor: tinc version, OS version, kernel version,
presence in ConnectTo lists and tinc traffic all seem not to determine
memory consumption.
Architecture may be a factor, as only x86_64 hosts exhibit high RAM
usage, but I cannot change that.

What could cause such high memory usage? Any tips how to avoid it?
Does anyone have experience with larger tinc networks?

------- The Long Version -------

I have three tinc networks, two smaller application-specific ones
(10-15 active hosts), and one larger that spans most servers and VMs in
the department (~50 active hosts), mainly for administrative purposes.

The larger network consumes a lot of memory on many hosts (but really
modest amounts on some hosts). Generally, this would not be a big
deal, but having a process on _all_ VMs results in a cluster-wide memory
consumption increase of ~5-10GB, which is quite significant.

First, I thought the architecture would be the main differentiator, and
in some ways it is, but there are only two i686 systems, both running
old tinc versions:
 - i686
   - small network: 1.5 MB
   - large network: 9-12 MB
 - x86_64
   - small network: 1-5 MB
   - large network: 3-450 MB (median ~200MB)
The problem is that I simply cannot switch to i686 on most of the
problematic machines.

There are a variety of tinc versions running, all from
distribution packages (mostly Debian). Memory usage varies,
but there are some trends (memory usage was measured with: "ps ax -o
rss,command | grep tinc[d]"):
 - 1.0.13 (only one VM)
   - large network: 3 MB
 - 1.0.19
   - small network: 1-2 MB
   - large network: 3-120 MB (only three under 30MB, curiously only the
     ones built in 2012 were under 4MB, the ones built in 2013 were
     over 10MB, and generally over 80MB)
 - 1.0.24
   - small network: 1-4 MB
   - large network: 9-140 MB
 - 1.0.31
   - small network: 1.5-4 MB
   - large network: 6-420 MB (median around 200MB)
 - 1.0.35
   - large network: 15-450 MB (median around 200MB)
So newer tincs tend to use more RAM, but not _necessarily_ that much
more. Also, the small networks' usage never goes above 5MB.
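For reference, here is a small sketch of how the per-process numbers
above could be tabulated from that ps output (the awk filter and the
KiB-to-MB conversion are my additions; on Linux, ps reports RSS in KiB):

```shell
#!/bin/sh
# Sketch: print each tincd process's RSS in MB, based on the
# "ps ax -o rss,command" measurement above. Assumes RSS is in KiB.
tincd_rss_mb() {
  # Reads "RSS COMMAND..." lines on stdin, prints "X.X MB COMMAND".
  awk '/tincd/ { rss = $1; $1 = ""; printf "%.1f MB%s\n", rss/1024, $0 }'
}

# Live usage:
#   ps ax -o rss,command | tincd_rss_mb
```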

So I tried to break it up by distribution (only large network observed):
 - Ubuntu
   - 12.04: 3 MB
   - 14.04: 120-160 MB
 - Debian
   - 6: 3-140 MB
   - 7: 12-140 MB
   - 8: 9-110 MB
   - 9: 6-420 MB
   - 10: 15-450 MB

I also tried comparing kernel versions, and there were similarly
widespread numbers: on some 2.6 kernels it would eat 140MB, and on some
4.19 kernels it uses only 6MB.

Being in the ConnectTo list does not guarantee high memory
consumption, nor does being absent from it protect a machine.

A tinc restart reduces memory consumption to acceptable levels (under
10MB); memory consumption seems to take more than an hour to rise
again, but I will need more measurements to say anything definite.
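For those measurements, something like the following could sample the
daemon's RSS periodically after a restart (the log path and the cron
schedule are placeholders of mine):

```shell
#!/bin/sh
# Sketch: append one "epoch rss_kib" sample per tincd process,
# to track memory growth after a restart. Log path is hypothetical.
LOG=/var/log/tincd-rss.log

sample_tincd_rss() {
  # arg 1: timestamp; reads "RSS COMMAND" lines on stdin
  awk -v now="$1" '/tincd/ { print now, $1 }'
}

# Live usage, e.g. from cron every minute:
#   ps ax -o rss,command | sample_tincd_rss "$(date +%s)" >> "$LOG"
```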

The above made me curious as to whether high traffic causes high memory
consumption, so I calculated (RX+TX bytes on the tinc interface)/(RSS).
The resulting ratio varies between 1 and 50000, so this also seems
inconclusive. I do not know whether you can get the number of bytes
forwarded by the tinc daemon (AFAIK the tun device stats do not include
this); does anyone know?
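For the record, the ratio was computed roughly like this (the interface
name is a placeholder; on Linux the byte counters live under
/sys/class/net):

```shell
#!/bin/sh
# Sketch: (RX+TX bytes)/(RSS in bytes) for one tinc daemon.
# IFACE is a placeholder for the tinc network's interface name.
IFACE=tun0

traffic_per_rss() {
  # args: rx_bytes tx_bytes rss_kib -> prints the ratio
  awk -v rx="$1" -v tx="$2" -v rss="$3" \
      'BEGIN { printf "%.2f\n", (rx + tx) / (rss * 1024) }'
}

# Live usage:
#   RX=$(cat /sys/class/net/$IFACE/statistics/rx_bytes)
#   TX=$(cat /sys/class/net/$IFACE/statistics/tx_bytes)
#   RSS=$(ps ax -o rss,command | awk '/tincd/ { print $1; exit }')
#   traffic_per_rss "$RX" "$TX" "$RSS"
```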

So, all in all, I am stuck. It may very well be that there is a
memory leak somewhere that is triggered by some strange coincidence
that only occurs with many hosts in the network.

Any tips to solve this mystery are appreciated!

Thanks in advance:
	PP

-----
P.s.: additional information (maybe relevant?):

The big tinc network is installed using a modified ansible-tinc role:
https://github.com/pallinger/ansible-tinc, forked from
thisismitch/ansible-tinc. The main difference is that I can limit the
ConnectTo hosts (the original role added all hosts as ConnectTo),
because too many ConnectTo hosts caused some tinc versions (the old
ones, if I remember correctly) to spin the CPU for more than 10 minutes
at startup while not actually forwarding anything on the VPN.
Currently there are 10 ConnectTo hosts on the large network -- the ones
with public IP addresses -- and the network works fine aside from the
high memory usage.
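For illustration, the resulting config on an ordinary node looks
roughly like this (the netname and host names are placeholders; only
the public-IP gateways get ConnectTo lines):

```
# /etc/tinc/bignet/tinc.conf
# ("bignet", "thisnode", "gateway1" etc. are placeholder names)
Name = thisnode
ConnectTo = gateway1
ConnectTo = gateway2
# ...up to the ~10 hosts with public IPs; the other ~40 nodes
# are not listed as ConnectTo anywhere.
```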
