Proposals for UDP information transport over the metagraph

Sun Sep 28 19:29:49 CEST 2014

While working on SPTPS UDP relaying I realized that there is one issue
I didn't account for, which is that the sending node only knows the
PMTU to the first relay node. It doesn't know the PMTU of the entire
relay path beyond the first hop, because the relay nodes don't provide
their own PMTU information over the metaprotocol.

Now, in the legacy protocol this is not really an issue, because TCP
MSS clamping (which is really what matters in practice) is applied on
a hop-per-hop basis, i.e. each hop clamps the MSS according to the
PMTU information it has. Unfortunately, that doesn't work for SPTPS
because relay nodes can't decrypt the contents of the packets that
pass through them. This can result in relays falling back to TCP in
the middle of a relay path if part of the relay path has a smaller
PMTU than the first edge.

Therefore it appears we need a way to figure out, for a specific node,
what the "indirect PMTU" (i.e. PMTU when relaying) is. This
information would presumably be stored in an "indirect_mtu" field (or
something like that) in node_t, and maintained separately from the
"mtu" field which is set by the normal discovery process.

Interestingly, this piece of information could perhaps also be used to
kickstart PMTU discovery, using the indirect PMTU as a "hint" to
constrain the initial range and thereby making it converge faster on
the actual PMTU.

Note that the dynamic UDP address information (currently transmitted
via ANS_KEY messages) poses similar issues, and therefore these
proposals can be applied to UDP address data as well with the same
pros and cons. In the end I expect us to adopt the same proposal for
both PMTU and address information. Arguably, having a robust dynamic
way to transmit fresh UDP address information could make edge
addresses obsolete (except of course in the last proposal below) since
the address information can be obtained on-the-fly and TCP used in the
mean time.

I came up with a few proposals that are described below. I'm currently
leaning towards the "periodic unicast info messages" proposal.
Comments welcome.

PROPOSAL: ADD PMTU INFORMATION TO ANS_KEY

ANS_KEY messages are already used for opportunistically transmitting
UDP address information. We can use the same mechanism to transmit
PMTU information as well.

Pros: Extremely simple. High-quality information that takes into
account perspective from intermediate nodes to provide the best
possible bet for UDP address and PMTU data.
Cons: PMTU is a dynamic piece of data that is subject to real-time
changes. ANS_KEY messages, on the other hand, are sent quite rarely.
This can result in PMTU information quickly becoming stale, and the
mechanism is dangerously dependent on the precise timing of the very
first ANS_KEY message (i.e. whether it is sent before or after PMTU is
discovered).

PROPOSAL: PUSH-STYLE BROADCAST MESSAGES

This proposal consists in adding a new type of broadcast message whole
sole purpose would be to provide the PMTU between two nodes. This
message would be sent to all nodes every time PMTU information
changes. So, for example, if node A discovers its PMTU to node B is X,
then it would broadcast PMTU(A, B) = X to all nodes. Each node would
then keep track of these messages using a dedicated data structure
containing all known PMTU pairs. Every time a metaconnection is
established, nodes would dump the contents of this data structure to
each other.

Pros: high-freshness, immediately available UDP information via
low-latency push notifications. High-quality information that takes
into account perspective from intermediate nodes to provide the best
possible bet for UDP address and PMTU data.
Cons: questionable scalability due to the use of broadcast messages
and the need for each node to keep global state about not just nodes
(which it already does), but about node *pairs* as well. An extreme
example would be a 1000-node graph on which someone decides to run
Bittorrent, thereby making all nodes discover PMTU to each other: this
would result in O(1000000) messages getting sent and the in-memory
data structure would grow to at least 17 megabytes. A back-of-the
envelope calculation suggests transmitting the contents of the data
structure to a joining node would use ~30MB of network traffic.

PROPOSAL: PERIODIC UNICAST INFO MESSAGES

In this proposal, PMTU information would be exchanged only between
nodes that are currently sending packets to each other, and would be
refreshed every few seconds or so. So for example, if node B is
currently receiving packets from node A, it would send a small "UDP
information message" directed at node A over the metaprotocol every
few seconds. Contrary to the previous proposal, this message would be
unicast, not broadcast. When B stops receiving packets from A, it
stops sending UDP information messages to A.

The original UDP information message would be empty, because B doesn't
really know anything about its *own* PMTU or UDP address. Instead,
what happens is that the nodes sitting between A and B would alter the
message as it is transmitted, replacing the information in it with the
node's own UDP information about node B. This is somewhat similar to
how UDP address information is currently populated in ANS_KEY
messages, with a small difference: if a node has confirmed working UDP
information about node B, it will *erase* the information in the
message as it retransmits it, even if the message already contains UDP
information. The rationale is that UDP information is more relevant
the closer it is to the packet sender (A). ANS_KEY currently doesn't
behave that way, in that it will not replace UDP address information
if it's already there (that might be an oversight, because I don't
think this behavior makes much sense IMHO).

Of course, this is mainly useful when B is receiving packets over TCP,
because that means A might me missing information about how to reach B
via UDP. However, it is also useful to send this message even while
receiving UDP packets to inform A of any PMTU increases, and also to
optimize intermediate relaying steps (the alternative would be to make
the relays initiate information messages as opposed to the final
recipient node, but that would result in more traffic for no apparent
benefit).

Pros: High-quality information that takes into account perspective
from intermediate nodes to provide the best possible bet for UDP
address and PMTU data. Good freshness.
Cons: Not as fresh as "push" notifications (probably not that big a
deal). More metaconnection traffic, though I think this is acceptable
because the messages are unicast, small, rate-limited and are only
exchanged between nodes that are actively exchanging packets.
Presumably the traffic from the packets themselves dwarfs the traffic
from the information messages.

PROPOSAL: DYNAMIC EDGE MUTATIONS

This proposal goes like this:

- Add PMTU information to edges. Assuming node A has a direct edge to
node B, then edge(A, B).mtu = node(B).mtu (from the perspective of
node A).

- Broadcast edge PMTU information in ADD_EDGE messages.

- If the PMTU changes, then the edge would be updated and a new
ADD_EDGE message would be broadcast again, thereby dynamically
*mutating* the edge as the situation changes. (as a matter of fact, it
turns out tinc already supports updating edges in that way, though it
currently prints a warning when that happens)

- sssp_bfs() would take PMTU into account while traversing the graph,
and update each node with the indirect MTU. Indirect MTU would be
defined as the minimum of all PMTUs across the edges that form the
path to that node.

Interestingly, this proposal is not perfectly optimal, because of the
following situation with graph A <-> B <-> C <-> D: if PMTU(B, C) = X
and PMTU(C, D) = X but PMTU(B, D) < X, then from the point of view of
A, D.indirect_mtu will be X, which means that if A relays a packet to
D through B, B won't be able to send it directly to D because the
packet would be too large. That said, B won't have to fallback to TCP:
instead it will still be able to send it via UDP, but it will have to
relay it through C instead of sending it directly to D. Still
suboptimal though, and it could become painful when dealing with weird
network topologies.

Pro: high-freshness UDP information via low-latency push notifications.
Con: provides lower quality information compared to other proposals,
because it is impossible to find out what information node A knows
about node B if the two don't have an edge between each other (see
example above). More metaconnection traffic (edge updates).