[Tech] [Old frost] Delayed socket reads
Matthew Toseland
toad at amphibian.dyndns.org
Wed Aug 8 14:49:59 UTC 2007
----- Anonymous at o9_0DTuZniSf_+oDmRsonByWxsI ----- 2007.06.17 -
13:53:13GMT -----
Sounds like the proposal should be expressed in more formal way, to ensure
that we all do not speak about different things. Maybe comparing it with
current implementation is a good starting point, so the proposed differences
are:
- instead of using a single system UDP socket, each connection should be given
a distinct UDP socket bound to specific remote address. The UDP socket
non-bound to remote is still used to notice new connections (and connections
where peer changed its IP address) - once newly seen address authentificated,
new bound UDP socket created to handle the connection.
Potentially, specific peer might be sending UDP packets from different IP
address, where the address changes not on daily/hourly basis (which is quite
common) but a remote NAT assigns a random address from a pool of alloted
addresses for [virtually] every packet. I havn't seen so badly broken
internet connections yet; most likely such node would prefer using TCP or
HTTP over any UDP based transport solution anyway.
The unbound UDP socket is given low weight as far as bandwidth sharing
algoritm goes.
- Aside the unbound UDP socket, the per-connection UDP sockets are given equal
bandwith sharing weight (with unused bandwidth somehow (not neccessarily
evenly) distributed among peers that could handle more traffic). Each
peer/socket is handled by a separate thread, and so different peers do never
block each other on network I/O.
Other words, every peer is guaranteed to get at least 1/(P+1) share of
incoming bandwidth - never long delays of no read which might lead to
oscillations (which, as mroger pointed out, could be fought with very
inertious averages, but no need to provoking them also). But as all the
momentarily unused bandwidth is made available for other peers/sockets, the
token bucket is very small, and most likely should be not higher than zero
(using the following semantics: upon request processing the bucket size
becomes negative, and next read from the socket might occur only when the
buckets gets refilled back to zero level at which point the continuous
background refilling stops).
Using NIO the extra threads could be avoided, but it is not really needed
seeing that a node is dicouraged having more than a dozen connected peers
anyway.
The threads will of course be synchronizing on local tokens/bucket handling -
which is expected to be within a microsecond, and tasks like db access -
which is very good as it keeps node from spiralling down due to CPU/disk
performance starvation. Simple implementation would also limit each peer up
to one simultaneous db access - and that is good, as that throttles the peers
that abuse our disk resources (by flooding requests with htl=1 for example).
- the current I/O bandwidth liability limiting mechanism is not deprecated at
all; as toad pointed out multiply times, it is essential for handling tiny
requests that tend to result in large responses often enough.
Once a packet is read from a bound UDP socket:
a) an insert or request: it is verified against the bandwidth liability
limiters; is the checks passed, the insert/request accepted as it currently
does. Otherwise the insert/request gets explicitly rejected
(FNPRejectOverload) as it currently does. Nothing really changes here.
At the moment of accepting the insert/request the peer token bucket might be
charged, in order to discourage floods. Then credit the charged tokens to the
peer we later received the requested data from.
b) data we are awaiting for, either for our node or as forwarded request.
c) auxilarly traffic like acks or location swaps or keepalives.
These undergo no liability limiting, as the transfer already happened -
exactly the same as now. Notice that if the received traffic is 'valuable'
(explicitly requested earlier) we credit the peer with the tokens charged
from the original requestor bucket (the only mechanism to possibly raise the
bucket size above zero), so this peer is not penalized (at very least not as
heavily as now) - and so requests of this peer are not getting postponed for
long.
As the bonus tokens are not just thrown in, but taken from other peers who are
interested in service, there should be no bursts happenning. At least,
charging buckets might be done at the moment the data response sent, not at
the moment of requests (or the charge can be somehow split among those
timepoints).
Now, a question should be raised - if all the liability limiter stuff is
expected to work as it already does, why bother? The trick is that delayed
socket reads will be allowing remote side to notice in advance that the peer
became slower. The remote will not know if that happens due to exceeding the
1/(P+1) share of bandwidth, or due to CPU/disk/swap/whatever performance
shortage: all these situations require slowing down to avoid rejects and/or
timeouts. So the slowing down is made at 21 levels: first, the delayed socket
reads smooth the load over longer time period, and second, the requestor will
start sending/queueing less requests here, more utilizing alternative paths
and less busy peers. And that's the whole point.
Additional note: this algorithm requires no modification if UDP sockets are
mixed with TCP ones (provided that TCP window size is comparable to MTU, not
the default 32/64/128KB).
Until last week I was under impression that the timing based mechanism of
routeing around overloaded nodes is already implemented. After taking time to
study the close closer it turns out to be false: AFAIS the routing around
happens only upon explicit FNPRequestOverload - the
freenet.node.PeerManager._closerPeer() method does not take the peer
roundtrip average into account at all. Here is the [rather simple]
improvement I propose: instead of making routing decision based solely on
location distance skipping 'unsuitable' locations, I propose searching a peer
by choosing the best (numerically highest) request advance speed,
specificially for each node:
- disconnected peer is skipped (as it is now);
- the peers we already tried to search the key at are skipped (as it is now);
- the peer the original request came from is skipped (as it is now);
- once passive requests implemented, other requestors of the key are to be
skipped too (is they gain the key, the subscription will time out soon and
then we will be able to try there);
- [the maxDiff and the old version logic is not too clear for me];
Now the real salt:
- the speed is calculated as signedLocationDiff(our_location, peer_location,
key_location)/RTT(peer) where signedLocationDiff is positive if
locationDiff(our_location, key_location)>locationDiff(peer_location,
key_location), and negative otherwise (assuming our_location!=peer_location,
which would give zero location difference but should never happen). That can
be multiplied by nodeAveragePingTime if values of predictable range
(say -3..3) are preferred (but no need to provide guarantees).
- backed off node speed is subtracted by 3 (or thereabout; but maybe smaller
value, like 1 or even 0.5).
Using the speed for connections sorting is very natural: it basically allows
to answer the question 'which peer the request should be sent to in order to
receive response ASAP - preferring to use slightly more bandwidth on
underloaded nodes (where bandwidth is idle i.e. virtually free) rather than
waiting for slow/overloaded nodes to do the job, and provided that the
solicited key already exists where it should'.
And so it gives the following advantages:
- (major) we start routing around busy nodes well before they start generating
FNPRejectOverload packets.
- (minor) no longer need to scan the peer list twice, first skipping backed
off nodes then including them.
- (noticeable) requests that result in successful data retrieve will be
delivering answer to original requestor faster (on average).
- (average) the routes will be more fluid and thus harder for adversary to
supervise or censor;
- (minor) very fast nodes even with small own storage/cache provide effective
caching services (for popular keys) for their immediate peers which are slow
but have huge well-specialized storage.
And note that this algorithm will be useful even for the current
code/protocol, without delayed socket reads; delaying the socket reads will
undoubtfully have positive impact by smoothing the choice over time (and
efficiently fighting any oscillations) and having routing around busy nodes
to start occuring much earlier (starting from 'boundary' keys, then if the
node anyway gets increasingly busy, by further narrowing down the location
space we try at the busy node).
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
Url : http://emu.freenetproject.org/pipermail/tech/attachments/20070808/92b3fe4a/attachment.pgp
More information about the Tech
mailing list