[freenet-dev] Still getting timeouts
Matthew Toseland
toad at amphibian.dyndns.org
Fri Feb 1 17:55:24 UTC 2008
On Thursday 31 January 2008 21:21, Robert Hailey wrote:
>
> >>> Oh?
> >>
> >> Every time I look at my opennet peers, I *always* have at least two
> >> with pings greater than 2 seconds. Right now, one with 4.5 secs, and
> >> one with 8.9 (the rest are sane).
> >
> > Hmmm. Doesn't happen for me, although I only have 4 or 5 opennet
> > peers.
> >
> > It seems extraordinarily unlikely that this is real - either this is a
> > stats bug, or a message layer bug.
>
> And if it is a message layer bug, that means it may be directly
> related to the timeouts.
It's also possible that it's just due to nodes being hideously overloaded,
which can have several causes:
- The startup spike, which can last a long time because we have no request
resuming.
- Running out of memory, causing continual garbage collection. Several users
have reported that this happens after 12 hours or so of uptime.
- ....
>
> >>>> In the past while examining the throttle controls, I have suspected
> >>>> that (with priority queues) the "90-seconds at full throttle"
> >>>> constant might actually reduce to taking on too many concurrent chk
> >>>> transfers for them all to complete on time.
> >>>
> >>> Why? IIRC we include a fudge factor in that calculation, admittedly
> >>> it isn't very accurate and should be made more so by using stats on
> >>> bandwidth usage...
> >>
> >> Just that the CHKs all use the same throttle, so they all throttle
> >> down when we accept another CHK transfer.
> >
> > Well sure, but if the mechanism is working we won't accept enough to
> > be a problem.
>
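To put rough numbers on the shared-throttle point above, here is a minimal
sketch. It is not Freenet code: the class and method names are made up, it
assumes a 32 KiB CHK data block, an equal share of output bandwidth per
transfer, and no protocol overhead or fudge factor.

```java
public class SharedThrottle {
    // Assumption for illustration: one CHK transfer carries a 32 KiB data block.
    static final int CHK_BLOCK_BYTES = 32 * 1024;

    // Time for each transfer to finish when `concurrent` transfers share
    // the output bandwidth equally (hypothetical model, no overhead).
    static double transferSeconds(int concurrent, double bytesPerSecond) {
        return concurrent * (double) CHK_BLOCK_BYTES / bytesPerSecond;
    }

    // Largest number of concurrent transfers that still all finish
    // within targetSeconds under the same model.
    static int maxConcurrent(double bytesPerSecond, double targetSeconds) {
        return (int) Math.floor(bytesPerSecond * targetSeconds / CHK_BLOCK_BYTES);
    }

    public static void main(String[] args) {
        double limit = 16 * 1024; // example output limit: 16 KiB/s
        System.out.println(maxConcurrent(limit, 90.0));  // 45
        System.out.println(transferSeconds(45, limit));  // 90.0
        System.out.println(transferSeconds(46, limit));  // 92.0
    }
}
```

Under these assumptions every accepted transfer stretches all the others, so
a node that accepts right up to the limit makes each transfer take the full
90 seconds.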
> I'm not saying this is an issue, but when a node is busy the 90-second
> standard might actually make the average chk transfer time (over long
> distances) always exactly 90 seconds (through the busiest node). Since
> the transfer timeout is 120 seconds, this actually leaves only 30
> seconds to accumulate acceptable latency; by your previous value of 30
> hops, this means one second per hop (1/2 ping time plus coalescing
> delay?).
Hmmm. Perhaps. So we should reduce the 90 seconds to, say, 60 seconds? That
might cut actual bandwidth usage...
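For reference, the per-hop arithmetic above can be sketched as follows. The
numbers (120 second timeout, 90 second target, 30 hops, and the 60 second
alternative) are the ones quoted in this thread; the class and method names
are purely illustrative and do not exist in the node.

```java
public class LatencyBudget {
    // Seconds of latency each hop may add before the transfer timeout is hit,
    // assuming the slowest node uses the whole target transfer time.
    static double perHopBudgetSeconds(int timeoutSeconds, int targetSeconds, int hops) {
        return (double) (timeoutSeconds - targetSeconds) / hops;
    }

    public static void main(String[] args) {
        // Current figures: 120 s timeout, 90 s target, 30 hops.
        System.out.println(perHopBudgetSeconds(120, 90, 30)); // 1.0
        // With the target reduced to 60 s the per-hop slack doubles.
        System.out.println(perHopBudgetSeconds(120, 60, 30)); // 2.0
    }
}
```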
>
> Or else, how many transfers are aborted because nodes disconnect, and
> would they succeed if the target transfer time were shorter than 90
> seconds? Particularly since the CHK is streamed, the traffic sent up
> until the abort is wasted (50% payload?).
Hmmm. IIRC that is fatal?
>
> >>>>> Do timeouts show up in simulation?
> >>>>
> >>>> I don't normally watch for them, I've started a new run with
> >>>> Accepted & Fatal request timeouts being logged. So far nothing.
> >>>
> >>> Ok.
> >>
> >> After running the simulator for two hours w/ ten nodes, I spot
> >> exactly one Accepted timeout (17 minutes into the simulation).
> >>
> >> So the answer is yes... timeouts still occur in the simulator.
> >
> > Suggests a messaging bug, although it's possible it's an artifact of
> > java's lack of thread priorities on *nix (i.e. cpu issues).
>
> I would be more inclined to suspect a messaging bug; it is a beefy
> machine and the timeout occurred some time into the simulation.
>
> >>>>> What can we do to debug this?
> >>>>
> >>>> Probably:
> >>>> (1) simulating the high ping times seen in the public network, at
> >>>> about the same rate,
> >>>
> >>> You mean bugs cause high ping times and high ping times cause
> >>> timeouts?
> >>>
> >>>> (2) a message/link layer stress test complete with rekeying,
> >>>> disconnects, and [busy/not-busy] spikes
> >>>
> >>> This would be a good idea, I dunno how much work would be involved?
> >>>
> >>> What can I usefully work on in this area? AFAICS:
> >>> - The window-grows-while-unused bug.
> >>> - More accurate bandwidth liability limiting.
> >>> - Debug the not-forwarded detection and make assumeNATed false by
> >>> default.
> >>> (Reduce baseload bandwidth usage).
> >>>
> >>> Anything else? You want to take any of these on?
> >>
> >> I don't think I can take on a big project right now.
> >
> > Is there anything I can do?
>
> I am not familiar with the window-grows-while-unused bug, and am not
> working on/debugging the message layer right now. It's up to you.
I will fix the window-grows-while-unused bug.
W.r.t. messaging layer bugs, please explain how to reproduce your simulation;
commit whatever source is needed.