Henrik Nordstrom wrote:
>
> There is a small design error in my patch, making it perform slightly
> differently than I described.
>
> if (parent_timeout > sibling_timeout)
>
> should read
>
> if (parent_exprep)
>
> Updated patch attached.
>
> Which one is correct I don't know. Should Squid bother waiting for
> slow siblings when the parents are much faster (at least twice as fast)?
I'd guess probably not...
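Just to check I've understood the corrected version, my reading of the intended
behaviour is roughly the following (parent_exprep, parent_timeout and
sibling_timeout are the names from Henrik's description; the rest is only my
sketch, not the actual patch code):

    static int
    pick_ping_timeout(int parent_exprep, int parent_timeout, int sibling_timeout)
    {
        if (parent_exprep) {
            /* at least one parent is expected to reply, so use the
             * parent-derived timeout and don't wait for slower siblings */
            return parent_timeout;
        }
        /* no parent replies expected - wait as long as the siblings need */
        return sibling_timeout;
    }

i.e. whenever any parent is expected to answer, the sibling estimate is ignored
completely - which is what prompts the question above.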
I've been trying the earlier version of the patch over the last few hours,
and it looks like it works. After it had been running for a while, I counted
the last 2000 access.log entries mentioning PARENT or SIBLING in the
hierarchy status field (omitting others) and saw
312 FIRST_PARENT_MISS/cam1.sites.wwwcache.ja.net
932 FIRST_PARENT_MISS/cam2.sites.wwwcache.ja.net
86 FIRST_UP_PARENT/cam0.sites.wwwcache.ja.net
14 PARENT_HIT/cam0.sites.wwwcache.ja.net
7 PARENT_HIT/cam1.sites.wwwcache.ja.net
89 PARENT_HIT/cam2.sites.wwwcache.ja.net
2 SIBLING_HIT/wwwcache.damtp.cam.ac.uk
196 TIMEOUT_FIRST_PARENT_MISS/cam1.sites.wwwcache.ja.net
329 TIMEOUT_FIRST_PARENT_MISS/cam2.sites.wwwcache.ja.net
33 TIMEOUT_FIRST_UP_PARENT/cam0.sites.wwwcache.ja.net
and the ICP ping timeouts mentioned in debugging output were much more
reasonable for the parent RTTs. Initially, I thought it must be getting it
wrong as the logged timeouts seemed far too low, but checking with cachemgr
I found that the actual peer RTTs were much lower than when I was testing
last night - in fact, over a couple of hours this morning the peer RTTs
increased by a factor of ten as load (on our cache, the parents, and the
net) picked up, which serves to emphasise the importance of dynamic
adjustment of the timeout.
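By "dynamic adjustment" I mean something along these lines (entirely
illustrative - the multiplier and the bounds are numbers I've made up, and I
haven't checked what the patch actually computes):

    static int
    dynamic_ping_timeout(int slowest_expected_rtt_ms)
    {
        /* scale the ICP ping timeout from the current RTT estimates
         * rather than using a fixed value */
        int timeout_ms = 2 * slowest_expected_rtt_ms;
        if (timeout_ms < 50)
            timeout_ms = 50;     /* invented floor */
        if (timeout_ms > 2000)
            timeout_ms = 2000;   /* invented ceiling */
        return timeout_ms;
    }

With the RTTs varying by a factor of ten over a morning, any fixed value would
have been badly wrong for part of the day.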
The counts above also reinforce my feeling that RTT estimates should
influence routing even for requests that get ICP timeouts from all peers
(assuming that's not because all peers are down... :-) - the FIRST_UP_PARENT
choice was cam0.sites but that parent is clearly being avoided by the
ICP-based routing, and it seems to have problems at present - a continual
stream of "TCP connection failed" messages (with the occasional success), etc.
Being first in the list does not mean it's a good (or even reasonable) choice.
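To illustrate the sort of thing I mean (purely a sketch of the idea, with an
invented structure and names - not the real Squid peer code):

    struct fake_peer {
        struct fake_peer *next;
        int alive;          /* 1 if the peer currently looks up */
        int rtt_estimate;   /* recent smoothed RTT, in ms */
    };

    /* when every peer has timed out, fall back to the live parent with
     * the best recent RTT estimate rather than the first one configured */
    static struct fake_peer *
    pick_timeout_parent(struct fake_peer *parents)
    {
        struct fake_peer *best = NULL;
        struct fake_peer *p;
        for (p = parents; p != NULL; p = p->next) {
            if (!p->alive)
                continue;
            if (best == NULL || p->rtt_estimate < best->rtt_estimate)
                best = p;
        }
        return best;   /* NULL only if no parent is up at all */
    }

Something along those lines would presumably have steered the TIMEOUT_ requests
away from cam0.sites this morning without it needing to be marked dead.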
A few other miscellaneous points that arose in this morning's testing:
(1) The first couple of times I attempted to start the recompiled Squid 2.2
with the patch, startup failed with an "Arithmetic Exception" message
appearing in the middle of the startup script's output. It's not clear where
that came from, though presumably from squid rather than something else (but
no way to be sure). I couldn't see any core files to help pin it down. At
that point, I had some extra debugging enabled -
debug_options 11,9
debug_options 15,9
debug_options 17,9
debug_options 44,9
in addition to my normal ALL,1, and as I'd become suspicious that the
debug_options could have odd side-effects (see below!), I tried again with
those removed. That worked, but subsequently the Arithmetic Exception did
not recur even when I reinstated the debug_options and reconfigured, or
later when I restarted the server with those options in the config file at
startup. Very odd.
(2) I've mentioned before that adding an extra debug option sometimes seems
to cause all cache.log output to cease, and it was happening again, though
not entirely the same as I'd noticed before. Previously, I'd found that
adding an extra option in addition to those mentioned above would cause all
cache.log output to cease. What I saw today was that with just those options
in the config, normal Squid startup messages etc., were omitted completely
(though I'm sure that's not always happened with earlier testing) *but*
the debugging output (ICP ping details etc.) was logged as normal. Some
shutdown messages still got logged, but not all - e.g. when the cache wasn't
handling live load, it just reported
CPU Usage: 364.210 seconds
Maximum Resident Size: 0 KB
Page faults with physical i/o: 4175
without any of the messages about saving metadata, etc.
(3) I've noticed that when running with live load (but not when essentially
idle), Squid 2.2 often (but maybe not always) produces a list of open FDs at
shutdown, like
1999/05/08 13:29:58| Finished. Wrote 384017 entries.
1999/05/08 13:29:58| Took 2 seconds (192008.5 entries/sec).
CPU Usage: 605.030 seconds
Maximum Resident Size: 0 KB
Page faults with physical i/o: 4747
1999/05/08 13:29:58| Open FD 6 /opt/squid-logs-v2/useragent.log
1999/05/08 13:29:58| Open FD WRITING 39 /opt/squid-logs-v2/access.log
1999/05/08 13:29:58| Open FD WRITING 41 /opt/squid-logs-v2/store.log
1999/05/08 13:29:58| Open FD WRITING 43 squid -> unlinkd
1999/05/08 13:29:58| Open FD WRITING 44 /opt/squid-cache-v2/swap.state
1999/05/08 13:29:58| Open FD WRITING 91 /opt/squid-cache-v2/03/45/00034556
1999/05/08 13:29:58| Open FD 106 HTTP Request
1999/05/08 13:29:58| Open FD WRITING 178 /opt/squid-cache-v2/03/44/000344D1
1999/05/08 13:29:58| Open FD WRITING 240 /opt/squid-cache-v2/03/45/00034529
1999/05/08 13:29:58| Squid Cache (Version 2.2.STABLE2): Exiting normally.
I don't see that with Squid 2.1 (and don't remember it from any earlier
version) - is it just reporting something that was ignored before, purely
for information, or is it an indicator that there's a problem or bug?
My guess (without checking the source code) is that something's not allowing
enough time for a clean shutdown. After getting messages like the above, I
tend to see comments like
1999/05/08 11:50:15| storeSwapInFileOpened: /opt/squid-cache-v2/00/00/000000C9:
Size mismatch: 11884(fstat) != 1229(object)
subsequently, implying that the writing out of in-memory objects really was
interrupted - though I can't tell whether that was because it was genuinely
taking an unreasonably long time, or because something was impatient and
didn't allow enough time for the cleanup.
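If the problem is just that not enough time is being allowed, the only knob I
can think of is the shutdown_lifetime directive - though that's a guess, and I
haven't checked whether it covers the object write-out as well as lingering
client connections:

    # squid.conf - allow more time for shutdown (the default is 30 seconds)
    shutdown_lifetime 60 seconds

If anyone knows whether that's actually relevant here, I'd be interested.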
John
--
University of Cambridge WWW manager account (usually John Line)
Send general WWW-related enquiries to webmaster@ucs.cam.ac.uk