This is a little more abstruse than the regular squid-users fare.
Besides, I'd like to drag some numbers and speculations around a bit and
get people's thoughts on them without the regular crowd jumping to any
conclusions.
(When replying, make sure I get a copy. The list removed me from
squid-dev a while back, when I moved states and email addresses.)
(There's background and information in my usual waffling style. There
_is_ a specific question at the end supported by all this guff. I would
really like people's opinions)
Okay, I've been put in the position of performing some benchmarks on our
new-version proxy boxes before we deploy. Accordingly, I thought
polymix#1 from the bakeoff would be the best thing. I set up a one hour
goal, and arbitrarily threw a 120/second request-rate at it. (System is
dual-Xeon/450 with Linux 2.0.36+fd3000patch, 1GB RAM). The system handled
it admirably, and I was pleased (Our squid 1.1.22 boxes on the same
hardware only manage about 40/second with the same polygraph test, and
cannot sustain that rate for the full hour..which is part of what I'm
getting to).
Now, here we come to the curious bit. Correct me if I'm wrong, but the
number of objects actually in the cache should not significantly impact
the lookup time for a given object, because it's a hashed model, yes?
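
To make the hashed-lookup premise concrete, here is a minimal sketch that times successful lookups in hash tables of different sizes. It uses Python's built-in dict as a stand-in, not Squid's own store hash, and the URL pattern and probe counts are made up for illustration. On most machines the per-lookup time should grow only mildly with table size (mostly from CPU cache misses), not in proportion to it.

```python
import time

def mean_lookup_seconds(n_objects, n_probes=20000):
    """Build a hash table of n_objects keys and time successful lookups.

    The dict is only a stand-in for a hashed object index; the
    example.invalid URLs are hypothetical keys.
    """
    table = {f"http://example.invalid/obj/{i}": i for i in range(n_objects)}
    # Spread the probes evenly across the key space.
    step = max(1, n_objects // n_probes)
    probes = [f"http://example.invalid/obj/{i}"
              for i in range(0, n_objects, step)][:n_probes]
    start = time.perf_counter()
    hits = 0
    for key in probes:
        if key in table:
            hits += 1
    elapsed = time.perf_counter() - start
    assert hits == len(probes)  # every probe is a key we inserted
    return elapsed / len(probes)

if __name__ == "__main__":
    for n in (10_000, 100_000, 1_000_000):
        print(n, f"{mean_lookup_seconds(n) * 1e9:.0f} ns/lookup")
```

If the numbers for the largest table are within a small factor of the smallest, the index lookup itself is unlikely to explain a sudden collapse in throughput.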
So, at 120/sec the test runs along for nearly an hour. Around the time
we get to, say, 300MB in the disk _buffers_ (the stuff that linux
allocates unused memory to, to avoid confusion at this juncture), the
capacity of the box to handle proxy requests suddenly dips below
120/sec, and the number of in-use connections in polyclt begins to rise
sharply. Within two minutes, the test run aborts.
Checking the access logs reveals that connection durations after the
critical mark climbed rapidly to 50,000 ms and more, and that requests
were still being processed, completed and logged for a minute or two
after the client had ceased (check the server...still serving, yes).
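
One way to pin down when the durations blow up is to bucket the access log by minute. The sketch below assumes Squid's native access.log layout, where the first field is a UNIX timestamp and the second is the elapsed time in milliseconds. The function name and the 50,000 ms default threshold are illustrative choices, not anything taken from the test setup.

```python
from collections import defaultdict

def slow_minutes(lines, threshold_ms=50_000):
    """Return {minute_epoch: (requests, max_elapsed_ms)} for each minute
    that contains at least one request at or above threshold_ms.

    Assumes native access.log lines: field 0 = UNIX timestamp,
    field 1 = elapsed milliseconds. Malformed lines are skipped.
    """
    per_minute = defaultdict(lambda: [0, 0])
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            logged_at = float(fields[0])
            elapsed_ms = int(fields[1])
        except ValueError:
            continue
        minute = int(logged_at // 60) * 60
        bucket = per_minute[minute]
        bucket[0] += 1
        bucket[1] = max(bucket[1], elapsed_ms)
    return {m: tuple(v) for m, v in sorted(per_minute.items())
            if v[1] >= threshold_ms}

if __name__ == "__main__":
    import sys
    with open(sys.argv[1]) as log:
        for minute, (count, worst) in slow_minutes(log).items():
            print(minute, count, worst)
```

The first minute it reports can then be lined up against the vmstat output to see what the box was doing when things went bad.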
Okay. Wait for everything to settle, and grab a coffee. Start the test
again at the same rate, not expecting it to work. Good guess. It
doesn't. polyclt runs out of descriptors and aborts within two minutes.
Access logs on the proxy itself are much the same.
Run test at 100/sec. Runs for some time. Symptoms repeat. I'm quietly
watching 'vmstat 5' this whole time, and watching the disk buffers
filling (490MB now; squid's cache_mem is set at 8MB, just as a reference
point, since I later want to examine speed differences for different
sizes). Eyeballing the disks: Hardly working. updated is running at five
second intervals:
procs                  memory  swap     io    system           cpu
r b w swpd free   buff  cache si so bi  bo   in   cs us sy      id
1 1 0   20 7996 332768 491372  0  0 16  32   95   47  9  4      87
3 0 0   20 7952 332768 491348  0  0 13 374 1947 1979 71 26       3
1 0 0   20 7992 332768 491392  0  0 14 313 1825 2003 75 26 8504885
0 1 0   20 7992 332768 491344  0  0 11 285 1871 1776 70 28       3
1 0 0   20 7956 332768 491416  0  0 21 307 1951 1887 76 27 8488075
2 0 0   20 7968 332768 491392  0  0 10 294 1994 1947 78 24 8504883
1 0 0   20 7956 332764 491372  0  0 18 335 1985 1979 73 22       5
3 0 0   20 7936 332764 491316  0  0 35 347 1798 1846 78 24 8454658
1 0 0   20 8132 332764 491156  0  0 15 335 2052 1889 80 29 8488068
1 0 0   20 8028 332764 491260  0  0 12 286 2131 1938 82 27 8504877
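
For anyone who would rather not eyeball these columns, here is a rough sketch that parses vmstat rows in this 16-column layout and averages a few of them. The column names come from the header above; the function names and the choice of which columns to summarise are only illustrative, and the two embedded rows are copied from the output above.

```python
VMSTAT_COLUMNS = "r b w swpd free buff cache si so bi bo in cs us sy id".split()

def parse_vmstat(text):
    """Turn vmstat data rows into dicts, skipping headers and partial rows."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(VMSTAT_COLUMNS):
            continue
        if not all(f.isdigit() for f in fields):
            continue  # header line or damaged row
        rows.append(dict(zip(VMSTAT_COLUMNS, map(int, fields))))
    return rows

def summarize(rows):
    """Average blocks written out, context switches and us+sy CPU."""
    n = len(rows)
    return {
        "mean_bo": sum(r["bo"] for r in rows) / n,
        "mean_cs": sum(r["cs"] for r in rows) / n,
        "mean_us_plus_sy": sum(r["us"] + r["sy"] for r in rows) / n,
    }

if __name__ == "__main__":
    sample = """\
r b w swpd free buff cache si so bi bo in cs us sy id
3 0 0 20 7952 332768 491348 0 0 13 374 1947 1979 71 26 3
0 1 0 20 7992 332768 491344 0 0 11 285 1871 1776 70 28 3
"""
    print(summarize(parse_vmstat(sample)))
```

Comparing the block I/O averages against the us+sy figure is a quick check on whether the box is waiting on the disks or burning CPU.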
The average number of connections open in polyclt is about 150 at this
point, then it suddenly rises sharply, and everything comes apart in short
order. Shortly afterwards the cache can't maintain 100/sec for more than a few
minutes.
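
The link between open connections and response time can be worked out with Little's law (requests in flight = arrival rate x mean time in system). The sketch below assumes a steady state and that each open polyclt connection carries exactly one in-flight request; both are simplifications, and the function names are made up for illustration.

```python
def mean_response_seconds(open_connections, requests_per_second):
    """Little's law rearranged: mean time in system = in flight / arrival rate."""
    return open_connections / requests_per_second

def connections_needed(requests_per_second, response_seconds):
    """Little's law: requests in flight = arrival rate * mean time in system."""
    return requests_per_second * response_seconds

# About 150 open connections at 100 requests/sec implies ~1.5 s per request.
print(mean_response_seconds(150, 100))   # 1.5

# If responses stretch to 50 s at the same offered rate, the client would
# need about 5000 simultaneous connections to keep up.
print(connections_needed(100, 50.0))     # 5000.0
```

Under those assumptions, 50,000 ms durations at 100/sec would be enough on their own for polyclt to run out of descriptors.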
Now:
ASSUMPTION: Squid's lookup time for a given object does not grow
significantly as the number of objects increases. (It'll grow, but not
hugely)
GUESS: The slowdown is the operating system looking through a
half-gigabyte of cache-blocks to find objects, and actually managing to
exceed the time it would take to fetch the object from the physical
disk.
I could be missing something here. Of course I could. I've hardly slept
in the last couple weeks. Y'know the drill...it's crunch time here in
the dev-group, and deliverables are due on site on the 8th. Whee.
Sleep-time aside, I'm wondering how to ameliorate this consistent
falloff in response. I saw it with squid 1.1.22 when I tested it
(eventually falling to <30/sec as the cache increased in size, but
definitely not disk bound).
Requesting clue-transfer,
D
Received on Tue Jul 29 2003 - 13:15:57 MDT