Hi all,
I have a problem that I am currently trying to diagnose. Please
correct me if you think this should be on the -users list instead, but
I think its a bug and I need further help trying to diagnose it.
We run an array of 30 squid servers, sitting behind a load balancer.
From time-to-time the number of "available file descriptors" gradually
drops from 14K (when under normal load) down to zero, but only on
particular squid at a time. Each squid is configured with a max of 16K
file descriptors. Load across the 30 squid servers remains constant,
at around 200-300 req/sec for each server, and the number of available
file descriptors for the remaining servers remains at around 14K.
When this fault affects a particular server, it starts acting
strangely, like it just stops processing requests for particular web
site(s), while other web sites can still be accessed fine.
For example, google.com and yahoo.com might work OK, but bbc.co.uk
fail - no error messages back from squid or any thing, browser just
hangs. Browsing the same website through an unaffected squid server
(or directly) still works fine.
squidclient mgr:filedescriptors will show a whole heap of file
descriptors (ie 14,000), the vast majority being in the "half-closed"
state, but a few being in the "Reading next request" state.
Running a 'netstat -an' however, shows significantly less actual TCP
sockets (a few hundred), and of these, the majority are in the
CLOSE_WAIT state.
I am concerned that, for which ever reason, squid stops processing
requests for a particular website, and then fails to detect when
clients give up, incorrectly putting the FD into the "half-closed"
state, leading to the situation where the client closes the socket but
squid still thinks that the socket is open.
Dropping the squid server out of service on the load balancer to stop
actual traffic, and then running "squid -k debug" produces the
following messages for lots of different FDs (I presume its for all
FD's that squid thinks are active):
2008/06/23 10:57:55| comm_call_handlers(): got fd=11314 read_event=19
write_event=18 F->read_handler=0x8070890 F->write_handler=(nil)
2008/06/23 10:57:55| comm_call_handlers(): Calling read handler on fd=11314
2008/06/23 10:57:55| clientReadRequest: FD 11314: reading request...
2008/06/23 10:57:55| clientReadRequest: FD 11314 closed?
2008/06/23 10:57:55| cbdataLock: 0x655b16c0
2008/06/23 10:57:55| cbdataValid: 0x655b16c0
2008/06/23 10:57:55| cbdataUnlock: 0x655b16c0
2008/06/23 10:57:55| commSetSelect: FD 11314 type 1
2008/06/23 10:57:55| commSetEvents(fd=11314)
I've looked through the source code but cannot find what read_event=19
means (ie what kind of event is triggering the 'comm_call_handlers()'
call?
With no traffic going through the device, it consumes 14% CPU
processing these events.
I could set "half_closed_clients off", however, even at the start of
the decline in file descriptors (ie when there are still file
descriptors available) there are problems browsing certain websites,
so I think this will just mask a symptom of the problem rather than
fix it.
A simple restart of squid fixes the issue, but only for a while. Our
support guys are having to restart squid on various devices about 5-10
times a day at the moment in order to try minimise impact to our
customers.
We are running squid-2.6-STABLE18 on Linux (Centos 5), using the
epoll() event notification interface.
Any help will be much appreciated as I've run out of things to check.
Please let me know if you need any further information.
Many thanks in advance.
-- Bradley KiteReceived on Mon Jun 23 2008 - 11:47:34 MDT
This archive was generated by hypermail 2.2.0 : Tue Jun 24 2008 - 12:00:09 MDT