Hi,
we're fighting a very weird problem, running Squid (currently 1.1.17)
in a mesh of ten caches on Solaris 2.5.1 Ultras (2 CPUs, 256 MB).  Yes, we
do use poll(), and no, we have not (yet) raised the FD limit above 1024,
except on one machine.
Over the course of a few months we noticed a very high number of connections
in CLOSE_WAIT whenever a Squid became unresponsive to the point of being
unusable. We got curious and started to scan the connections to the local
Squid TCP port, tallying them by TCP state (a sketch of such a scan follows
the legend below). For instance:
HOST                 SUM   ESTAB   TWAIT   CWAIT   OTHER
cache-1              661       9       0     652       0 (+)
cache-2              711       4       0     707       0 (+)
cache-3              917     208     325     384       0
cache-4              790      39       0     751       0 (+)
cache-5             1126     191     518     416       1
cache-6              334       5       0     329       0 (+)
cache-7             1012     186     499     318       9
cache-8              947     246     371     329       1
cache-9              256      57     193       0       6 (*)
cache-10             555       0       0     555       0 (+)
(*) cache-9 is a special machine with 4096 FDs and a Squid NOVM.
(+) can be considered dead as far as caching is concerned.
  SUM: sum of the following columns.
ESTAB: connections in state ESTABLISHED.
TWAIT: connections in state TIME_WAIT (*not* eating up FDs).
CWAIT: connections in state CLOSE_WAIT (eating up FDs?).
OTHER: connections in other than the previous states.
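
As an aside, here is a minimal sketch of how such a tally could be
produced.  It assumes Squid's default http_port 3128, `netstat -an`-style
output on stdin, the local address in the first column and the state in
the last; all of these are assumptions about your setup, not a description
of how we actually ran the scan.

#!/usr/bin/env python
# Count TCP states of connections to the local Squid port.
# Usage: netstat -an | python tally_states.py
# Assumed: Squid listens on port 3128, the local address is the first
# column and the TCP state the last column (Solaris layout; on Linux
# the local address is the fourth column, adjust accordingly).
import sys
from collections import Counter

SQUID_PORT = "3128"
TCP_STATES = {
    "ESTABLISHED", "TIME_WAIT", "CLOSE_WAIT", "LISTEN", "CLOSING",
    "SYN_SENT", "SYN_RECEIVED", "SYN_RECV", "LAST_ACK",
    "FIN_WAIT_1", "FIN_WAIT_2", "FIN_WAIT1", "FIN_WAIT2",
}

counts = Counter()
for line in sys.stdin:
    fields = line.split()
    if len(fields) < 2 or fields[-1] not in TCP_STATES:
        continue                          # not a TCP connection line
    local = fields[0]
    # Solaris netstat prints host.port, Linux prints host:port.
    if not (local.endswith("." + SQUID_PORT) or local.endswith(":" + SQUID_PORT)):
        continue                          # not a connection to Squid
    state = fields[-1]
    if state in ("ESTABLISHED", "TIME_WAIT", "CLOSE_WAIT"):
        counts[state] += 1
    else:
        counts["OTHER"] += 1

print("SUM %5d  ESTAB %5d  TWAIT %5d  CWAIT %5d  OTHER %5d" % (
    sum(counts.values()), counts["ESTABLISHED"], counts["TIME_WAIT"],
    counts["CLOSE_WAIT"], counts["OTHER"]))
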
The state OTHER covers the remaining TCP states, i.e. SYN_*, FIN_* and
CLOSING; connections in state OTHER are of no importance here.  According
to the TCP state transition diagram, a connection in CLOSE_WAIT has
received a FIN segment from the peer, but the application has not yet
close()d the socket, which is what would send our own FIN. We therefore
consider connections in CLOSE_WAIT "evil", because they eat up the
precious resource of file descriptors, even though CLOSE_WAIT is a
perfectly valid and useful state in other protocols. [1]
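
To make that transition concrete, here is a tiny self-contained sketch
(our illustration, not Squid code) showing that a socket whose peer has
already sent its FIN lingers in CLOSE_WAIT until the application
close()s it:

import socket

# Illustration only: a local socket pair, where the "server" side
# plays the role of the peer that closes first.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)

client = socket.create_connection(listener.getsockname())
server, _ = listener.accept()

server.close()                     # the peer sends its FIN ...
assert client.recv(4096) == b""    # ... which we read as end-of-file.

# Right now "netstat -an" would show the client socket in CLOSE_WAIT:
# the peer's FIN has arrived, but we have not yet sent ours.  The file
# descriptor stays allocated until the application closes it.
client.close()                     # CLOSE_WAIT -> LAST_ACK; fd released
listener.close()
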
Compared to a traffic light, a "green" cache has very few connections in
CLOSE_WAIT (about as many as in ESTAB) and many in TIME_WAIT. A loaded
"yellow" cache has no more connections in TIME_WAIT and ESTAB than in
CLOSE_WAIT. An unresponsive "red" cache has a very large number of
connections in CLOSE_WAIT (at least half your max FDs) and almost none in
ESTAB or TIME_WAIT.
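
In code, the rule of thumb might look roughly like this; the exact
thresholds are our guesses at "about as many" and "almost none", and we
read "TIME_WAIT and ESTAB" as a combined count:

def traffic_light(estab, twait, cwait, max_fds=1024):
    # Rough classification of a cache from its connection-state counts.
    # The thresholds are our interpretation of the rule of thumb above.
    if cwait >= max_fds // 2 and (estab + twait) < cwait // 10:
        return "red"       # unresponsive: CLOSE_WAIT has eaten the FDs
    if (estab + twait) <= cwait:
        return "yellow"    # loaded: CLOSE_WAIT has caught up
    return "green"         # healthy: few CLOSE_WAIT, plenty of TIME_WAIT

print(traffic_light(215, 419, 217))   # cache-1, second chart: green
print(traffic_light(4, 0, 707))       # cache-2, first chart:  red
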
Strangely enough, a NOVM Squid does not display any of this behaviour.
Do you have any idea how we could get rid of those wicked CLOSE_WAIT
connections (besides switching completely to NOVM)? Why is there such a
difference between the regular and the NOVM Squid? At almost regular
intervals during business hours the various caches become unresponsive,
and only killing the squid process revives them.
If you need more information, http://statistics.www-cache.dfn.de/ shows
the statistics gathered from the various caches. The blank parts in the
FDs diagram and the other diagrams mark the times when a cache was
unresponsive to a cache_manager query; if a cache doesn't answer a
cache_manager query, it won't answer its users, either.
Here, for comparison, a chart of busy but responsive caches:
HOST                 SUM   ESTAB   TWAIT   CWAIT   OTHER
cache-1              854     215     419     217       3
cache-2              273       2       0     271       0 (+)
cache-3             1212     160     590     461       1
cache-4              643     144     265     232       2
cache-5             1066     136     481     444       5
cache-6              958     195     624     133       6
cache-7             1117     185     668     260       4
cache-8             1147     155     607     382       3
cache-9              592      89     498       2       3 (*)
cache-10            1216     337     417     459       3
(*) cache-9 is a special machine with 4096 FDs and a Squid *NOVM*.
    The fact that this cache is busy is shown by the "high" number
    of 2 connections in CLOSE_WAIT.
(+) currently restoring its URL database; we won't interrupt that.
We would appreciate *any* idea that might help resolve this mystery!
[1] This reminds me: somewhere along the way, Squid lost its ability to
handle half-closes, as in: I send my query, half-close my connection (e.g.
using Stevens' sock tool), but still want to receive the answer before the
complete close. Half-closed connections would account for CLOSE_WAIT
connections, but Squid's behaviour is different and time-dependent: if my
FIN reaches Squid before it has sent its answer, Squid tears down the
connection without ever sending the answer. This is fine for now, but with
the evolution of HTTP it might become necessary to review this behaviour.
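
For illustration, a sketch of such a half-closing client; the real sock
tool is a C program, this is merely the same idea, and the host, port and
URL below are placeholders, not our setup:

import socket

# Half-closing client in the spirit of Stevens' sock tool: send the
# request, shut down the write side (our FIN goes out, so the server's
# end of the connection enters CLOSE_WAIT), then keep reading until
# the full answer has arrived.
sock = socket.create_connection(("proxy.example", 3128))   # placeholder
sock.sendall(b"GET http://www.example.com/ HTTP/1.0\r\n\r\n")
sock.shutdown(socket.SHUT_WR)          # half close: write side only

reply = []
while True:
    chunk = sock.recv(4096)
    if not chunk:                      # server has sent its FIN, too
        break
    reply.append(chunk)
sock.close()
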
Thanx,
Jens-S. Vöckler (voeckler@rvs.uni-hannover.de)
Christian Grimm (grimm@rvs.uni-hannover.de)
Institute for Computer Networks and Distributed Systems
University of Hanover, Germany