Hi,
Sorry, I pressed the send button by mistake ...
We are having strange Squid troubles, at first, let me describe our setup:
- 4 HP G6/G7 DL380 servers with 16CPUs and 28GB RAM with RHEL 5.4-5.8
64bit and Squid 3.1.12 (custom compiled)
Squid Cache: Version 3.1.12
configure options: '--enable-ssl' '--enable-icap-client'
'--sysconfdir=/etc/squid' '--enable-async-io' '--enable-snmp'
'--enable-poll' '--with-maxfd=32768' '--enable-storeio=aufs'
'--enable-removal-policies=heap,lru' '--enable-epoll'
'--disable-ident-lookups' '--enable-truncate'
'--with-logdir=/var/log/squid' '--with-pidfile=/var/run/squid.pid'
'--with-default-user=squid' '--prefix=/opt/squid' '--enable-auth=basic
digest ntlm negotiate'
'-enable-negotiate-auth-helpers=squid_kerb_auth'
--with-squid=/home/squid/squid-3.1.12 --enable-ltdl-convenience
- Each server has two instances for kerberos/ntlm authentication and
two instances for LDAP authentication (different customers)
- we have a hardware loadbalancer which is balancing request for our
kerberos-customers (4x2 instances) and ldap-customers (4x2 instances),
each has a different IP address.
- average load values are approx 0.5 (5min values)
- approx 60RPS per instance (equally distributed -> 16 * 60 => 960 RPS)
- up to 150Mbit/s traffic per server
- ICAP servers for content adaption (multiple servers with a hardware
loadbalancer in front of it)
From time to time we are having troubles with our Squid servers which
may not be a problem related to Squid, I suspect an OS issue.
Nevertheless, sometimes the servers don't respond to request (even
SSH-requests) or logging in takes forever (reverse lookup failure?) or
even worse, sometimes the server interface is just down (there is no
indication of any problem at the switch port level). If we check the
squidclient output, we can see some hanging ldap authenticators:
squid_at_xlsqit01 /opt/squid/bin $ ./squidclient -h 10.122.125.23
cache_object://10.122.125.23/basicauthenticator
HTTP/1.0 200 OK
Server: squid/3.1.12
Mime-Version: 1.0
Date: Tue, 13 Mar 2012 13:34:07 GMT
Content-Type: text/plain
Expires: Tue, 13 Mar 2012 13:34:07 GMT
Last-Modified: Tue, 13 Mar 2012 13:34:07 GMT
X-Cache: MISS from xlsqip02_3
Via: 1.0 xlsqip02_3 (squid/3.1.12)
Connection: close
Basic Authenticator Statistics:
program: /opt/squid/libexec/squid_ldap_auth
number active: 20 of 20 (0 shutting down)
requests sent: 13316
replies received: 13312
queue length: 0
avg service time: 4741 msec
# FD PID # Requests Flags Time Offset Request
1 12 16038 2150 B 125.885 0 user1 pw1\n
2 24 16043 85 B 119.562 0 user2 pw2\n
3 32 16049 63 B 13.639 0 user3 pw3\n
4 43 16055 21 B 116.143 0 user4 pw4\n
5 46 16059 12 189.002 0 (none)
6 50 16064 1 189.003 0 (none)
7 56 16069 2 0.079 0 (none)
8 60 16074 0 0.000 0 (none)
9 65 16079 0 0.000 0 (none)
10 86 16084 0 0.000 0 (none)
11 88 16095 0 0.000 0 (none)
12 90 16101 0 0.000 0 (none)
13 92 16117 0 0.000 0 (none)
14 95 16122 0 0.000 0 (none)
15 97 16130 0 0.000 0 (none)
16 99 16138 0 0.000 0 (none)
17 101 16144 0 0.000 0 (none)
18 104 16150 0 0.000 0 (none)
19 107 16162 0 0.000 0 (none)
20 109 16173 0 0.000 0 (none)
Flags key:
B = BUSY
W = WRITING
C = CLOSING
S = SHUTDOWN PENDING
2012/03/13 03:00:04| Ready to serve requests.
squid_ldap_auth: WARNING, could not bind to binddn 'Can't contact LDAP server'
squid_ldap_auth: WARNING, could not bind to binddn 'Can't contact LDAP server'
squid_ldap_auth: WARNING, could not bind to binddn 'Can't contact LDAP server'
squid_ldap_auth: WARNING, could not bind to binddn 'Can't contact LDAP server'
squid_ldap_auth: WARNING, could not bind to binddn 'Can't contact LDAP server'
squid_ldap_auth: WARNING, could not bind to binddn 'Can't contact LDAP server'
squid_ldap_auth: WARNING, could not bind to binddn 'Can't contact LDAP server'
squid_ldap_auth: WARNING, could not bind to binddn 'Can't contact LDAP server'
squid_ldap_auth: WARNING, could not bind to binddn 'Can't contact LDAP server'
squid_ldap_auth: WARNING, could not bind to binddn 'Can't contact LDAP server'
Testing the ldap authentication at CLI level, it is working without
any problems:
root_at_xlsqip02 ~ # /opt/squid/libexec/squid_ldap_auth -b
"dc=squid-proxy" -D "uid=...." -w xxx -h ldaphost -f "(uid=%s)"
user1 pw1
OK
Unfortunately, there is nothing helpful in syslog, e.g.
Mar 13 15:05:19 xlsqip02 last message repeated 2 times
Mar 13 15:05:25 xlsqip02 winbindd[4283]: [2012/03/13 15:05:25, 0]
libsmb/clientgen.c:cli_receive_smb(111)
Mar 13 15:05:25 xlsqip02 winbindd[4283]: Receiving SMB: Server
stopped responding
Mar 13 15:05:25 xlsqip02 winbindd[4283]: [2012/03/13 15:05:25, 0]
rpc_client/cli_pipe.c:rpc_api_pipe(790)
Mar 13 15:05:25 xlsqip02 winbindd[4283]: rpc_api_pipe: Remote
machine wienroot1.wien.rbgat.net pipe \lsarpc fnum 0x4008returned
critical error. Error was Call timed out: server did not respond after
10000 milliseconds
Mar 13 15:05:48 xlsqip02 sockd[4235]: warning: accept(2) failed:
Resource temporarily unavailable (errno = 11)
Mar 13 15:06:20 xlsqip02 last message repeated 7 times
Mar 13 15:07:26 xlsqip02 last message repeated 4 times
Mar 13 15:08:27 xlsqip02 last message repeated 4 times
Mar 13 15:09:30 xlsqip02 last message repeated 10 times
Mar 13 15:10:37 xlsqip02 last message repeated 7 times
Mar 13 15:11:39 xlsqip02 last message repeated 11 times
Mar 13 15:12:55 xlsqip02 last message repeated 9 times
Mar 13 15:12:57 xlsqip02 winbindd[4331]: [2012/03/13 15:12:57, 0]
libsmb/credentials.c:creds_client_check(324)
Mar 13 15:12:57 xlsqip02 winbindd[4331]: creds_client_check:
credentials check failed.
Mar 13 15:12:57 xlsqip02 winbindd[4331]: [2012/03/13 15:12:57, 0]
rpc_client/cli_netlogon.c:rpccli_netlogon_sam_network_logon(1030)
Mar 13 15:12:57 xlsqip02 winbindd[4331]:
rpccli_netlogon_sam_network_logon: credentials chain check failed
Mar 13 15:13:05 xlsqip02 sockd[4235]: warning: accept(2) failed:
Resource temporarily unavailable (errno = 11)
btw, winbind just sucks ... But I doubt that winbind is the root cause ...
Anyway, we had some NIC issues before (packet drops), at the moment we
disabled all TSO-stuff
root_at_xlsqip02 ~ # ethtool -k eth0
Offload parameters for eth0:
Cannot get device udp large send offload settings: Operation not supported
rx-checksumming: off
tx-checksumming: off
scatter-gather: off
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off
generic-receive-offload: off
root_at_xlsqip02 ~ # ethtool -i eth0
driver: bnx2
version: 1.9.3
firmware-version: 4.6.4 NCSI 1.0.3
bus-info: 0000:02:00.0
root_at_xlsqip02 ~ # ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX: 1020
RX Mini: 0
RX Jumbo: 4080
TX: 255
Current hardware settings:
RX: 1020
RX Mini: 0
RX Jumbo: 0
TX: 255
netstat output, if interesting:
root_at_xlsqip02 ~ # netstat -s
Ip:
1031106057 total packets received
32 with invalid addresses
0 forwarded
0 incoming packets discarded
1031105815 incoming packets delivered
943692708 requests sent out
214 dropped because of missing route
34 reassemblies required
17 packets reassembled ok
Icmp:
77877 ICMP messages received
339 input ICMP message failed.
ICMP input histogram:
destination unreachable: 31124
timeout in transit: 3011
echo requests: 43271
echo replies: 467
43804 ICMP messages sent
0 ICMP messages failed
ICMP output histogram:
destination unreachable: 66
echo request: 467
echo replies: 43271
IcmpMsg:
InType0: 467
InType3: 31124
InType8: 43271
InType11: 3011
OutType0: 43271
OutType3: 66
OutType8: 467
Tcp:
12276761 active connections openings
15459221 passive connection openings
10856 failed connection attempts
875610 connection resets received
899 connections established
1027715456 segments received
937603201 segments send out
3241669 segments retransmited
4510 bad segments received.
1712333 resets sent
Udp:
2017646 packets received
13 packets to unknown port received.
0 packet receive errors
2804237 packets sent
TcpExt:
1717 resets received for embryonic SYN_RECV sockets
4122 packets pruned from receive queue because of socket buffer overrun
2794 ICMP packets dropped because they were out-of-window
10477659 TCP sockets finished time wait in fast timer
105 time wait sockets recycled by time stamp
15872733 delayed acks sent
50531 delayed acks further delayed because of locked socket
Quick ack mode was activated 5087770 times
742212 packets directly queued to recvmsg prequeue.
4525346 packets directly received from backlog
88460270 packets directly received from prequeue
444903256 packets header predicted
286109 packets header predicted and directly queued to user
139821827 acknowledgments not containing data received
291299405 predicted acknowledgments
277775 times recovered from packet loss due to fast retransmit
1646 times recovered from packet loss due to SACK data
5 bad SACKs received
Detected reordering 82 times using FACK
Detected reordering 62 times using SACK
Detected reordering 1489 times using reno fast retransmit
Detected reordering 12950 times using time stamp
13026 congestion windows fully recovered
83165 congestion windows partially recovered using Hoe heuristic
TCPDSACKUndo: 5989
18959 congestion windows recovered after partial ack
3362 TCP data loss events
TCPLostRetransmit: 1
57745 timeouts after reno fast retransmit
1198 timeouts after SACK recovery
31172 timeouts in loss state
473117 fast retransmits
1478 forward retransmits
1445453 retransmits in slow start
888214 other TCP timeouts
TCPRenoRecoveryFail: 134832
27 sack retransmits failed
589052 packets collapsed in receive queue due to low socket buffer
4622235 DSACKs sent for old packets
9737 DSACKs sent for out of order packets
15792 DSACKs received
1 DSACKs for out of order packets received
421669 connections reset due to unexpected data
2977 connections reset due to early user close
11177 connections aborted due to timeout
IpExt:
InBcastPkts: 1294824
OutBcastPkts: 648179
root_at_xlsqip02 ~ # uptime
15:43:02 up 6 days, 54 min, 2 users, load average: 0.12, 0.14, 0.16
Usually, if we get high squid authenticators response time, there is
an issue, e.g.:
http://desmond.imageshack.us/Himg840/scaled.php?server=840&filename=squid3ldapauthenticator.png&res=medium
sysctl-values: http://pastie.org/3586014
Sometime restarting the network helps or rebooting the server, but
during the last couple of days, these issue occur way too often.
Has anybody any idea about what else we could check or experienced
anything similar? Thanks!
regards
Peter
On Tue, Mar 13, 2012 at 3:07 PM, guest01 <guest01_at_gmail.com> wrote:
> Hi,
>
> We are having strange Sq
Received on Tue Mar 13 2012 - 14:55:00 MDT
This archive was generated by hypermail 2.2.0 : Wed Mar 14 2012 - 12:00:03 MDT