Sure. Let the world learn from my mistakes. But first a bit of background:
I work for a company that supplies internet access to schools in the US. In
following CIPA (http://www.ala.org/cipa/) we have to provide internet
filtering (via an 8e6 Technologies R3000 filtering appliance). Hence the
central proxy servers mentioned earlier. As an additional filtering
measure, I've created a form where the customers can (on the customer
premises equipment) add other sites they would like blocked. This form
allows for blocking just a particular page on a site, a subfolder of a site,
or a whole domain. All they enter is a URL, and they choose whether to block
the site or the domain. The regular expressions are generated automatically,
added to the filterurls file, and Squid is reloaded (the squid.conf side of
this is sketched after the examples below). Here are some example types,
requested URLs, and the resultant regexes:
Site: http://gamesondemand.yahoo.com
http://(.*@)?(www\.)?gamesondemand.yahoo.com/.*
Site: http://www.bpssoft.com/powertools/library.htm
http://(.*@)?(www\.)?bpssoft.com/powertools/library.htm
Site: http://www2.photeus.com:8090/~ewot/
(http(s)?://)?(.*@)?(www\.)?www2.photeus.com(:.*)?/~ewot
Domain: http://mail.com
(http(s)?://)?(.*@)?(.*\.)?mail.com(/.*)?
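For completeness, the generated patterns get wired into Squid roughly like
this (the ACL name and path below are just placeholders for the sketch; the
file is the filterurls file mentioned above):

  # squid.conf: read one regex per line from the generated file
  acl filterurls url_regex -i "/etc/squid/filterurls"
  http_access deny filterurls
  # after the form rewrites the file, its script reloads Squid with:
  #   squid -k reconfigure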
Now you may be looking at the regular expressions and asking yourself "What
the hell was he thinking?". I don't blame you. In retrospect, these
regexes are overkill. We had a problem with our filtering service (8e6
Technologies XStop on IIS) at one point where a site that would normally be
blocked (say http://www.playboy.com/ for example) would pass the filtering
service if HTTP authentication was used (http://anystring@www.playboy.com).
To compensate, I gave the customers the power to block sites on a
case-by-case basis and made sure those blocks would cover this situation.
Obviously
(again in retrospect) I was being a bit too specific. Then again, I created
this function over two years ago, and my customers have just started really
using this feature, which is what was causing the problems. Go figure.
With one or two sites being blocked this way, and as little traffic as most
of my sites consume, Squid was okay with my incompetence (inexperience?
naiveté?). Once more sites are blocked, matching these complex regexes
gets to be overwhelming.
I'm still working on rewriting the regexes for the above requests. As it
stands now, I'm blocking any domain that has a Site block associated with it
(i.e. all of bpssoft.com is being blocked at the example site).
Here's the proposed solution for site blocking (using url_regex):
(www\.)?gamesondemand\.yahoo\.com/
(www\.)?bpssoft\.com/powertools/library\.htm
(www\.)?www2\.photeus\.com(:[0-9]+)?/~ewot/
This is not as exact, as any URL containing these strings (such as a
Netcraft query) will be blocked, but sadly filtering is not an exact science,
and at least they can surf.
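To spell out the trade-off with the first pattern above (the second URL is a
made-up lookup request, purely for illustration):

  http://gamesondemand.yahoo.com/some/game.html             blocked, as intended
  http://example.com/lookup?site=gamesondemand.yahoo.com/   also blocked, since
    url_regex is matched against the whole requested URL and the pattern is
    unanchored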
Blocked (and allowed) domains have already been moved to a new acl using
dstdom_regex:
(.*\.mail\.com|mail\.com)$
This gives more exact results (and quickly), but it can't be used to match
just a page or a subfolder of a site.
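In squid.conf terms the new acl looks something like this (again, the ACL
name and path are just placeholders):

  # whole-domain blocks: the regex is only matched against the host name,
  # which is what makes this so much cheaper than the old url_regex lines
  acl blockeddomains dstdom_regex -i "/etc/squid/blockeddomains"
  http_access deny blockeddomains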
If someone has suggestions on how to make these more granular while
maintaining efficiency, I'm all ears.
Chris
-----Original Message-----
From: Dave Holland [mailto:dh3@sanger.ac.uk]
Sent: Thursday, November 04, 2004 3:10 AM
To: squid-users@squid-cache.org
Subject: Re: [squid-users] Sporadic high CPU usage, no traffic
On Tue, Nov 02, 2004 at 10:56:28AM -0900, Chris Robertson wrote:
> before, neat). I was using two url_regex acls, and the regular expressions
> I was using seem to be the problem. Removing those two lines dropped CPU
> usage from a low of 50% to a HIGH of 10%. Yikes. Off to optimize them.
It would be interesting to see those url_regex lines, if you're willing
to share them?
thanks,
Dave
--
** Dave Holland ** Systems Support -- Special Projects Team **
** 01223 494965 ** Sanger Institute, Hinxton, Cambridge, UK **
"Always remember: you're unique. Just like everybody else."