Hi
I have been working on a log-analysis script for squid.
Basically:
The squid logs are damn big. We want to see funky stats for the whole
time the caches have been running, but they come in at about 1/4 gig a day,
so you cannot keep a month's worth of stats.
(Does this problem sound familiar?)
We also want to be able to say 'show me the hit rate by bytes between
1st September and 1st of October... last year'.
Now - since the logs are huge, the second option becomes impossible for a
heavily loaded cache.
Solution:
Split the logs into 100 second segments. Summarize the 100 seconds
and dump the data into a database. Even keeping almost every stat
I can think of offhand (with some exceptions below), this reduces my
400MB of data to a grand total of 800KB a day. And I can pull all of the
stats out of it in 6 seconds.
(analysis is about 3->4 times faster than calamaris - see
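To make the idea concrete, here is a rough sketch in Python (not the Perl from the tarball). It assumes squid's native access.log layout (timestamp, elapsed, client, code/status, bytes, ...), treats any result code containing "HIT" as a hit, and keeps only four counters per segment. The function names and the counter layout are made up for illustration.

```python
import dbm

BUCKET_SECONDS = 100  # segment length described above


def summarize(lines):
    """Fold native-format log lines into per-segment counters.

    Returns {segment_start: [requests, bytes, hits, hit_bytes]}.
    """
    buckets = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip malformed lines
        timestamp = float(fields[0])
        code = fields[3].split("/")[0]  # "TCP_HIT" from "TCP_HIT/200"
        size = int(fields[4])
        start = int(timestamp) // BUCKET_SECONDS * BUCKET_SECONDS
        counts = buckets.setdefault(start, [0, 0, 0, 0])
        counts[0] += 1
        counts[1] += size
        if "HIT" in code:  # assumption: this is what counts as a hit
            counts[2] += 1
            counts[3] += size
    return buckets


def store(path, buckets):
    """Write one record per segment, keyed by the segment start time.

    An existing record for the same segment is overwritten, not merged.
    """
    with dbm.open(path, "c") as db:
        for start, counts in buckets.items():
            db[str(start)] = " ".join(str(c) for c in counts)


def byte_hit_rate(path, t_from, t_to):
    """Hit rate by bytes over segments with t_from <= start < t_to."""
    total = hit = 0
    with dbm.open(path, "r") as db:
        for key in db.keys():
            if t_from <= int(key.decode()) < t_to:
                _, nbytes, _, hit_bytes = (int(v) for v in db[key].decode().split())
                total += nbytes
                hit += hit_bytes
    return hit / total if total else 0.0
```

The 'hit rate by bytes between two dates' question then becomes a call to byte_hit_rate() with two epoch times, and it only ever touches the small summary records. Note that this sketch scans every key; a store with ordered keys could seek straight to the start of the range.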
Problems:
1) You can't keep 'the top site this week was'. Since the key of the
database is the time, per-site stats don't fit. I have toyed with
the idea of keeping a separate database for site statistics
(I keep a separate one for stats on which IPs are denied access).
Since I am mostly interested in the ratio of '.com', '.net' vs local
sites I just summarise that.
2) Things may slow down with a large dataset. I have no idea how
db format files are going to handle a huge database with keys
that are numeric (and always increasing). I have only done
stuff with a day's worth of data.
You could always rotate databases though.
3) Perl does some weird stuff with associative arrays. In normal Perl
you can have a 'hash of arrays' (one key, multiple elements
associated with it). When you bind it to a DB file you can't
do this, because each value has to be a single string. You have to
join the elements on the way in and 'split' them on the way out.
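The join-and-split workaround from point 3 is not specific to Perl; any flat key/value file that stores one string per key needs it. A minimal sketch in Python, assuming the separator never appears inside an element (put_list and get_list are hypothetical helper names):

```python
import dbm

SEP = ","  # assumption: no element ever contains this character


def put_list(db, key, items):
    """Store several elements under one key as a single joined string."""
    db[key] = SEP.join(str(item) for item in items)


def get_list(db, key):
    """Read the joined string back and split it into its elements."""
    raw = db[key].decode()
    return raw.split(SEP) if raw else []
```

Everything comes back as a string, so numeric fields have to be converted again after the split.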
Problems with my current code:
1) The user interface sucks. I don't really have one. I am going to
write a web-interface for it so you can do funky things. Please
don't concentrate on this for the moment. I am going to write
this soon.
2) I can't open a database for reading only. This means that if you
try and read stats from a database that doesn't exist it
creates an empty one. :) The fix will take less time than
this message.
3) I don't know much about perl modules. Some of my code is probably
way off course!
4) I use global variables... if you can give me a reasonable 'struct'
like thingum that works with a DB tie/opendb please let
me know! My current method sucks, since you have to
read the data back from the database in exactly the same order
that you wrote it, or you start messing with values that aren't
supposed to be messed with...
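One way around the ordering problem in point 4 is to define the field order in exactly one place and go through a pack/unpack pair everywhere else. A small sketch in Python, with made-up field names (the real scripts keep more fields than this):

```python
# Hypothetical field names; the single source of truth for record order.
FIELDS = ("requests", "bytes", "hits", "hit_bytes")


def pack(record):
    """Turn a dict of named counters into the string stored in the database."""
    return " ".join(str(record[name]) for name in FIELDS)


def unpack(raw):
    """Turn a stored string back into a dict keyed by field name."""
    values = raw.split()
    if len(values) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(values)))
    return dict(zip(FIELDS, (int(v) for v in values)))
```

The rest of the code then refers to record["hits"] rather than a position, and a record with the wrong number of fields fails loudly instead of silently shifting values.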
I would appreciate it if you guys could have a look. I would like to
include this in squid-1.2 if Duane thinks it's worthwhile.
For you to do:
1) Check that we can calculate all the stats that you want from the
info we keep.
2) Write scripts to get the stats out. output-template.pl is a good
place to start, I guess.
3) Create a separate config file
4) Generalise the 'coza' and 'za' cases that I have added for my use.
Perhaps create a separate database that keeps track of domains
(along with their bytes). You could create a 'watch'
key that keeps a list of IPs that you want to keep an eye
on. You would have to have a util to add and remove sites
from this list...
5) Keep track of errors.
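For item 4, a generalised version could take the list of interesting suffixes from the config and count bytes per suffix. A sketch in Python, assuming 'coza' and 'za' mean the .co.za and .za suffixes; the suffix list and function names are placeholders:

```python
from urllib.parse import urlsplit

# Hypothetical config value; would come from the separate config file.
WATCHED_SUFFIXES = (".co.za", ".za", ".com", ".net")


def classify(url, suffixes=WATCHED_SUFFIXES):
    """Return the longest watched suffix matching the URL's host, else 'other'."""
    host = (urlsplit(url).hostname or "").lower()
    for suffix in sorted(suffixes, key=len, reverse=True):
        if host.endswith(suffix):
            return suffix
    return "other"


def bytes_by_suffix(entries):
    """Sum bytes per suffix for an iterable of (url, size) pairs."""
    totals = {}
    for url, size in entries:
        label = classify(url)
        totals[label] = totals.get(label, 0) + size
    return totals
```

Trying the longest suffix first keeps .co.za hosts out of the plain .za bucket. URLs without a scheme (for example the host:port form logged for CONNECT) land in 'other' in this sketch.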
Anything else?
I throw some fields away to suit our cache setup here.... things like
sibling hits aren't counted (since we also analyse the logs of the
caches that served that hit, counting them would merely double some stats).
The analysis scripts are currently v0.3. I expect to have another
version out tomorrow or the next day... so don't expect things to be
static - even the fields may change... :)
ftp://ftp.is.co.za/private/oskar/database-stats-0.3.tar.gz
Oskar
--
"Haven't slept at all. I don't see why people insist on sleeping. You feel so much better if you don't. And how can anyone want to lose a minute - a single minute of being alive?" -- Think Twice
Received on Tue Jul 29 2003 - 13:15:45 MDT