This is totally from memory, and untested, but it should get you
started. Requires:
Python http://www.python.org/
WebLog classes http://www.pobox.com/~mnot/script/python/WebLog/
To use: cat access_log | ./script_filename
<----- cut here ----->
#!/usr/bin/env python
from weblog import squid, url
import sys

def make_domain(host):
    ''' Take a fully-qualified hostname and return the domain. Many
    ways to do this; this one keeps two labels for the common
    top-level domains and three otherwise. '''
    import string
    parts = string.split(host, '.')
    if parts[-1] in ['com', 'org', 'net', 'edu', 'gov', 'mil']:
        return string.join(parts[-2:], '.')
    else:
        return string.join(parts[-3:], '.')

o_log = squid.AccessParser(sys.stdin)
log = url.Parser(o_log)
domains = {}
while log.getlogent():
    # Compare strings with !=, not 'is not' (identity test)
    if log.log_tag != 'TCP_DENIED':
        domain = make_domain(log.url_host)
        # .get() avoids a KeyError the first time a domain is seen
        domains[domain] = domains.get(domain, 0) + 1

for domain in domains.keys():
    print domain
<----- cut here ----->
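If you don't have the WebLog classes handy, the same idea can be sketched in plain Python with no dependencies, parsing the native access.log format shown in the quoted message directly. This is an untested-against-real-logs sketch under that assumption; `domains_from_log` and `make_domain` are illustrative names, and the TLD heuristic is the same rough one as above, not a real public-suffix lookup.

```python
from urllib.parse import urlparse

COMMON_TLDS = ('com', 'org', 'net', 'edu', 'gov', 'mil')

def make_domain(host):
    # Keep two labels for the common TLDs (www.excite.com -> excite.com),
    # three otherwise -- a rough heuristic, not a public-suffix lookup.
    parts = host.split('.')
    n = 2 if parts[-1] in COMMON_TLDS else 3
    return '.'.join(parts[-n:])

def domains_from_log(lines):
    '''Yield each unique domain from squid native access.log lines,
    skipping TCP_DENIED entries.  Native-format fields are:
    time elapsed client tag/status size method URL ident hierarchy type.'''
    seen = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 7:
            continue                       # skip malformed lines
        tag = fields[3].split('/')[0]      # 'TCP_HIT' from 'TCP_HIT/200'
        if tag == 'TCP_DENIED':
            continue
        host = urlparse(fields[6]).hostname
        if host is None:
            continue                       # URL field with no hostname
        domain = make_domain(host)
        if domain not in seen:             # emit each domain once
            seen[domain] = 1
            yield domain
```

A driver is just `for d in domains_from_log(sys.stdin): print(d)`, run as `./script < access.log`.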
> -----Original Message-----
> From: Francis A. Vidal [mailto:francis@usls.edu]
> Sent: Monday, October 05, 1998 3:45 PM
> To: Squid Users List
> Subject: OFF-TOPIC: Help on script
>
>
> hello everyone,
>
> i'm trying to build a list of sites that i want to ban. i'm getting the
> list from the logfile of all the sites that have been visited by all
> users.
>
> this is the format of the logfile:
>
> 907389399.705 61 192.168.2.57 TCP_HIT/200 2172 GET
> http://www.excite.com/pfp/excite/images/big_logo.gif - NONE/-
> image/gif
>
> can someone help me on creating a script that will extract all domains
> that have no TCP_DENIED tag to a file, with no duplicates? i'm not
> familiar with sed, gawk or perl, so i need your help on this.
>
> i would like the format to be (from the above example) one domain per
> line:
>
> excite.com
>
>
>
> thanks!
>
> ---
> u s l s N E T university of st. la salle, bacolod city, philippines
> . . . . . . . PGP key at ftp://ftp.usls.edu/pub/pgpkeys/francis.pgp
> francis vidal tel. nos. (6334).435.2324 / 433.3526
>
Received on Sun Oct 04 1998 - 23:06:31 MDT
This archive was generated by hypermail pre-2.1.9 : Tue Dec 09 2003 - 16:42:20 MST