I have a slight variation on the "what URLs does squid know about"
question... It would be useful to be able to use squid to reduce web
crawling overhead for a search engine, not merely by direct caching,
but by having secondary indexers get an explicit list of what pages
are "free" to fetch.
I've come up with a few ways of doing this, all of which have flaws:
1) just export the squid logs (log-scraping sketch after this list)
* not incremental
* not accurate - they show what has been seen, but not what is
still around
2) use a redirect_program (pass-through redirector sketch after this list)
* performance risk
* *only* incremental
* requires new programs on the cache box
3) squidclient cachemgr:objects
* not incremental (but fast enough to make this less of a problem)
* only has MD5 hashes for objects that aren't still in memory
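To make (1) concrete, here is roughly the log scraping I had in mind.
It is only a sketch: it assumes the default native access.log format
(the seventh field is the URL) and guesses at the log path, so adjust
both for the actual install.

    import sys

    def urls_from_log(path="/var/log/squid/access.log"):
        """Collect the distinct URLs for requests squid answered with a 200."""
        seen = set()
        for line in open(path):
            fields = line.split()
            if len(fields) < 7:
                continue                       # skip truncated/malformed lines
            result, url = fields[3], fields[6]  # e.g. TCP_HIT/200, then the URL
            if result.endswith("/200"):
                seen.add(url)
        return seen

    if __name__ == "__main__":
        for url in sorted(urls_from_log(*sys.argv[1:])):
            print(url)

That still has both flaws listed above, of course; it just shows how
little parsing is actually involved.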
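And for (2), the sort of pass-through redirector I was picturing: it
records each URL somewhere and tells squid not to rewrite anything.
This is a sketch against the squid 2.x redirector protocol as I
understand it (one "URL client_ip/fqdn ident method" line per request
on stdin, one line back per request, a blank line meaning "no
change"); the seen-urls log path is made up.

    #!/usr/bin/env python
    import sys

    SEEN_LOG = "/var/log/squid/seen-urls.log"   # made-up path, adjust to taste

    def main():
        log = open(SEEN_LOG, "a")
        while True:
            line = sys.stdin.readline()
            if not line:
                break
            fields = line.split()
            if fields:
                log.write(fields[0] + "\n")     # first field is the URL
                log.flush()
            sys.stdout.write("\n")              # blank line = leave URL alone
            sys.stdout.flush()                  # unbuffered, or squid will hang

    if __name__ == "__main__":
        main()

It would get wired up with redirect_program in squid.conf, which is
exactly the "requires new programs on the cache box" objection.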
The last option was the most interesting until I discovered the
in-memory distinction -- I've learned a bunch more about squid
internals in the process, though :-) Basically, vm_objects lists
every object whose URL is still held in memory; objects lists
everything, but for the on-disk-only entries it reports nothing except
the MD5 hash key (since that's all squid keeps in core for them).
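For what it's worth, this is how I've been splitting the objects
report apart. It is a sketch that assumes the layout I saw (each entry
starts with a "KEY <hash>" line, and only the in-memory entries follow
it with a "METHOD URL" line) and that your squidclient spells the
request mgr:objects.

    import subprocess

    METHODS = ("GET ", "HEAD ", "POST ", "PUT ")

    def split_objects(host="localhost", port="3128"):
        report = subprocess.run(
            ["squidclient", "-h", host, "-p", port, "mgr:objects"],
            capture_output=True, text=True, check=True).stdout
        with_url = {}        # hash key -> URL, for entries still in memory
        key_only = []        # hash keys where the URL is already gone
        key = None
        for line in report.splitlines():
            line = line.strip()
            if line.startswith("KEY "):
                if key is not None and key not in with_url:
                    key_only.append(key)
                key = line.split()[1]
            elif key is not None and line.startswith(METHODS):
                with_url[key] = line.split(None, 1)[1]
        if key is not None and key not in with_url:
            key_only.append(key)
        return with_url, key_only

    if __name__ == "__main__":
        urls, keys = split_objects()
        print("%d entries with URLs, %d with only a hash key"
              % (len(urls), len(keys)))

Anything that ends up in key_only is what I can't turn back into a
URL, which is the whole problem.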
This approach might be salvageable, for example, if there were a way
[which I haven't found] to retrieve a document by cache-key instead of
filename -- the cache knows what URL the cache-key maps to once it
opens the file, after all.
Any thoughts? Suggestions for other approaches? Of course I have a
preference for using built-in features, since it is easier to get
administrative cooperation for a config file change than for
installing new programs.
> You can't. Not even Squid knows this. Squid only knows MD5 hashes of
> all URLs. MD5 is used to conserve memory and speed up lookups. (An
> MD5 is 16 bytes, while a URL is anything from 9 bytes to several KB.)
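For scale, the fixed-size digest versus a variable-size URL is easy to
see with a couple of lines of Python (note that squid's real cache key
is, as far as I know, an MD5 over the request method plus the URL, not
the URL alone):

    import hashlib

    for url in ("http://a.b/", "http://example.com/" + "x" * 2000):
        key = hashlib.md5(url.encode()).digest()
        print("%4d byte URL -> %d byte key %s" % (len(url), len(key), key.hex()))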