--MimeMultipartBoundary
Content-Type: TEXT/PLAIN; charset=US-ASCII
On Wed, 20 Aug 1997, Duane Wessels wrote:
>I haven't made up my mind yet if its okay to lose URLs.
>
>Alex Rousskov has an interesting table on hash sizes versus collisions:
>
>http://www.cs.ndsu.nodak.edu/~rousskov/research/papers/static.proxy/
Agreed, MD5 or SHA (SHA is the default/preferred algorithm in the Linux
kernel, but I'm not sure if that's because of cryptographic security,
patent restrictions, performance or whatever... and we'd be more worried
about a combination of performance and collision probability) would be
"good enough". I don't think it would be acceptable if we had to
double-check the URL on disk for every ICP hit (an open() and read()
before even returning the hit, then probably close the file and re-open()
later), but MD5/SHA should hash URLs well enough that we can ignore the
possibility of collision (until someone has a cache with more URLs in it
than there currently are in the world, by many factors of 10).
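
To put rough numbers on "good enough", here is a minimal sketch of the usual birthday-bound estimate for accidental collisions. It assumes the hash output is uniformly distributed; the function names and the example URL counts are illustrative only and do not come from squid.

```python
import hashlib
import math


def collision_probability(n_urls: int, hash_bits: int) -> float:
    """Birthday-bound approximation: p ~= 1 - exp(-n(n-1) / 2^(bits+1)).

    Assumes hash values are uniformly distributed (accidental collisions only).
    """
    exponent = -n_urls * (n_urls - 1) / 2.0 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


def url_key(url: str) -> bytes:
    """128-bit (16-byte) MD5 digest of the URL, usable as a cache key."""
    return hashlib.md5(url.encode("ascii")).digest()


if __name__ == "__main__":
    # Hypothetical cache sizes, chosen only to show the scale of the numbers.
    for n in (10**6, 10**9, 10**12):
        print(n, collision_probability(n, 128))
```

Under that uniformity assumption, even a billion URLs hashed to 128 bits gives a collision probability far below anything a cache would notice, whereas a 32-bit hash is already risky at tens of thousands of URLs.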
This leaves the question of what to do with the URL. Can you just throw
it away? Well... it would certainly be nice to have a fixed-structure,
fixed-record-length "log" file. One (obvious?) problem I can think of
though is the removal of old items from the cache, unless cache purging is
to be done purely on an LRU or similar basis (hmmm, decline page usefulness
every X hours by some constant (8? 50?), increase it by 1 every hit... or
some non-linear function? the way the page/buffer cache in Linux
works...). Then, when a new request for the URL is received, it can be
decided whether the object is out of date, since you now have the real URL.
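
The "usefulness" idea above can be sketched roughly as follows. This is only one reading of it: the decay constant, the aging interval, and all names here are placeholders, and entries are keyed by the URL hash so no URL is needed to purge.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    usefulness: int = 0  # hypothetical per-object score


def on_hit(entry: Entry) -> None:
    """Every hit increases the score by 1."""
    entry.usefulness += 1


def age(entries: dict, decay: int = 8) -> None:
    """Called every X hours: lower every score by a constant, not below 0."""
    for entry in entries.values():
        entry.usefulness = max(0, entry.usefulness - decay)


def pick_victims(entries: dict, count: int) -> list:
    """Return the keys of the `count` least useful objects (purge candidates)."""
    return sorted(entries, key=lambda k: entries[k].usefulness)[:count]


if __name__ == "__main__":
    cache = {b"key-a": Entry(), b"key-b": Entry()}
    for _ in range(10):
        on_hit(cache[b"key-a"])
    age(cache, decay=8)
    print(pick_victims(cache, 1))
```

A non-linear variant would only change the body of age(); the point is that purging works from the fixed-size record alone.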
The idea of storing extra metadata in the objects in the cache
(putting the URL at the beginning) is interesting. Allowing for a
rebuild from the data, although slow, after re-arranging the cache or
whatever would be nice, but if we check the on-disk URL all the time then
it's not so nice because of the performance of ICP queries (or maybe just
give a false "yes" and then return a TCP denied, and fix up squid so that
it deals with TCP denied by retrying the request from a different
peer/parent?). Basically, I like the idea of keeping the URL on disk
unless it's actually used all the time (ie, "it's there, why not use it,
MD5/SHA _could_ be wrong you know" is not the right attitude for ICP
queries).
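
A rough sketch of that split, assuming ICP queries are answered from the in-memory hash alone and the on-disk URL is compared only when the object is actually fetched. The Cache class and its method names are hypothetical and are not squid's API; a dict stands in for the swap files.

```python
import hashlib


def url_key(url: str) -> bytes:
    return hashlib.md5(url.encode("ascii")).digest()


class Cache:
    def __init__(self):
        self.index = set()  # in-memory keys only, no URLs
        self.disk = {}      # key -> (stored_url, body); stands in for swap files

    def store(self, url: str, body: bytes) -> None:
        key = url_key(url)
        self.index.add(key)
        self.disk[key] = (url, body)

    def icp_query(self, url: str) -> bool:
        """Cheap path: no disk access, so a hash collision gives a false "yes"."""
        return url_key(url) in self.index

    def tcp_fetch(self, url: str):
        """Slow path: compare the stored URL; None means "denied, try another peer"."""
        entry = self.disk.get(url_key(url))
        if entry is None or entry[0] != url:
            return None
        return entry[1]
```

The requesting cache would then treat a denied fetch as a miss and retry from a different peer or parent, which is the behaviour change suggested above.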
My opinion (and it's just that)...
* the "log" file could simply be a raw binary, fixed-record-length data
file. 3x32 bits + 1x64 bits or whatever. it would certainly be a LOT
faster to read and a LOT smaller. running MD5 or SHA over a small string
is quite fast.
* the URLs could be kept on disk as the first line of the cache swap file.
it would nuke all existing caches, but this would only happen in the
upgrade to 1.2 for most users, and if they are patient enough they
*could* run a script... this would then mean that people wouldn't lose
their cache if they lost the "log"; maybe the format of the first line of
a swap file should be the format of a full line of the current "log" file.
it wouldn't increase real disk usage or I/O in most cases, since real
hardware typically talks in blocks of 512 bytes... 4k or more.
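
The two points above can be sketched together. The message does not say what the 3x32-bit and 1x64-bit fields hold, or whether the hash is part of the record, so the layout below (16-byte MD5 key, then timestamp, expires, file number, size) is purely an assumption, as are all function names. For brevity the swap file's first line holds only the URL, not a full "log" line.

```python
import hashlib
import io
import struct

# Assumed layout: 16-byte key, three uint32 fields, one uint64 field.
# "<" means little-endian with no padding, so every record is 36 bytes.
RECORD = struct.Struct("<16sIIIQ")


def url_key(url: str) -> bytes:
    return hashlib.md5(url.encode("ascii")).digest()


def pack_record(url, timestamp, expires, file_number, size) -> bytes:
    return RECORD.pack(url_key(url), timestamp, expires, file_number, size)


def read_log(log: bytes) -> list:
    """Fixed-length records: no parsing, just step through at RECORD.size."""
    return [RECORD.unpack_from(log, off) for off in range(0, len(log), RECORD.size)]


def write_swap_file(f, url: str, body: bytes) -> None:
    """First line of the swap file is the URL, followed by the object itself."""
    f.write(url.encode("ascii") + b"\n")
    f.write(body)


def read_swap_url(f) -> str:
    f.seek(0)
    return f.readline().rstrip(b"\n").decode("ascii")


def rebuild_log(swap_files: dict) -> bytes:
    """Rebuild a lost log from {file_number: open binary file}.

    Timestamps come back as 0 because this sketch stores only the URL.
    """
    records = []
    for number, f in sorted(swap_files.items()):
        url = read_swap_url(f)
        size = len(f.read())  # object bytes after the URL line
        records.append(pack_record(url, 0, 0, number, size))
    return b"".join(records)


if __name__ == "__main__":
    swap = io.BytesIO()
    write_swap_file(swap, "http://example.com/", b"hello")
    print(read_log(rebuild_log({7: swap})))
```

The zeroed timestamps in the rebuilt records show why storing a full "log" line at the top of each swap file, rather than just the URL, would make the rebuild lossless.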
David.
--MimeMultipartBoundary--
Received on Tue Jul 29 2003 - 13:15:42 MDT