>Given that many URLs share a common prefix (i.e. 'http://' if nothing
>else), wouldn't it make sense to compress the URL before saving
>it? Even REALLY simple dictionary compression would win fairly large
>savings, i.e. statically calculate the 254 most common prefixes, and use
>the first byte of the key to index into said table. (254 cos you want
>one byte for a '' prefix, and you can't use 0 if you want to keep
>using string operations.) Given that you'll get a 6-byte win just from
>'http://', it doesn't seem like a bad idea... Particularly given that
>strcmp() et al still work just fine, it shouldn't even require many
>code changes (only input and display of URLs would need to change).
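In code, the quoted scheme would be roughly the following (just a sketch;
the prefix table contents and function names are made up for illustration,
not anything that exists in Squid):

/*
 * Sketch of the quoted scheme: the first byte of the stored key is a
 * 1-based index into a static prefix table (so byte 0 is never used and
 * the key stays a normal NUL-terminated string).  The table here is
 * illustrative only; the real one would hold the 254 most common
 * prefixes measured from cache logs.
 */
#include <stdio.h>
#include <string.h>

static const char *prefixes[] = {
    "",             /* index 1: empty prefix, for URLs that match nothing */
    "http://",      /* index 2 */
    "http://www.",  /* index 3 */
    "ftp://",       /* index 4 */
};
#define NPREFIX (sizeof(prefixes) / sizeof(prefixes[0]))

/* Encode: pick the longest matching prefix, emit its index byte + tail. */
static void encode_url(const char *url, char *out, size_t outlen)
{
    size_t best = 0, bestlen = 0, i;
    for (i = 0; i < NPREFIX; i++) {
        size_t len = strlen(prefixes[i]);
        if (len > bestlen && strncmp(url, prefixes[i], len) == 0) {
            best = i;
            bestlen = len;
        }
    }
    snprintf(out, outlen, "%c%s", (int)(best + 1), url + bestlen);
}

/* Decode: the first byte selects the prefix, the rest is copied as-is. */
static void decode_url(const char *key, char *out, size_t outlen)
{
    snprintf(out, outlen, "%s%s",
             prefixes[(unsigned char)key[0] - 1], key + 1);
}

int main(void)
{
    char key[256], url[256];
    encode_url("http://www.ibm.com/index.html", key, sizeof(key));
    decode_url(key, url, sizeof(url));
    printf("%zu -> %zu bytes, round trip: %s\n",
           strlen(url), strlen(key), url);
    return 0;
}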
Well, I'd think that something slightly better than this but still "cheap"
on CPU might be the best choice. I'm not 100% sure that what I propose
below would be a major win or even significantly better than the above,
but compression of at least
1) the prefix
2) the suffix
3) the hostname suffix
4) possibly the hostname prefix, though that could/should/may be part of a
general prefix string in (1)
would be nice.
I.e., I'd like to compress
http://www.ibm.com/something.html
to
1ibm2something3
where 1, 2, and 3 are single characters.
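Something like this sketch, using the literal token bytes '1', '2' and '3'
from the example (a real version would use reserved byte values and bigger
tables; the three strings and all names here are made up):

/*
 * Sketch of the prefix / hostname-suffix / path-suffix idea, with one
 * fixed string per token for the sake of the example.
 */
#include <stdio.h>
#include <string.h>

static void compress_url(const char *url, char *out, size_t outlen)
{
    const char *prefix  = "http://www.";   /* token '1' */
    const char *hostsuf = ".com/";         /* token '2' */
    const char *pathsuf = ".html";         /* token '3' */
    const char *p = url;
    size_t n = 0;

    /* the leading "http://www." collapses to the single byte '1' */
    if (n + 1 < outlen && strncmp(p, prefix, strlen(prefix)) == 0) {
        out[n++] = '1';
        p += strlen(prefix);
    }
    /* copy the rest, substituting the hostname and path suffixes */
    while (*p && n + 1 < outlen) {
        if (strncmp(p, hostsuf, strlen(hostsuf)) == 0) {
            out[n++] = '2';
            p += strlen(hostsuf);
        } else if (strcmp(p, pathsuf) == 0) {   /* suffix: only at the end */
            out[n++] = '3';
            p += strlen(pathsuf);
        } else {
            out[n++] = *p++;
        }
    }
    out[n] = '\0';
}

int main(void)
{
    char out[512];
    compress_url("http://www.ibm.com/something.html", out, sizeof(out));
    printf("%s\n", out);    /* prints: 1ibm2something3 */
    return 0;
}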
URLs are 7-bit ASCII; that gives 128 (more, actually) spare byte values to
use as codes for the most common substrings without having to use a prefix
code or similar (sketched in code after the list), e.g.:
http://www.
http://
https://
ftp://ftp.
ftp://
.com
.org
.htm
.html
.gif
.jpg
www.
ftp.
and so on. Someone could do the (possibly expensive) task of working out
the optimal string set from a group of a few million objects that they have
in their cache, and the result could be put in as a static define.
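A rough sketch of what the encode/decode side could look like, using bytes
0x80 and up as the codes (the table below is only a guess at a plausible
set, not measured from any cache, and none of the names exist in Squid):

/*
 * Sketch of the high-bit substring table: URLs are 7-bit ASCII, so the
 * byte values 0x80..0xFF are free to act as single-byte codes for common
 * substrings, and no byte is ever 0, so strcmp() keeps working.
 */
#include <stdio.h>
#include <string.h>

static const char *substr_table[] = {
    "http://www.", "http://", "https://", "ftp://ftp.", "ftp://",
    ".com", ".org", ".html", ".htm", ".gif", ".jpg", "www.", "ftp.",
};
#define NSUBSTR (sizeof(substr_table) / sizeof(substr_table[0]))

/* Greedy encoder: at each position, emit the code byte (0x80 + index)
 * of the longest table entry that matches, else copy the character. */
static void url_compress(const char *url, char *out, size_t outlen)
{
    size_t n = 0;
    while (*url && n + 1 < outlen) {
        size_t best = NSUBSTR, bestlen = 0, i;
        for (i = 0; i < NSUBSTR; i++) {
            size_t len = strlen(substr_table[i]);
            if (len > bestlen && strncmp(url, substr_table[i], len) == 0) {
                best = i;
                bestlen = len;
            }
        }
        if (best < NSUBSTR) {
            out[n++] = (char)(0x80 + best);
            url += bestlen;
        } else {
            out[n++] = *url++;
        }
    }
    out[n] = '\0';
}

/* Decoder: bytes >= 0x80 expand back to their table entry. */
static void url_expand(const char *key, char *out, size_t outlen)
{
    size_t n = 0;
    while (*key && n + 1 < outlen) {
        unsigned char c = (unsigned char)*key++;
        if (c >= 0x80 && (size_t)(c - 0x80) < NSUBSTR) {
            size_t len = strlen(substr_table[c - 0x80]);
            if (n + len + 1 > outlen)
                break;                  /* not enough room left */
            memcpy(out + n, substr_table[c - 0x80], len);
            n += len;
        } else {
            out[n++] = (char)c;
        }
    }
    out[n] = '\0';
}

int main(void)
{
    char key[512], back[512];
    url_compress("http://www.ibm.com/something.html", key, sizeof(key));
    url_expand(key, back, sizeof(back));
    printf("%zu -> %zu bytes, round trip: %s\n",
           strlen(back), strlen(key), back);
    return 0;
}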
For hostnames, extra substrings could be allocated to the characters
[A-Z_...], since hostnames are lowercased and various characters aren't
valid. (I'd particularly like to make _ not work through all caches instead
of just through some caches [dependent on resolver libraries], so people get
a more consistent view of its undesirability :). [I.e., www. and ftp. and
so on could be included as hostname-only substrings in this extra set.]
An efficient string-matching algorithm is essential for this kind of
compression, of course, but it should be achievable without too great a rise
in CPU usage. CPU power is much cheaper than I/O or memory bandwidth most
of the time anyway, and various systems have relatively low limits on
physical memory configurations. Also, if one were sufficiently optimistic,
Squid caches could exchange their version numbers and, if compatible,
communicate with compressed URL strings.
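One cheap way to keep the matching fast might be to bucket the substring
table by first character, so only a handful of candidates are tried at each
input position. Again just a sketch with the same made-up table, not
existing Squid code:

/*
 * Bucket the table by first byte; at each input position only the few
 * candidates starting with that byte are tried.
 */
#include <stdio.h>
#include <string.h>

static const char *substr_table[] = {
    "http://www.", "http://", "https://", "ftp://ftp.", "ftp://",
    ".com", ".org", ".html", ".htm", ".gif", ".jpg", "www.", "ftp.",
};
#define NSUBSTR (sizeof(substr_table) / sizeof(substr_table[0]))
#define MAXPERCHAR 8

/* bucket[c] holds the indices of all table entries starting with byte c */
static int bucket[256][MAXPERCHAR];
static int bucket_len[256];

static void build_buckets(void)
{
    size_t i;
    for (i = 0; i < NSUBSTR; i++) {
        unsigned char c = (unsigned char)substr_table[i][0];
        if (bucket_len[c] < MAXPERCHAR)
            bucket[c][bucket_len[c]++] = (int)i;
    }
}

/* Return the index of the longest table entry matching at p, or -1;
 * only the bucket for the first byte of p is scanned. */
static int longest_match(const char *p)
{
    unsigned char c = (unsigned char)*p;
    int best = -1, j;
    size_t bestlen = 0;
    for (j = 0; j < bucket_len[c]; j++) {
        const char *s = substr_table[bucket[c][j]];
        size_t len = strlen(s);
        if (len > bestlen && strncmp(p, s, len) == 0) {
            best = bucket[c][j];
            bestlen = len;
        }
    }
    return best;
}

int main(void)
{
    int m;
    build_buckets();
    m = longest_match("http://www.ibm.com/something.html");
    printf("longest match at start: %s\n",
           m >= 0 ? substr_table[m] : "(none)");
    return 0;
}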
Someone with more knowledge of compression might be able to say whether this
would be a good, cheap method for compressing short strings. Maybe
something trivial like Huffman coding 4-byte blocks of the URLs (and
avoiding the code "0", since a zero byte would break strcmp()) would be a
win, I don't know...
David.