> Gzip would want to compress URLs en masse, and would use a (relatively)
> huge amount of CPU time. This is a scheme for compressing URLs
> one-by-one, keeping them as text, and not using excessive CPU time.
> Bit-shuffling, if done some way that still prevents a null occurring in
> the stream, could be a good way to reduce the URL string memory impact
> further though. I'll make sure my "compression" results in something
> permissible by RFC 2068, so a bit-shuffling technique could then be
> applied over the top, and I'll look at doing this.
How about using something similar to the way BIND compresses the
domain names in its replies? This would be VERY fast and still save a fair
bit. Basically it's a dictionary-based scheme built from what has already been
seen. You'd need to do some management to make sure that references are
kept appropriately when URLs get deleted. Basically you'd want to build up
an index of URLs separate from the entries stored in the hash table, and then
you'd index into the URL dictionary on an "index number and length" scheme.
With a bit of tricky management you could build a good set of dictionary
URLs that would save heaps.
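
Something like the following rough sketch is what I have in mind - purely
illustrative, the names are mine and not anything in squid, the dictionary
is unbounded and deletion isn't handled. BIND's trick shares name suffixes;
URLs mostly share prefixes, so the back-reference here points at a common
prefix instead:

/*
 * Illustrative sketch only.  Each stored URL becomes
 * (dictionary index, shared prefix length, leftover suffix).
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define DICT_MAX 1024

static char *dict[DICT_MAX];            /* previously seen full URLs */
static int dict_count = 0;

typedef struct {
    int dict_index;                     /* -1 if nothing shared */
    int prefix_len;                     /* bytes taken from dict entry */
    char *suffix;                       /* part not covered by prefix */
} compressed_url;

static int common_prefix(const char *a, const char *b)
{
    int n = 0;
    while (a[n] && b[n] && a[n] == b[n])
        n++;
    return n;
}

compressed_url compress_url(const char *url)
{
    compressed_url c = { -1, 0, NULL };
    int i, best = 0;

    for (i = 0; i < dict_count; i++) {  /* longest shared prefix wins */
        int n = common_prefix(dict[i], url);
        if (n > best) {
            best = n;
            c.dict_index = i;
        }
    }
    c.prefix_len = best;
    c.suffix = strdup(url + best);
    if (dict_count < DICT_MAX)          /* remember it for later URLs */
        dict[dict_count++] = strdup(url);
    return c;
}

void expand_url(const compressed_url *c, char *out, int outlen)
{
    if (c->dict_index >= 0)
        snprintf(out, outlen, "%.*s%s", c->prefix_len,
                 dict[c->dict_index], c->suffix);
    else
        snprintf(out, outlen, "%s", c->suffix);
}

The dict would obviously need to be a managed, bounded set of "good"
dictionary URLs rather than every URL ever seen, but the encode/decode
shape is the above.
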
On other thoughts, you could implement a sort of tree in which the
'/'-separated components of a URL are stored. The tree is traversed by
finding the node at the first '/' separation level, then moving on to the
next '/'-separated node. This could also be done patricia-tree style,
whereby you only create the splits when necessary. All you then need to
store is the index of the leaf node of the tree (and you add the leaf node
if it doesn't exist). Then provide a routine which builds the URL from the
leaf node index. Add a little garbage collection management and you'd have
it done. This could be written as a fairly straightforward URL compression
module. The interface would be:
int index_url(char *url);

index_url takes the URL and adds it to the internal tree (if it doesn't
already exist). It returns the unique index for the URL.

int get_url(int index, char *buffer, int buflen);

Retrieves the URL into the user-provided buffer. Returns 0 if no URL is
found for the provided index, and 1 if it is found.

int delete_url(int index);

Removes a reference to a URL. This will result in appropriate tree space
compression (ie. re-joining a split where two distinct URLs shared a single
URL path, once one of them gets deleted), and frees the space allocated for
the URL. Returns 0 on index not found, 1 if deletion occurred successfully.
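
As a rough sketch of how that module might hang together - assuming a plain
component tree rather than a full patricia tree, and only trivial garbage
collection; all names and limits are illustrative:

/*
 * Sketch only: plain component tree keyed on '/'-separated parts,
 * with an index table mapping URL index -> leaf node.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_URLS   65536
#define MAX_DEPTH  256          /* assumes URLs have < MAX_DEPTH components */

typedef struct node {
    char *part;                 /* one '/'-separated component */
    struct node *parent;
    struct node *child;         /* first child */
    struct node *sibling;       /* next child of the same parent */
    int index;                  /* URL index if this node ends a URL, else -1 */
} node;

static node root = { "", NULL, NULL, NULL, -1 };
static node *by_index[MAX_URLS];
static int next_index = 0;

static node *find_or_add_child(node *parent, const char *part)
{
    node *n;
    for (n = parent->child; n; n = n->sibling)
        if (strcmp(n->part, part) == 0)
            return n;
    n = calloc(1, sizeof(*n));
    n->part = strdup(part);
    n->parent = parent;
    n->sibling = parent->child;
    n->index = -1;
    parent->child = n;
    return n;
}

int index_url(char *url)
{
    char *copy = strdup(url), *p = copy, *slash;
    node *cur = &root;

    for (;;) {                  /* split on '/', keeping empty parts so
                                   "http://host/" round-trips exactly */
        slash = strchr(p, '/');
        if (slash)
            *slash = '\0';
        cur = find_or_add_child(cur, p);
        if (!slash)
            break;
        p = slash + 1;
    }
    free(copy);

    if (cur->index < 0) {       /* first time we've seen this URL */
        if (next_index >= MAX_URLS)
            return -1;
        cur->index = next_index;
        by_index[next_index++] = cur;
    }
    return cur->index;
}

int get_url(int index, char *buffer, int buflen)
{
    node *chain[MAX_DEPTH], *n;
    int depth = 0, pos = 0;

    if (index < 0 || index >= next_index || !by_index[index])
        return 0;
    for (n = by_index[index]; n->parent; n = n->parent)
        chain[depth++] = n;     /* collect components leaf -> root */
    while (depth-- > 0 && pos < buflen - 1)
        pos += snprintf(buffer + pos, buflen - pos, "%s%s",
                        chain[depth]->part, depth ? "/" : "");
    return 1;
}

int delete_url(int index)
{
    if (index < 0 || index >= next_index || !by_index[index])
        return 0;
    by_index[index]->index = -1;
    by_index[index] = NULL;     /* a real version would also prune and
                                   re-merge now-unused tree nodes here */
    return 1;
}

A real version would prune and re-merge nodes on deletion as described
above, keep a free list of indices, and probably hash each node's children
rather than walking the sibling list, but that's the shape of it.
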
This module would be about one week's worth of programming. I don't know
how much saving it would achieve, but that of course needs to be explored.
Whaddya think? If I haven't missed the point I'll write it up (once
I complete the direct "squid as a caching named" mods I'm in the middle of).
Stew.
--
Stewart Forster (Traffic and Web Proxy Cache Administrator)
connect.com.au pty ltd, Level 9, 114 Albert Rd, Sth Melbourne, VIC 3205, Aust.
Email: slf@connect.com.au  Phone: +61 3 9251-3684  Fax: +61 3 9251-3666