Re: Improved hash function from Henrik Nordstrom on 2004-10-24 (squid-dev)

From: Henrik Nordstrom <hno@dont-contact.us>
Date: Sun, 24 Oct 2004 17:46:08 +0200 (CEST)

On Fri, 22 Oct 2004, Khovayko, Oleg (NIH/NLM/NCBI) wrote:

> Hovewer, I make my attention to your product, and decided to test it in
> the my "torture room". After source analysing, I've found weakness of
> hash functions, whose are you use in the project.

fortunately these hashes are used on mostly sane data, but you are
correct, these hash functions (hash_string and hash4) could be replaced by
something much better.

If we are to replace these I would propose we replace them with the djb2
hash function which is both extremely simple and rather well proven for
hashing strings. Both hash_string and hash4 is only used in string
context.

djb2

    this algorithm (k=33) was first reported by dan bernstein many
    years ago in comp.lang.c. another version of this algorithm (now
    favored by bernstein) uses xor: hash(i) = hash(i - 1) * 33 ^ str[i];
    the magic of number 33 (why it works better than many other constants,
    prime or not) has never been adequately explained.

     unsigned long
     hash(unsigned char *str)
     {
         unsigned long hash = 5381;
         int c;

while (c = *str++)
hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

return hash;
}

> My function uses its own binary code (and some bytes follow) as
> randomization table.

I don't like this property as it assumes both that the code segment is
readable and that the hash function is in the middle of a much larger code
section.

> Hence - hash values for same strings can be changed after program rebuild.

This is not a problem as these hashes are internal in memory only.

Regards
Henrik
Received on Sun Oct 24 2004 - 09:46:12 MDT

This archive was generated by hypermail pre-2.1.9 : Sun Oct 31 2004 - 12:00:02 MST