On Fri, 22 Oct 2004, Khovayko, Oleg (NIH/NLM/NCBI) wrote:
> Hovewer, I make my attention to your product, and decided to test it in
> the my "torture room". After source analysing, I've found weakness of
> hash functions, whose are you use in the project.
fortunately these hashes are used on mostly sane data, but you are
correct, these hash functions (hash_string and hash4) could be replaced by
something much better.
If we are to replace these I would propose we replace them with the djb2
hash function which is both extremely simple and rather well proven for
hashing strings. Both hash_string and hash4 is only used in string
context.
djb2
this algorithm (k=33) was first reported by dan bernstein many
years ago in comp.lang.c. another version of this algorithm (now
favored by bernstein) uses xor: hash(i) = hash(i - 1) * 33 ^ str[i];
the magic of number 33 (why it works better than many other constants,
prime or not) has never been adequately explained.
unsigned long
hash(unsigned char *str)
{
unsigned long hash = 5381;
int c;
while (c = *str++)
hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
return hash;
}
> My function uses its own binary code (and some bytes follow) as
> randomization table.
I don't like this property as it assumes both that the code segment is
readable and that the hash function is in the middle of a much larger code
section.
> Hence - hash values for same strings can be changed after program rebuild.
This is not a problem as these hashes are internal in memory only.
Regards
Henrik
Received on Sun Oct 24 2004 - 09:46:12 MDT
This archive was generated by hypermail pre-2.1.9 : Sun Oct 31 2004 - 12:00:02 MST