On Sun, 11 Jul 1999, Henrik Nordstrom wrote:
> Information needed to decide on a compression algorithm is:
> * Percentage of digest bits differing between two digests. Should be
> roughly linear in traffic amount / cache size, but I need some input on
> what ranges to expect.
The best I can do is to give you the stats from our caches. Here are the
percentages of changed _bytes_ between two consecutive digests from ten
caches: (4.95%, 3.99%, 2.83%, 3.08%, 5.49%, 4.18%, 3.41%, 1.65%, 7.24%, 4.45%).
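In case it helps, here is a minimal sketch of how such a number can be
collected, assuming the two consecutive digests are saved to equally-sized
files; the file handling and names are illustrative, not Squid code:

    /*
     * Hypothetical sketch: percentage of bytes that differ between two
     * digest snapshots of the same size.
     */
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        FILE *a, *b;
        int ca, cb;
        long total = 0, changed = 0;

        if (argc != 3) {
            fprintf(stderr, "usage: %s old_digest new_digest\n", argv[0]);
            return 1;
        }
        if (!(a = fopen(argv[1], "rb")) || !(b = fopen(argv[2], "rb"))) {
            perror("fopen");
            return 1;
        }
        while ((ca = getc(a)) != EOF && (cb = getc(b)) != EOF) {
            total++;
            if (ca != cb)
                changed++;
        }
        printf("%ld of %ld bytes changed (%.2f%%)\n",
               changed, total, total ? 100.0 * changed / total : 0.0);
        return 0;
    }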
I do not have bit stats, but here is a "typical" distribution of the number
of changed bits in a _changed_ byte (1 means 1 bit has changed, 7 means 7
bits have changed); a sketch for collecting this histogram follows the table:
#bits   count  percent
  1     12605   29.40%
  2     10201   23.79%
  3     11856   27.65%
  4      5903   13.77%
  5      1916    4.47%
  6       352    0.82%
  7        43    0.10%
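As promised, here is a minimal sketch of how such a histogram could be
collected, again assuming two equally-sized digest files; popcount8() and
the file handling are illustrative, not Squid code:

    /*
     * Hypothetical sketch: for every changed byte, count how many bits
     * differ, i.e. the population count of (old XOR new).
     */
    #include <stdio.h>

    static int popcount8(unsigned x)
    {
        int n = 0;
        for (; x; x >>= 1)
            n += x & 1;
        return n;
    }

    int main(int argc, char **argv)
    {
        FILE *a, *b;
        int ca, cb, i;
        long hist[9] = { 0 };
        long changed = 0;

        if (argc != 3) {
            fprintf(stderr, "usage: %s old_digest new_digest\n", argv[0]);
            return 1;
        }
        if (!(a = fopen(argv[1], "rb")) || !(b = fopen(argv[2], "rb"))) {
            perror("fopen");
            return 1;
        }
        while ((ca = getc(a)) != EOF && (cb = getc(b)) != EOF)
            if (ca != cb) {
                changed++;
                hist[popcount8((unsigned)(ca ^ cb))]++;
            }
        for (i = 1; i <= 8; i++)
            printf("%d bits: %6ld (%5.2f%%)\n", i, hist[i],
                   changed ? 100.0 * hist[i] / changed : 0.0);
        return 0;
    }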
> * Locality of the differences. My assumption is that the hashing causes
> the differences to have a random distribution with close to no locality.
I do not know a universal measure of locality, but I would not be surprised
if your assumption of a minimum-locality distribution is correct. However,
the fact that about 70% of changed bytes have more than one bit different
probably contradicts the assumption: with fewer than 10% of bytes changed,
independent random bit flips would rarely hit the same byte twice, so nearly
all changed bytes would differ in just one bit. Perhaps my stats are buggy.
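For what it is worth, one crude locality measure would be the gap between
consecutive changed byte positions: with no locality, the mean gap should be
close to total_bytes / changed_bytes. A sketch under the same file-based
assumptions as above:

    /*
     * Hypothetical sketch of one crude locality measure: the distance
     * in bytes between consecutive changed bytes. With no locality the
     * mean gap should be close to total_bytes / changed_bytes.
     */
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        FILE *a, *b;
        int ca, cb;
        long pos = 0, last = -1, changed = 0;
        double gapsum = 0.0;

        if (argc != 3) {
            fprintf(stderr, "usage: %s old_digest new_digest\n", argv[0]);
            return 1;
        }
        if (!(a = fopen(argv[1], "rb")) || !(b = fopen(argv[2], "rb"))) {
            perror("fopen");
            return 1;
        }
        for (; (ca = getc(a)) != EOF && (cb = getc(b)) != EOF; pos++)
            if (ca != cb) {
                if (last >= 0)
                    gapsum += pos - last;
                last = pos;
                changed++;
            }
        if (changed > 1)
            printf("mean gap %.1f bytes, ~%.1f expected with no locality\n",
                   gapsum / (changed - 1), (double)pos / changed);
        return 0;
    }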
Your best bet is probably to fetch a few digests from real caches and
measure what you are interested in yourself. You will have to do that to
test your algorithm anyway...
Alex.
P.S. Technically, an algorithm could use different strategies based on
actual measurements. However, we want to keep it simple. :) If the
resulting diff is larger than the new digest, we just would not serve it.
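In code, that fallback could be as simple as the sketch below; the function
name and the sizes in the example are made up:

    /*
     * Hypothetical helper for the fallback rule above: a diff is worth
     * serving only if it is smaller than the full new digest.
     */
    #include <stdio.h>
    #include <stddef.h>

    static int diff_worth_serving(size_t diff_size, size_t digest_size)
    {
        return diff_size < digest_size;
    }

    int main(void)
    {
        /* e.g. a 70 KB diff against a 500 KB digest is worth serving */
        puts(diff_worth_serving(70 * 1024, 500 * 1024)
             ? "serve diff" : "serve full digest");
        return 0;
    }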