> >> >> Changes to squid-1.2.beta13 (Feb 4, 1998):
> >> >>
> >> >> - #defining MONOTONIC_STORE will support the creation of disk
> >> >> objects clustered into directories. This GREATLY improves disk
> >> >> performance (factor of 3) over old `write-over-old-object'
> >> >> method. If using the MONOTONIC_STORE, the
> >> >> {get/put}_unusedFileno stack stuff is disabled. This is
> >> >> actually a good thing and greatly reduces the risk of serving
> >> >> up bad objects (SLF).
>
> It changes the order in which new swapfiles are placed into
> the directories.
>
> OLD
> cache/00/00/00000000
> cache/01/00/00000001
> cache/02/00/00000002
> cache/03/00/00000003
>
> NEW
> cache/00/00/00000000
> cache/00/00/00000001
> cache/00/00/00000002
> cache/00/00/00000003
I see. That's a good start. I guess it uses a round-robin counter which
wraps at some 2M? Some time ago we had a debate with Arjan about whether
it's better to select _the_ lowest available file_no (my idea) or to
select the _next_ one, wrapping around. Both improve disk perf up to a
point. I would like to bring it up again, as I think it has somewhat
wider effects.
Arjan made a patch for 1.1.10 that implemented this scheme, with an
added feature to fill each dir with a configurable count of files before
going on to the next one. 384 files/dir seems optimal, as the dir block
size becomes 8192 bytes (exactly 1 disk block), thus increasing
bang-for-a-hit (in terms of the OS dir cache). It makes open-for-writes
pretty fast.
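Roughly, the mapping I have in mind looks like this (my own sketch, not
Arjan's actual patch; the constants and the helper name are made up):

/*
 * Sketch of the "fill N files per dir" mapping: consecutive filenos
 * land in the same L2 dir until it holds SWAP_FILES_PER_DIR entries,
 * then we move on to the next dir.
 */
#include <stdio.h>

#define SWAP_FILES_PER_DIR 384      /* assumed fill factor */
#define SWAP_DIRS_L2       256      /* assumed L2 dirs per L1 dir */

static void
swap_file_path(int filn, char *buf, size_t len)
{
    int dir = filn / SWAP_FILES_PER_DIR;    /* which L2 dir, counted globally */
    int l1 = dir / SWAP_DIRS_L2;
    int l2 = dir % SWAP_DIRS_L2;
    snprintf(buf, len, "cache/%02X/%02X/%08X", l1, l2, filn);
}

int
main(void)
{
    char path[64];
    int i;
    /* filenos 0..383 all map into cache/00/00/, 384..767 into cache/00/01/ ... */
    for (i = 0; i < 4; i++) {
        swap_file_path(i * SWAP_FILES_PER_DIR, path, sizeof(path));
        printf("%s\n", path);
    }
    return 0;
}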
My idea was to always select the lowest free file for the next write.
The effect is that squid never uses directories it doesn't need.
If the cache can hold 1M objects and the fill factor is 384, squid needs
1M/384 = roughly 2800 dirs. With 256 L2 dirs, that makes about 10 L1
dirs, almost 2 times fewer. With 8KB per dir, this takes about 22MB of
RAM for the OS to cache _all_ the dir files for efficient file ops like
open().
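A minimal sketch of the "lowest free file" selection over a file_map
bitmap could look like this (the names and sizes are mine, not squid's):

/*
 * Sketch only: scan the file_map bitmap for the lowest clear bit,
 * i.e. the lowest unused file_no, and claim it for the next write.
 */
#include <limits.h>
#include <stddef.h>

#define FILE_MAP_BITS  (1 << 20)                 /* room for ~1M objects */
#define BITS_PER_WORD  (sizeof(unsigned long) * CHAR_BIT)
#define FILE_MAP_WORDS (FILE_MAP_BITS / BITS_PER_WORD)

static unsigned long file_map[FILE_MAP_WORDS];   /* 1 bit per file_no */

/* return the lowest free file_no and mark it used, or -1 if the map is full */
static long
file_map_allocate_lowest(void)
{
    size_t w;
    size_t bit;
    for (w = 0; w < FILE_MAP_WORDS; w++) {
        if (file_map[w] == ~0UL)
            continue;                            /* word fully used, skip it */
        for (bit = 0; bit < BITS_PER_WORD; bit++) {
            if (!(file_map[w] & (1UL << bit))) {
                file_map[w] |= 1UL << bit;       /* claim it */
                return (long)(w * BITS_PER_WORD + bit);
            }
        }
    }
    return -1;
}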
Well, while the cache is filling, there is almost no difference between
these two methods. The next open-for-write is very probably fast because
the "current" dir is "warm" in the OS cache.
Problems appear after a few "wraps", when the cache is full, as the
file_map becomes more and more fragmented due to differing object
lifetimes. Thus "clustering" becomes less and less efficient and objects
get spread across more and more directories. Both methods start to fill
in the holes, and eventually writes land in pretty random directories.
Disk thrashing returns. The same is true for my suggestion too, with a
few differences: the active working-directory set is reduced to the
minimum possible, pushing the dir cache hit rate towards its theoretical
maximum (err, on vast average I guess), and writes stay sequential as
long as no holes appear due to object releases. Does squid ever release
an object without "touching" it on disk? If it usually does touch it,
then it "warms up" the dir cache for the next write anyway, so using the
unused_files cache is good and filling the hole is not a "penalty".
Well, with a very fragmented disk store, Arjan's method is perhaps
slightly more efficient on writes, especially if normal garbage
collection keeps in sync with new incoming objects and "clears the road"
for the file_map counter. It is slightly less efficient on reads, though.
I'm running squid with these patches for quite some time now. The cache
is very full, and "emergency LRU" is not so rare. What can I say about
performance? Compared to default squid it's a tremendous speedup, as
Arjan has pointed out on his web page. The daily (10-20h) average time
for open(RO) is 6.4 msec, open(WR) is 12 msec.
I've added some measurements for unlink() calls into squid, and found
that unlink() calls take 40 msec on average (with some peaks at 100 msec).
So, unlinks are really very expensive.
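The measurement itself is nothing fancy, just a gettimeofday() pair
around the call, roughly like this (my own illustration, not the exact
patch):

/* time a single unlink() and log the elapsed wall-clock time */
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

static int
timed_unlink(const char *path)
{
    struct timeval start, stop;
    double msec;
    int rc;
    gettimeofday(&start, NULL);
    rc = unlink(path);
    gettimeofday(&stop, NULL);
    msec = (stop.tv_sec - start.tv_sec) * 1000.0 +
           (stop.tv_usec - start.tv_usec) / 1000.0;
    fprintf(stderr, "unlink(%s) took %.2f msec\n", path, msec);
    return rc;
}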
Now, compared to Arjan's results this seems not quite so good, but I'm
not sure whether Arjan's cache was full and into LRU problems by the
time he took his samples. Perhaps he can clarify. When our cache was
not full, the times for RO/WR were pretty close, at 4 msec for both.
(Our cache size is 14GB (1M objects) doing 50K TCP reqs/h, NOVM.18 train.)
I guess that one of the reasons our open(WR) times are about 2x slower
than open(RO) is that they are truncating already-existing files from
the pool of unused_files instead of waiting for an unlink to happen.
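That is, reusing a file_no presumably boils down to an open() with
O_TRUNC on the old swapfile instead of an unlink() plus create,
something like (again just a guess at the code path, not the actual
code):

/* reuse an old swapfile: the O_TRUNC throws away the previous object */
#include <fcntl.h>

static int
swap_open_for_write(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
}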
Alright, that's about all on performance.
Now I'd like to speculate about some possible features that arise if we
use a tightly clustered store space (that is, always fill the
smallest-numbered files first):
1) of course, we can skip precreation of directories and do it on demand.
2) we can dismiss the L1/L2 configuration and calculate the optimal
structure based on the filesystem's block size only (as that usually
changes only via newfs, we are quite safe from otherwise lethal L1/L2
reconfigurations).
3) basically, as we almost immediately overwrite released files, we can
think of dropping unlinkd altogether.
4) after switching to md5 keys and keeping the URL separately, we have
some nice (IMHO) possibilities:
As a StoreEntry is fixed size, it is a good candidate for being kept in
an array. The index into the array could be the file_no, with md5 keys
stored outside the StoreEntry in a separate db index/hash, giving
md5-key -> file_no mappings only.
/*
BTW, md5 is already a very good hash, so why run another hash algorithm
on it for key searches? Why not use the 10-16 lowest bits of the md5 key
as an index into an array of pointers to linked lists where the
fine-grained compare can be done? Having 1024 pointers takes 4KB, and
the linked lists would be about 1000 items deep each to accommodate 1M
objects. Or, if we take 65536 pointers, occupying 256KB, each linked
list would be about 32 deep for 2M objects. Hash lookups would be
tremendously fast, I think.
*/
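For illustration, the lookup could be as simple as this (hypothetical
structures, the names are mine):

/*
 * Use the low 16 bits of the md5 key as an index into an array of
 * bucket heads; each bucket is a short linked list searched with a
 * full 16-byte compare, giving an md5-key -> file_no mapping.
 */
#include <string.h>

#define MD5_KEY_LEN  16
#define HASH_BUCKETS 65536              /* 2^16 heads, 256KB with 4-byte pointers */

typedef struct _hash_node {
    unsigned char key[MD5_KEY_LEN];     /* full md5 digest */
    int file_no;
    struct _hash_node *next;
} hash_node;

static hash_node *hash_table[HASH_BUCKETS];

static hash_node *
hash_lookup(const unsigned char *key)
{
    /* the low bits of an md5 digest are already well mixed */
    unsigned int bucket = key[0] | (key[1] << 8);
    hash_node *n;
    for (n = hash_table[bucket]; n != NULL; n = n->next) {
        if (memcmp(n->key, key, MD5_KEY_LEN) == 0)
            return n;
    }
    return NULL;
}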
Now, what about creating a fixed-size "swap_log" in each cache_dir, its
size holding the max possible number of store entries this particular
cache_dir can ever hold, and mmap-ing it into memory instead of reading
it in?
If it is an array, we can mmap() it and use it right away; dynamic
structures can be attached via malloc when needed, with their pointers
initialized to NULL during startup. This file never shrinks or grows,
upper entries that are not used are never paged in and thus take no RAM,
and the whole array does not take VM (sort of), because the file itself
is the swap space for the data in it. No need to malloc/free from the
heap, no need to bother writing to disk; the OS syncs dirty pages to
disk in the background. In theory, squid could survive even a totally
sudden crash, as the unwritten store would be flushed independently. Of
course, there is an added risk of corrupting this mmaped memory area,
but such a risk is always present anyway.
A linked list's *next could be just the next file_no instead of a pointer.
I think we could reduce the in-memory data size a lot this way.
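A bare-bones sketch of the mmap idea, assuming some fixed-size on-disk
record per file_no (the record layout here is invented, not the real
StoreEntry):

/*
 * mmap a fixed-size swap_log as an array indexed by file_no.
 * Dirty pages are written back by the OS, not by squid.
 */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

typedef struct {
    unsigned char md5_key[16];
    int next_file_no;           /* linked-list "pointer" as a file_no, -1 = none */
    time_t expires;
    off_t object_size;
} swap_log_entry;

static swap_log_entry *
swap_log_map(const char *path, int max_objects)
{
    size_t len = (size_t)max_objects * sizeof(swap_log_entry);
    void *map;
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t)len) < 0) {    /* fix the size once; it never grows */
        close(fd);
        return NULL;
    }
    map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                              /* the mapping stays valid */
    return map == MAP_FAILED ? NULL : (swap_log_entry *)map;
}

Entries are then indexed by file_no, so a lookup is a plain array access,
and the unused upper entries are simply never faulted in.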
Wild. But is it crazy? ;)
----------------------------------------------------------------------
Andres Kroonmaa mail: andre@online.ee
Network Manager
Organization: MicroLink Online Tel: 6308 909
Tallinn, Sakala 19 Pho: +372 6308 909
Estonia, EE0001 http://www.online.ee Fax: +372 6308 901
----------------------------------------------------------------------