On Thursday 22 November 2001 01.04, Joe Cooper wrote:
> Let me see if I get what you're pointing at by the term 'log structured':
>
> The log stores the meta-data and object location in a fixed position on
> disk (pre-allocated large enough for the full cache_dir).
Not quite.
log structured != fixed position on the drive.
log structured == the storage space is viewed as a virtually infinite
sequential log. Any changes to the filesystem are appended at the head of the
log. Active portions of this virtual log are then mapped to your actual
storage media.
For a normal filesystem this obviously has some drawbacks, since drives are
not infinite in size. The solution is called log cleaning: a background job
processes the log, reorganizing the information to reclaim storage space. The
beauty of a cache is that the information is volatile, allowing us to simply
drop the tail of the log (with some minor exceptions), avoiding the need to
restructure information to reclaim physical space. The tail of the log
contains the oldest objects in the cache, so we can simply reclaim that space
without having to worry about deleting something of great value..
The same cannot be said for a traditional filesystem.. users would most
likely not be very impressed if the filesystem forgot about their oldest
files, but who knows... it could be viewed as a feature B)
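To make the picture concrete, a minimal sketch (illustration only, not Squid
code) of a virtually infinite log mapped onto a finite device, where
reclaiming space is just a matter of advancing the tail; the 8 GB device size
and all names are invented:

/* Illustration only: a virtually infinite log on finite media.
 * Log offsets grow forever; the physical location is the offset modulo the
 * device size, and only the window between tail and head is live on disk.
 * A real store would write in fixed-size segments so objects never wrap
 * around the device edge, and would drop whole objects when advancing the
 * tail; this assumes object size << device size. */
#include <stdint.h>
#include <stdio.h>

#define DEVICE_SIZE (8ULL * 1024 * 1024 * 1024)  /* hypothetical 8 GB cache_dir */

typedef struct {
    uint64_t head;   /* next virtual offset to write (grows forever) */
    uint64_t tail;   /* oldest virtual offset still live */
} vlog;

static uint64_t log_to_physical(uint64_t log_offset)
{
    return log_offset % DEVICE_SIZE;     /* map the active log onto the media */
}

/* Reserve room for an object at the head; if the device would overflow,
 * reclaim space by simply advancing the tail (dropping the oldest data). */
static uint64_t log_append(vlog *log, uint64_t size)
{
    while (log->head + size - log->tail > DEVICE_SIZE)
        log->tail += size;
    uint64_t phys = log_to_physical(log->head);
    log->head += size;
    return phys;
}

int main(void)
{
    vlog log = { 0, 0 };
    printf("object stored at physical offset %llu\n",
           (unsigned long long)log_append(&log, 4096));
    return 0;
}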
So now you may think that this log model, where space is reclaimed from the
tail, reduces the cache policy to FIFO. This is correct. However, there are
some hacks one can do to approach an LRU policy, and since drives are growing
in size faster than in speed, this is becoming less of an issue over time..
Also, if one wants to be really fancy, one can utilize the previously
mentioned idletime to implement almost any removal policy at the cost of some
I/O (still probably less I/O than traditional UNIX filesystems require for
deleting/truncating files when implementing the same policy..)
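One such hack, sketched under my own assumptions (not an existing Squid
policy): when an object gets a hit while sitting close to the tail, rewrite
it at the head of the log, so popular objects keep escaping the region that
is about to be reclaimed, at the cost of one extra buffered write:

/* Sketch: approximate LRU on top of a FIFO log by "salvaging" objects that
 * are hit shortly before they would be reclaimed.  All names are invented. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t log_head = 0;            /* next write position in the log */
static uint64_t log_tail = 0;            /* oldest data still kept (FIFO)  */

typedef struct {
    uint64_t log_offset;                 /* where the object currently lives */
    uint64_t size;
} cache_object;

/* Re-appending costs one (buffered) write and moves the object to the head. */
static uint64_t log_append_object(const cache_object *obj)
{
    uint64_t at = log_head;
    log_head += obj->size;
    return at;
}

/* Called on a cache hit: if the object is about to fall off the tail,
 * rewrite it at the head so the FIFO log behaves more like LRU. */
static bool maybe_salvage(cache_object *obj, uint64_t salvage_threshold)
{
    if (obj->log_offset - log_tail < salvage_threshold) {
        obj->log_offset = log_append_object(obj);
        return true;
    }
    return false;
}

int main(void)
{
    cache_object obj = { .log_offset = 100, .size = 4096 };
    log_head = 1 << 20;
    printf("salvaged: %d\n", maybe_salvage(&obj, 4096));
    return 0;
}

The salvage threshold simply trades extra write I/O for hit ratio, which fits
the point below about drives having more bandwidth and space than
operations/s.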
> Object blocks are pre striped onto the RAW disk (possibly allocated and
> used in bigger than FS block-size stripes), or simply 'imagined' on the
> disk by the Squid log system if we are still sitting atop a traditional
> UFS.
Objects are packed into the log, which is striped out on the disk in suitable
chunks when needed to make room for more objects in the pre-write buffer.
> The writer threads order object writes to fill in expired spots on the
> disk near where the reader threads are going to be acting next.
Such optimizations of block allocation can be applied, provided there is free
disk space nearby. But it is often better to keep the writes more ordered to
avoid excessive fragmentation of the storage media. The penalty is relatively
low, as a single physical write operation contains hundreds of filesystem
operations, if not more..
> Possibly a separate logging thread is notified of the completion of the
> object write when the write thread returns, so that it can be added to
> the log.
There is only a log.. the log is everything.
> If there is currently no 'easy' way to write the object the writer
> thread buffers it, or drops it based on write buffer pressure at the
> time and potentially other object characteristics (perhaps shorter
> expires time, or current popularity while it's in the mem cache). This
> would happen if the reads are saturating the disk bandwidth, or if there
> is no read 'near' the spot we want to put the new object in.
The filesystem manager collects changed information (objects, metadata, ...)
until there is a sufficient amount of data for an efficient write operation,
limited by the size of the pre-write buffer, or until there is a forced
synchronous operation (for which there is absolutely no need in a cache).
During this collection phase objects are also reorganized to make the best
possible use of reference locality for later reads. When there is a need to
write, the collected changes are packed into the log and written out in big
chunks.
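In code, such a pre-write buffer could look roughly like this (a sketch under
my own simplifications: the 1 MB chunk size, plain pwrite() to a file
descriptor, and all names are illustrative, not the actual design):

/* Sketch of a pre-write buffer: objects are collected in memory and the
 * whole chunk is written to the log head with a single large write. */
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define PREWRITE_BUF_SIZE (1 << 20)          /* assumed ~1 MB write chunks */

typedef struct {
    int fd;                                  /* raw partition or large file */
    off_t log_head;                          /* physical offset of the log head */
    size_t used;
    char buf[PREWRITE_BUF_SIZE];
} prewrite_buffer;

/* One big sequential write covering many logical filesystem operations. */
static int prewrite_flush(prewrite_buffer *pw)
{
    if (pw->used == 0)
        return 0;
    if (pwrite(pw->fd, pw->buf, pw->used, pw->log_head) != (ssize_t)pw->used)
        return -1;
    pw->log_head += pw->used;
    pw->used = 0;
    return 0;
}

/* Queue one serialized object (data + metadata).  There is no forced
 * fsync-style flush; a cache can afford to lose its newest objects. */
static int prewrite_add(prewrite_buffer *pw, const void *obj, size_t len)
{
    if (len > PREWRITE_BUF_SIZE)
        return -1;                           /* a real store would stream these */
    if (pw->used + len > PREWRITE_BUF_SIZE && prewrite_flush(pw) < 0)
        return -1;
    memcpy(pw->buf + pw->used, obj, len);
    pw->used += len;
    return 0;
}

int main(void)
{
    static prewrite_buffer pw;               /* static: the buffer is ~1 MB */
    pw.fd = open("/tmp/log-demo", O_CREAT | O_WRONLY | O_TRUNC, 0600);
    const char obj[] = "serialized object + metadata";
    prewrite_add(&pw, obj, sizeof(obj));
    prewrite_flush(&pw);
    close(pw.fd);
    return 0;
}

Note how a single pwrite() here carries many logical operations, which is
where the I/O savings over a traditional UNIX filesystem come from.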
> Am I about right, or are you envisioning a more traditional logged FS
> that logs data alongside metadata into the log area in big chunks, which
> can then be written out to the normal disk area as read movement allows?
> I lean towards the former...but I expect both would be effective.
I envision the more traditional log structured FS but without the main
drawbacks seen when applying log structure to a traditional persistent
filesystem. Please note that in a log structured FS there is only the log.
The log is the data area.
Note: It is perhaps not correct to call a log structured FS traditional.
There have only been a handful of log structured filesystems implemented, and
most are more or less dead today, as the approach does not fit very well with
traditional data storage, and thus perhaps has not received the attention it
deserves..
Perhaps the most well known log structured filesystem is LFS from *BSD
(written before the fork), and it is a good example of the above fate..
however, the approach taken by LFS does not fit that well for a cache, as LFS
has to assume the constraints of a traditional filesystem, which a cache
filesystem does not need to do.
The main ideas for log structured filesystems were presented around
1987-1990. First as an approach to storing a writeable filesystem on
write-once media, then as a viable approach to optimizing a filesystem for
writing.
I did not know about "log structured filesystems" when I first presented the
ideas here on squid-dev. It was someone here who showed me the light on this
quite interesting but lesser known area of computing research.
A quite good introduction to the subject is
Beating the I/O Bottleneck: A Case for Log-Structured File Systems,
Ousterhout & Douglis
It can be found at http://collective.cpoint.net/lfs/papers/lfs-case.ps.gz
A quite extensive collection of papers on the subject can be found on the
same Linux Log-Structured Filesystem project page
<http://collective.cpoint.net/lfs/>, even if the project as such died long
before becoming anything remotely useful... There are a couple of partly
relevant papers not on that list, but all the major ones are there.
> Anyway, pretty cool. To be honest, though, at this point I think the
> CPU usage problem is one of the biggest that Squid faces. When we can
> get 80-100 reqs/sec from a single IDE disk (dependent on workload), and
> only 130 from two, and only 150 from three or four or five--even though
> they each have a dedicated bus...the law of diminishing returns is
> getting pretty ugly and it really hurts scalability. Of course, CPU
> could be saved somewhat by reducing disk overhead.
Yes, I know, and that is why I am also tinkering with the eventio model, and
trying to find out how one can make a caching proxy SMP scalable.
eventio allows for experimenting with efficient networking designs.
The filesystem design, including its metadata, makes it easier for SMP
systems with multiple writers to manage a shared cache.
> True. I like the idea of COSS for a lot of reasons...And I'm doubtful
> that any more traditional FS type could do better for small objects.
> The ideal balance, I expect, will be from the following layout:
>
> A real hot object memory cache with lazy flushes to -> disk I/O layer
>
> Disk I/O then selects either COSS or a more traditional FS based on
> object size (small objects go to COSS, while big objects feed into a UFS
> or logged FS).
Or COSS is fixed to also be able to deal with large objects.. and you only
use COSS..
But there may be gains from a policy point of view in separating large and
small objects from each other. It is quite likely one wants to apply a
different storage and removal policy to large objects than to small ones..
i.e. store them on larger and cheaper media, and keep them longer, as the
benefit from a single large hit is quite large at a low operation cost, even
if such hits are less frequent.
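The split itself is cheap to implement; a sketch where the 64 KB threshold
and the store names are arbitrary illustration values, not anything taken
from Squid:

/* Route objects to different stores (and thus different removal policies)
 * purely by size.  The threshold is an arbitrary example value. */
#include <stddef.h>
#include <stdio.h>

#define LARGE_OBJECT_THRESHOLD (64 * 1024)   /* hypothetical 64 KB cut-off */

enum store_kind {
    STORE_SMALL,    /* packed log / COSS-style store on fast disks */
    STORE_LARGE     /* larger, cheaper media with longer retention */
};

static enum store_kind choose_store(size_t object_size)
{
    return object_size >= LARGE_OBJECT_THRESHOLD ? STORE_LARGE : STORE_SMALL;
}

int main(void)
{
    printf("8 KB object -> %s\n",
           choose_store(8 * 1024) == STORE_SMALL ? "small store" : "large store");
    printf("2 MB object -> %s\n",
           choose_store(2 * 1024 * 1024) == STORE_SMALL ? "small store" : "large store");
    return 0;
}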
> Lazy writes can be implemented pretty easily, I think, and would give a
> pretty good improvement to throughput while hopefully only marginally
> impacting hit rate. Right now, Squid is compelled to write every object
> to disk, leading to a /lot/ of outgoing disk bandwidth for objects that
> as you said never get used. Someone on squid-users mentioned last week
> that he had implemented 'picky writes' for his accelerators which is
> similar--it writes an object only if the object is accessed a second
> time while it still resides in the cache_mem.
Reminds me of the flat layered hierarchy model I was tinkering with some
years ago..
N frontend proxies doing the dirty work of network I/O and hot object caching.
A smaller set (one, maybe two) of larger backend proxies to which cache hits
from the frontend proxies are sent. These backends keep track of the contents
of all frontends, fetch the object from the frontend that has it when there
is a hit on another frontend that does not, and also keep a longer history of
object validity than the frontends are capable of managing. The idea is that
the backend proxy only collects objects that have been requested more than
once within a given timeframe (days, up to a week). It then stores these
objects for quite a long time in the hope that they will get accessed again.
Now, the growth in disk storage sizes has made this a less interesting
approach.
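The backend admission rule from that model is simple enough to sketch (the
three-day window and the request_record structure are assumptions for
illustration only):

/* Backend admission: only store an object once it has been requested more
 * than once within a limited window.  Window length is an assumption. */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

#define ADMISSION_WINDOW (3 * 24 * 3600)     /* "days, up to a week" */

typedef struct {
    time_t first_seen;    /* start of the current counting window */
    unsigned hits;        /* requests seen within that window */
} request_record;

/* Called by the backend for each request forwarded from a frontend. */
static bool backend_should_store(request_record *rec, time_t now)
{
    if (rec->hits == 0 || now - rec->first_seen > ADMISSION_WINDOW) {
        rec->first_seen = now;               /* window expired, start over */
        rec->hits = 0;
    }
    rec->hits++;
    return rec->hits >= 2;                   /* seen more than once */
}

int main(void)
{
    request_record rec = { 0, 0 };
    time_t now = time(NULL);
    printf("first request:  store? %d\n", backend_should_store(&rec, now));
    printf("second request: store? %d\n", backend_should_store(&rec, now + 3600));
    return 0;
}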
> This is a simple way to achieve it, and probably just as effective
> as more complicated flushing decisions. This, of course, requires
> a reasonable sized cache_mem and a policy to decide what to write
> and when. I will look into adding a 'picky writes' option to Squid
> to give it some testing.
With a good FS designed for a cache workload there is little reason for
'picky writes' unless you want to save disk space. There is plenty of
bandwidth to modern drives, and space is also plentiful. It is mostly I/O
operations/s the drives are short of. And the ratio between I/O bandwidth,
space and operations/s is steadily increasing, making operations/s a bigger
and bigger relative bottleneck with each drive generation... thus it is
getting increasingly important to find solutions that try to utilize what the
drives are good at, to maintain good cost effectiveness.
Regards
Henrik