On Wed, Nov 25, 2009 at 7:48 AM, Amos Jeffries <squid3_at_treenet.co.nz> wrote:
> On Tue, 24 Nov 2009 16:13:37 -0700, Alex Rousskov
> <rousskov_at_measurement-factory.com> wrote:
>> On 11/20/2009 10:59 PM, Robert Collins wrote:
>>> On Tue, 2009-11-17 at 08:45 -0700, Alex Rousskov wrote:
>>>>>> Q1. What are the major areas or units of asynchronous code
> execution?
>>>>>> Some of us may prefer large areas such as "http_port acceptor" or
>>>>>> "cache" or "server side". Others may root for AsyncJob as the
> largest
>>>>>> asynchronous unit of execution. These two approaches and their
>>>>>> implications differ a lot. There may be other designs worth
>>>>>> considering.
>>
>>> I'd like to let people start writing (and perf testing!) patches. To
>>> unblock people. I think the primary questions are:
>>> - do we permit multiple approaches inside the same code base. E.g.
>>> OpenMP in some bits, pthreads / windows threads elsewhere, and 'job
>>> queues' or some such abstraction elsewhere ?
>>> (I vote yes, but with caution: someone trying something we don't
>>> already do should keep it on a branch and really measure it well until
>>> it's got plenty of buy-in).
>>
>> I vote for multiple approaches at lower levels of the architecture and
>> against multiple approaches at highest level of the architecture. My Q1
>> was only about the highest levels, BTW.
>>
>> For example, I do not think it is a good idea to allow a combination of
>> OpenMP, ACE, and something else as a top-level design. Understanding,
>> supporting, and tuning such a mix would be a nightmare, IMO.
>>
>> On the other hand, using threads within some disk storage schemes while
>> using processes for things like "cache" may make a lot of sense, and we
>> already have examples of some of that working.
>>
>
> OpenMP seems to be an almost unanimous negative among the people who know it.
>
OK
>>
>> This is why I believe that the decision of processes versus threads *at
>> the highest level* of the architecture is so important. Yes, we are,
>> can, and will use threads at lower levels. There is no argument there.
>> The question is whether we can also use threads to split Squid into
>> several instances of "major areas" like client side(s), cache(s), and
>> server side(s).
>>
>> See Henrik's email on why it is difficult to use threads at highest
>> levels. I am not convinced yet, but I do see Henrik's point, and I
>> consider the dangers he cites critical for the right Q1 answer.
>>
>>
>>> - If we do *not* permit multiple approaches, then what approach do we
>>> want for parallelisation. E.g. a number of long lived threads that take
>>> on work, or many transient threads as particular bits of the code need
>>> threads. I favour the former (long lived 'worker' threads).
>>
>> For highest-level models, I do not think that "one job per
>> thread/process", "one call per thread/process", or any other "one little
>> short-lived something per thread/process" is a good idea. I do believe
>> we have to parallelize "major areas", and I think we should support
>> multiple instances of some of those "areas" (e.g., multiple client
>> sides). Each "major area" would be long-lived process/thread, of course.
>
> Agreed, mostly.
>
> As Rob points out, the idea is for one smallish pathway of the code to be
> run N times, with different state data each time, by a single thread.
>
> Sachin's initial AcceptFD thread proposal would perhaps be the exemplar for
> this type of thread: one thread does the comm layer, from accept() through
> to the scheduling-call hand-off to handlers outside comm, then goes back
> for the next accept().
>
> The only performance issue brought up (by you) was that this particular
> case might flood the slower main process if done first. Not all code can be
> done this way.
>
> The overheads are simply moving the state data in/out of the thread. IMO
> starting/stopping threads too often is a fairly bad idea. Most events will
> end up being grouped together into types (perhaps categorized by
> component, perhaps by client request, perhaps by pathway) with a small
> thread dedicated to handling that type of call.
>
>>
>> Again for higher-level models, I am also skeptical that it is a good
>> idea to just split Squid into N mostly non-cooperating nearly identical
>> instances. It may be the right first step, but I would like to offer
>> more than that in terms of overall performance and tunability.
>
> The answer to that is: of all the SMP models we theorize about, that one
> is the only proven model so far.
> Administrators are already doing it on quad+ core machines, with all the
> instance management handled manually, and with a lot of performance
> success.
>
> In last night's discussion on IRC we covered which issues are outstanding
> in making this automatic; all are resolvable except the cache index, which
> is not easily shareable between instances.
>
>>
>> I hope the above explains why I consider Q1 critical for the meant
>> "highest level" scope and why "we already use processes and threads" is
>> certainly true but irrelevant within that scope.
>>
>>
>> Thank you,
>>
>> Alex.
>
> Thank you for clarifying that. I now think we are all more or less headed
> in the same direction(s), with three models proposed for the overall
> architecture.
>
> In the order they were brought up... (NP: the TODO only applies if we work
> towards that goal)
>
> MODEL: * fully threaded. some helper child processes
> PROS:
> smaller memory resource footprint when running.
>
> CONS:
> potentially larger CPU footprint from swapping data between threads.
> potential problems if threaded pathways are made too small relative to
> the overheads.
>
> TODO:
> continue polishing the code into distinct calls
> determine which code is thread-safe
> determine shared data and add appropriate locking
> turn the above segments into threads.
> add some way to pass events/calls to existing long-term threads:
> either ... a super-lock as described by Henrik,
> or ... a 2-queue alternative as described by Amos
>
>
> MODEL: * process chunks with sub-threads and sometimes helper child
> processes
> PROS:
> it's known to be very fast, but not amazingly so. (ref: postfix) (ref:
> squid helpers)
>
> CONS:
> current code uses a LOT of data sharing between components, particularly
> of small 1-32 byte chunks of random data (config flags, stats, shared cache
> data snippets).
> identifying distinct chunks is a big, time-consuming task.
>
> TODO:
> identify the major process chunks and split them out from the main binary
> add efficient ways to pass data cleanly between processes (at capacity).
> copy relevant external shared data into the state objects to pass along
> with the request data
> plus all the same TODOs from the fully-threaded model, for the sub-threads
> within each process.
>
>
> MODEL: * separate instances with sub-threads and helper child processes
> PROS:
> we can almost do the macro change today. (sub-threads later)
> it can scale the base app speed up by a reasonable percentage (ref:
> apache2)
>
> CONS:
> duplication of data, particularly in the storage, is very wasteful of
> resources.
> NP: apache evades this with effectively read-only disk data; all dynamics
> are in the instance memory.
>
> TODO:
> the -I option needs porting so the master can open main ports and
> children share the listening.
> finish the logging TCP module ideas (for reliable shared logging).
> some code to make the master process handle multiple children.
> some alterations to safely handle the shared config file settings
> (cache_dir etc).
>
>
> MODEL: * status quo.
> Where we continue to work on all the above TODOs as time permits and
> needs require, and wait to see which model gets finished first.
>
> PROS:
> the way forward is already well known.
>
> CONS:
> it's not reaching multi-CPU usage fast enough
>
>
> The easiest way forward seems to be toward separate instances, with
> finer-grained threading and/or process chunking done later, after deeper
> analysis, for extra gains at each change.
AGREED....
>
> This makes me think that we are not in fact proposing competing models,
> but simply looking at different levels of the code. Each approach that has
> come up may be best used at a different level: upper (instances), middle
> (processes, threads, jobs), and low (signals, events, cbdata, async calls).
>
> It also seems to me that the top-level instances choice is the most easily
> reversed if it's found to actually be a bad idea. The major supporting
> change would be in the parent main() code, setting up for several child
> instances. There are possibilities there for configuring it on/off, or how
> many instances.
>
>
> Amos
>
>
--
Mr. S. H. Malave
Computer Science & Engineering Department,
Walchand College of Engineering, Sangli.
sachinmalave_at_wce.org.in

Received on Wed Nov 25 2009 - 06:50:00 MST