Re: /bzr/squid3/trunk/ r11740: Do not let cache manager requests kill SMP Squid using isOpen() assertion.

From: Alex Rousskov <rousskov_at_measurement-factory.com>
Date: Wed, 21 Sep 2011 11:20:09 -0600

On 09/17/2011 10:42 AM, Alex Rousskov wrote:
> On 09/17/2011 02:01 AM, Amos Jeffries wrote:
>> On 16/09/11 16:29, Alex Rousskov wrote:
>>> On 09/15/2011 08:24 PM, Amos Jeffries wrote:
>>>> On 16/09/11 06:16, Alex Rousskov wrote:
>>>>> ------------------------------------------------------------
>>>>> revno: 11740
>>>>> committer: Alex Rousskov<rousskov_at_measurement-factory.com>
>>>>> branch nick: trunk
>>>>> timestamp: Thu 2011-09-15 12:16:59 -0600
>>>>> message:
>>>>> Do not let cache manager requests kill SMP Squid using isOpen()
>>>>> assertion.
>>>>>
>>>>> As the comment above the close call implies, we have not imported
>>>>> the foreign socket descriptor into our fd_table yet. We must use
>>>>> raw close(2), just like the corresponding
>>>>> Mgr::Request::Request(msg) code that allocates request.conn, uses
>>>>> raw assignment to give that half-baked connection a descriptor.
>>>>>
>>>>> TODO: This direct manipulation of Connection::fd is ugly, and this
>>>>> half-baked connection will most likely cause more [hidden] problems
>>>>> down the road. For example, Mgr::Request destructor will assert in
>>>>> a similar way if the request object is destroyed before
>>>>> Action::respond() is called.>> modified:
>>>>> src/mgr/Action.cc
>>>
>>>
>>>> Problem is: why is the FD getting outside the IPC Coordinator/Strand
>>>> scope when it is not "valid" according to fd_table?
>>>
>>>
>>> It is not 100% clear where that IPC scope ends in this case, but the
>>> problem is not really with the scope borders. The bug was due to the
>>> [old] action creator saying "we are not going to import this raw fd so
>>> that each specific action can decide whether it actually needs an fd"
>>> and the [new] default action code saying "we are going to close this
>>> Comm connection" even though the Connection object was in that
>>> half-baked state.
>>>
>>>
>>>> - If it can't be read/write/close, it should not really be passed
>>>> round
>>>> at all.
>>>> - If it *can* be read/write/close, the ->close() is the right way to
>>>> close and remove the possible handlers.
>>>
>>> It can be read/written/closed, but it can be read/written/closed using
>>> Comm only after the fd is imported into Comm space. The old code did not
>>> import and used raw close(2) by default. The new code still did not
>>> import but used Comm close by default.
>>>
>>>
>>>> Correlation of these is that the code doing conn->fd = msg.getFd() when
>>>> unpacking the TypedMsgHdr should be (a) not, or (b) setting up fd_table
>>>> state sufficiently to match the Comm::Connection state.
>>>
>>
>>
>> Option (1)
>>
>>> Yes, that is why I left a TODO note: We can change the old code to
>>> import the fd into Comm [and create a full Connection object] as soon as
>>> possible. It would be a waste of cycles in most cases, but it will
>>> prevent half-baked Connection objects from lingering around and causing
>>> more asserts or fd leaks.
>>>
>>
>> option (2)
>>> The alternative is to use a "raw fd holder" class/object that would use
>>> raw close(2) unless a fully-baked Connection object is created. SSL code
>>> uses that approach for many temporary SSL things to avoid leaks.
>>
>> option (3):
>> Comm::Connection::close() could detect the fd_table flag which is
>> asserting right? then it could use ::close() instead of comm_close().
>
> I do not like option (3) because it will prevent us from detecting bugs
> where a regular fd is closed twice (from two Connection objects) and
> other code logic problems.
>
>
>> option (4):
>> Or probably, add another Comm::Connection flag to indicate "partial
>> import" for the opening TCP flags. Which triggers the particular closing
>> method if still set on ->close().
>
>
>>> Since this is not a performance-sensitive area of the code, the former
>>> (i.e., create a Connection object ASAP) may be better/simpler.
>>
>> Well, like you keep reminding me. Everything is performance sensitive
>> indirectly.
>
> I think that would be an exaggeration. It is difficult to imagine a
> realistic scenario where we need to super-optimize cache manager
> queries, but if such a need does surface, we can always migrate from
> simpler to more complex (option #1 to another option).
>
>
>> So I'm thinking option (4), with option (1) in specific cases would be
>> best.
>> Option (3) whenever we fiddle with the fd directly. Option (1)
>> whenever its known that the FD will be used. Possibly some time after
>> option (3) was done, so unsetting the half-imported flag if set earlier.
>> This way we avoid fd_table and related CPU consumptions entirely for
>> most of these. And still roughly follow the semantics of what the flags
>> mean: things done when opening this FD, so we can repeat the sequence in
>> comm_reset() as needed.
>
> I think we should keep it simple and use Option #1 for all cases. The
> fewer half-baked things we have, the better (this is not limited to
> asserts when closing a descriptor). If you insist, I can live with
> option #4.
>
>
> BTW, I think Ipc::ImportFdIntoComm() is now broken as well. If things go
> wrong, it calls conn->close() for a half-baked connection. If you work
> on this, please try to find all cases where Connection code changed
> close(2) to conn->close(). There may be more such bugs lurking there.

Found another similar assertion which exposed a related but different
problem.

Assertion: When Coordinator gets an IPC cache manager request for an
action that Coordinator cannot find, it asserts because there is no
Action::respond() call to close the half-baked connection. Both options
#1 and #4 would solve this.

Problem: Coordinator must know of all actions known to workers because
Coordinator cannot aggregate cache manager responses without an
action-specific Action object. Currently, there are actions Coordinator
does not know about but workers do (e.g., mgr:sourcehash). This is
probably due to some modules not being initialized in Coordinator
because it does not need them for any other purpose.

I can try to take care of the above Problem. Will you work on
implementing option #1 (or #4) to get rid of assertions?

Thank you,

Alex.
P.S. We also do not handle Coordinator asserting/quitting very well, but
that is a very different problem outside this thread scope.
Received on Wed Sep 21 2011 - 17:20:57 MDT

This archive was generated by hypermail 2.2.0 : Wed Sep 21 2011 - 12:00:03 MDT