[PATCH] Re: r13295: cache_mem + rock store = "memCopy: could not find start of [x,y) in memory"

From: Alex Rousskov <rousskov_at_measurement-factory.com>
Date: Wed, 19 Mar 2014 08:27:25 -0600

Hello,

    The attached patch avoids "FATAL: Squid has attempted to read data
from memory that is not present" crashes when shared memory caching is
enabled in trunk. It also improves related code.

Shared memory caching code was not checking whether the response is
generally cachable and, hence, tried to store responses that Squid
already refused to cache (e.g., release-requested entries). The shared
memory cache should also check that the response is contiguous because
the disk store may not check that (e.g., when disk caching is disabled).

The problem was exacerbated by the broken entry dump code accompanying
the FATAL message. The Splay tree iterator is evidently not iterating a
portion of the tree. I added a visitor pattern code to work around that,
but did not try to fix the Splay iterator itself because, in part, the
whole iterator design felt inappropriate (storing and flattening the
tree before iterating it!?) for its intended purpose. I could not check
quickly enough where the broken iterator is used besides
mem_hdr::debugDump() so more bugs like this are possible. Checking and
fixing this would be a nice compact project for a volunteer looking for
such projects.

Optimized StoreEntry::checkCachable() a little and removed wrong TODO
comment: Disk-only mayStartSwapOut() should not accumulate "general"
cachability checks which are used by RAM caches as well.

Added more mem_hdr debugging and polished method descriptions.

Fixed NullStoreEntry::mayStartSwapout() spelling/case. It should
override StoreEntry::mayStartSwapOut().

The patch is against trunk with Vector migration removed (to avoid
random crashes) and fatal() replaced with fatal_dump() as discussed on
this thread. I will commit the fatal_dump() change separately if nobody
beats me to it.

Thank you,

Alex.

On 03/04/2014 11:44 AM, Nikolai Gorchilov wrote:
> On Tue, Mar 4, 2014 at 7:50 PM, Alex Rousskov wrote:
>> On 03/04/2014 09:44 AM, Nikolai Gorchilov wrote:
>>> On Tue, Mar 4, 2014 at 6:01 PM, Alex Rousskov wrote:
>>>> On 03/03/2014 04:03 PM, Niki Gorchilov wrote:
>>>>
>>>>> FATAL: Squid has attempted to read data from memory that is not
>>>>> present. This is an indication of of (pre-3.0) code that hasn't been
>>>>> updated to deal with sparse objects in memory. Squid should
>>>>> coredump.allowing to review the cause.
>>>>
>>>> I have seen one manifestation of such a bug in Rock and fixed it. IIRC,
>>>> trunk r13295 you are using has that fix, but I am not 100% certain and
>>>> will need to research that. Getting a back trace may be useful.
>>>>
>>>>
>>>>> Unfortunately, no core dumps are available for unknown reason.
>>>>
>>>> That part I am sure about since I faced the same problem: The fatal()
>>>> call that src/stmem.cc code printing the above FATAL message is using
>>>> does not trigger a core dump; it just exits. We should be calling
>>>> fatal_dump() there instead.
>>>>
>>>
>>> OK. I've patched src/stmem.cc to use fatal_dump() and will try to
>>> collect a core during the night.
>>> Do you need debug ALL,9? With 150 mbps of traffic, it will be a huge
>>> log for just few seconds of operation I guess.
>>
>>
>> If you can collect an ALL,9 cache.log, please do so and keep it in case
>> it is needed later. If you cannot collect it (for any reason) or its
>> collection stops Squid from failing, then let's just get the backtrace.
>
> OK. cache.log and core dump collected and available on request.
>
> Here's the backtrace:
>
> (gdb) bt
> #0 0x00007f544dfcf425 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> #1 0x00007f544dfd2b8b in abort () from /lib/x86_64-linux-gnu/libc.so.6
> #2 0x0000000000539605 in fatal_dump (
> message=0x764428 "Squid has attempted to read data from memory
> that is not present. This is an indication of of (pre-3.0) code that
> hasn't been updated to deal with sparse objects in memory. Squid
> should coredump.allow"...)
> at fatal.cc:134
> #3 0x00000000005e3b56 in mem_hdr::copy (this=0x2a282e0, target=...)
> at stmem.cc:272
> #4 0x000000000060d794 in MemStore::copyToShmSlice (this=<optimized
> out>, e=..., anchor=...) at MemStore.cc:564
> #5 0x000000000060f22e in MemStore::copyToShm (this=0x27d5850, e=...)
> at MemStore.cc:540
> #6 0x000000000060f405 in MemStore::write (this=0x27d5850, e=...) at
> MemStore.cc:663
> #7 0x00000000005f7c88 in StoreController::memoryOut (this=<optimized
> out>, e=..., preserveSwappable=false) at store_dir.cc:889
> #8 0x0000000000600342 in StoreEntry::swapOut (this=0x2a260f0) at
> store_swapout.cc:206
> #9 0x00000000005f25a6 in StoreEntry::invokeHandlers (this=0x2a260f0)
> at store_client.cc:746
> #10 0x00000000005e6429 in StoreEntry::write (this=0x2a260f0,
> writeBuffer=...) at store.cc:863
> #11 0x00000000006049e1 in ServerStateData::storeReplyBody
> (this=0x2a7d8d8, data=<optimized out>, len=13140) at Server.cc:993
> #12 0x0000000000573b0f in HttpStateData::writeReplyBody
> (this=0x2a7d8d8) at http.cc:1369
> #13 0x00000000005764b5 in HttpStateData::processReplyBody
> (this=0x2a7d8d8) at http.cc:1435
> #14 0x000000000057b63a in HttpStateData::readReply (this=0x2a7d8d8,
> io=...) at http.cc:1235
> #15 0x000000000057d8db in JobDialer<HttpStateData>::dial
> (this=0x2a83730, call=...) at base/AsyncJobCalls.h:166
> #16 0x00000000006626a9 in AsyncCall::make (this=0x2a83700) at AsyncCall.cc:32
> #17 0x00000000006665fc in AsyncCallQueue::fireNext (this=<optimized
> out>) at AsyncCallQueue.cc:52
> #18 0x0000000000666950 in AsyncCallQueue::fire (this=0x24e9180) at
> AsyncCallQueue.cc:38
> #19 0x0000000000532bcc in EventLoop::runOnce (this=0x7fffa369b960) at
> EventLoop.cc:142
> #20 0x0000000000532d80 in EventLoop::run (this=0x7fffa369b960) at
> EventLoop.cc:104
> #21 0x00000000005a5185 in SquidMain (argc=<optimized out>,
> argv=<optimized out>) at main.cc:1532
> #22 0x00000000004beefb in SquidMainSafe (argv=<optimized out>,
> argc=<optimized out>) at main.cc:1260
> #23 main (argc=<optimized out>, argv=<optimized out>) at main.cc:1252
>
> Hope this helps.
>
> Niki
>

Received on Wed Mar 19 2014 - 14:27:39 MDT

This archive was generated by hypermail 2.2.0 : Thu Mar 20 2014 - 12:00:13 MDT