Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] SM init failures
From: Eugene Loh (Eugene.Loh_at_[hidden])
Date: 2009-03-31 15:06:05


Jeff Squyres wrote:

> On Mar 31, 2009, at 1:46 AM, Eugene Loh wrote:
>
>> > FWIW, George found what looks like a race condition in the sm init
>> > code today -- it looks like we don't call maffinity anywhere in the
>> > sm btl startup, so we're not actually guaranteed that the memory is
>> > local to any particular process(or) (!). This race shouldn't cause
>> > segvs, though; it should only mean that memory is potentially
>> > farther away than we intended.
>>
>> Is this that business that came up recently on one of these mail lists
>> about setting the memory node to -1 rather than using the value we know
>> it should be? In mca_mpool_sm_alloc(), I do see a call to
>> opal_maffinity_base_bind().
>
> No, it was a different thing -- but we missed the call to maffinity
> in mpool sm. So that might make George's point moot (I see he still
> hasn't chimed in yet on this thread, perhaps that's why ;-) ).
>
> To throw a little flame on the fire -- I notice the following from an
> MTT run last night:
>
> [svbu-mpi004:17172] *** Process received signal ***
> [svbu-mpi004:17172] Signal: Segmentation fault (11)
> [svbu-mpi004:17172] Signal code: Invalid permissions (2)
> [svbu-mpi004:17172] Failing at address: 0x2a98a3f080
> [svbu-mpi004:17172] [ 0] /lib64/tls/libpthread.so.0 [0x2a960695b0]
> [svbu-mpi004:17172] [ 1] /home/jsquyres/bogus/lib/openmpi/mca_btl_sm.so [0x2a97f22619]
> [svbu-mpi004:17172] [ 2] /home/jsquyres/bogus/lib/openmpi/mca_btl_sm.so [0x2a97f225ee]
> [svbu-mpi004:17172] [ 3] /home/jsquyres/bogus/lib/openmpi/mca_btl_sm.so [0x2a97f22946]
> [svbu-mpi004:17172] [ 4] /home/jsquyres/bogus/lib/libopen-pal.so.0(opal_progress+0xa9) [0x2a95bbc078]
> [svbu-mpi004:17172] [ 5] /home/jsquyres/bogus/lib/libmpi.so.0 [0x2a95831324]
> [svbu-mpi004:17172] [ 6] /home/jsquyres/bogus/lib/libmpi.so.0 [0x2a9583185b]
> [svbu-mpi004:17172] [ 7] /home/jsquyres/bogus/lib/openmpi/mca_coll_tuned.so [0x2a987e45be]
> [svbu-mpi004:17172] [ 8] /home/jsquyres/bogus/lib/openmpi/mca_coll_tuned.so [0x2a987f160b]
> [svbu-mpi004:17172] [ 9] /home/jsquyres/bogus/lib/openmpi/mca_coll_tuned.so [0x2a987e4c2e]
> [svbu-mpi004:17172] [10] /home/jsquyres/bogus/lib/libmpi.so.0(PMPI_Barrier+0xd7) [0x2a9585987f]
> [svbu-mpi004:17172] [11] src/MPI_Type_extent_types_c(main+0xa20) [0x402f88]
> [svbu-mpi004:17172] [12] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x2a9618e3fb]
> [svbu-mpi004:17172] [13] src/MPI_Type_extent_types_c [0x4024da]
> [svbu-mpi004:17172] *** End of error message ***
>
> Notice the "invalid permissions" message. I didn't notice that
> before, but perhaps I wasn't looking.
>
> I also see that this is under coll_tuned, not coll_hierarch (i.e.,
> *not* during MPI_INIT -- it's in a barrier).

Yes, actually these happen "a lot". (I've been spending time looking at
IU_Sif/r20880 MTT stack traces.)

If the stack trace has MPI_Init in it, it seems to be going through
mca_coll_hierarch.

Otherwise, the segfault is in a collective call, as you note -- it could
be MPI_Allgather, MPI_Barrier, MPI_Bcast, and I imagine there are others
-- going through mca_coll_tuned and eventually down to the sm BTL.

There are also quite a few orphaned(?) stack traces: just a segfault
and a single-level stack a la
[ 0] /lib/libpthread.so

>> > The central question is: does "first touch" mean both read and
>> > write? I.e., is the first process that either reads *or* writes to a
>> > given location considered "first touch"? Or is it only the first
>> > write?
>>
>> So, maybe the strategy is to create the shared area, have each process
>> initialize its portion (FIFOs and free lists), have all processes sync,
>> and then move on. That way, you know all memory will be written by the
>> appropriate owner before it's read by anyone else. First-touch
>> ownership will be proper and we won't be dependent on zero-filled
>> pages.
>
> That was what George was going at yesterday -- there's a section in
> the btl sm startup where you're setting up your own FIFOs. But then
> there's a section later where you're looking at your peers' FIFOs.
> There's no synchronization between these two points -- when you're
> looking at your peers' FIFOs, you can tell if they're not setup yet
> by if the peer's FIFO is NULL or not. If it's NULL, you loop and
> try again (until it's not NULL). This is what George thought might
> be "bad" from a maffinity standpoint -- but perhaps this is moot if
> mpool sm is calling maffinity bind.
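
As an aside, the initialize-then-sync strategy described above might
look something like the following minimal sketch -- a plain shared
segment and a process-shared pthread barrier, with made-up names; this
is not the actual sm BTL code:

/* Hypothetical illustration of "each process writes its own slice,
 * then everyone syncs before reading anyone else's".  Writing (not
 * just reading) the slice makes first-touch place those pages on the
 * toucher's NUMA node. */
#include <pthread.h>
#include <string.h>

#define NPROCS   4
#define SEG_SIZE 4096                /* per-process slice */

typedef struct {
    pthread_barrier_t barrier;       /* initialized once by one rank
                                        with PTHREAD_PROCESS_SHARED */
    char slices[NPROCS][SEG_SIZE];
} shared_area_t;

/* Every rank calls this after mmap()ing the segment MAP_SHARED. */
void init_my_slice(shared_area_t *area, int my_rank)
{
    /* First touch: write my whole slice so the pages land locally. */
    memset(area->slices[my_rank], 0, SEG_SIZE);

    /* Nobody reads a peer's slice until every rank has written its
     * own, so first-touch ownership comes out right and we don't
     * depend on zero-filled pages. */
    pthread_barrier_wait(&area->barrier);
}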

The thing I was wondering about was memory barriers. E.g., you
initialize stuff and then post the FIFO pointer; without a barrier, the
other process can see the FIFO pointer before it sees the initialized
memory.
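
To make that concrete, here is a minimal sketch of the publish/poll
pattern with explicit ordering, written with portable C11 atomics purely
for illustration (in-tree, the opal_atomic_wmb()/opal_atomic_rmb()
barriers would play this role; the types and names below are made up):

#include <stdatomic.h>
#include <stddef.h>

typedef struct {
    int head;
    int tail;
    /* ...queue entries... */
} fifo_t;

/* Writer: initialize the FIFO *before* making the pointer visible.
 * The release store keeps the initializing stores from being
 * reordered past the pointer publication. */
void publish_fifo(_Atomic(fifo_t *) *slot, fifo_t *f)
{
    f->head = 0;
    f->tail = 0;
    atomic_store_explicit(slot, f, memory_order_release);
}

/* Reader: spin until the peer's pointer is non-NULL.  The acquire
 * load guarantees that once the pointer is seen, the writer's
 * initialization of *f is seen too. */
fifo_t *wait_for_fifo(_Atomic(fifo_t *) *slot)
{
    fifo_t *f;
    while ((f = atomic_load_explicit(slot,
                                     memory_order_acquire)) == NULL)
        ;   /* peer hasn't posted its FIFO yet */
    return f;
}

Without the release/acquire pair (or an equivalent wmb/rmb pair), the
reader can legally observe the non-NULL pointer before the stores that
initialized the FIFO behind it.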

>> The big question in my mind remains that we don't seem to know how to
>> reproduce the failure (segv) that we're trying to fix. I, personally,
>> am reluctant to stick fixes into the code for problems I can't observe.
>
> Well, we *can* observe them -- I can reproduce them at a very low
> rate in my MTT runs. We just don't understand the problem yet to
> know how to reproduce them manually. To be clear: I'm violently
> agreeing with you: I want to fix the problem, but it would be much
> mo' betta to *know* that we fixed the problem rather than "well, it
> doesn't seem to be happening anymore." :-)