Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] SM init failures
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-03-31 10:58:58


On Mar 31, 2009, at 1:46 AM, Eugene Loh wrote:

> > FWIW, George found what looks like a race condition in the sm init
> > code today -- it looks like we don't call maffinity anywhere in the
> > sm btl startup, so we're not actually guaranteed that the memory is
> > local to any particular process(or) (!). This race shouldn't cause
> > segvs, though; it should only mean that memory is potentially
> > farther away than we intended.
>
> Is this that business that came up recently on one of these mail
> lists about setting the memory node to -1 rather than using the value
> we know it should be? In mca_mpool_sm_alloc(), I do see a call to
> opal_maffinity_base_bind().
>

No, it was a different thing -- but we missed the call to maffinity in
mpool sm. So that might make George's point moot (I see he still
hasn't chimed in yet on this thread, perhaps that's why ;-) ).

To throw a little fuel on the fire -- I noticed the following in an
MTT run last night:

[svbu-mpi004:17172] *** Process received signal ***
[svbu-mpi004:17172] Signal: Segmentation fault (11)
[svbu-mpi004:17172] Signal code: Invalid permissions (2)
[svbu-mpi004:17172] Failing at address: 0x2a98a3f080
[svbu-mpi004:17172] [ 0] /lib64/tls/libpthread.so.0 [0x2a960695b0]
[svbu-mpi004:17172] [ 1] /home/jsquyres/bogus/lib/openmpi/mca_btl_sm.so [0x2a97f22619]
[svbu-mpi004:17172] [ 2] /home/jsquyres/bogus/lib/openmpi/mca_btl_sm.so [0x2a97f225ee]
[svbu-mpi004:17172] [ 3] /home/jsquyres/bogus/lib/openmpi/mca_btl_sm.so [0x2a97f22946]
[svbu-mpi004:17172] [ 4] /home/jsquyres/bogus/lib/libopen-pal.so.0(opal_progress+0xa9) [0x2a95bbc078]
[svbu-mpi004:17172] [ 5] /home/jsquyres/bogus/lib/libmpi.so.0 [0x2a95831324]
[svbu-mpi004:17172] [ 6] /home/jsquyres/bogus/lib/libmpi.so.0 [0x2a9583185b]
[svbu-mpi004:17172] [ 7] /home/jsquyres/bogus/lib/openmpi/mca_coll_tuned.so [0x2a987e45be]
[svbu-mpi004:17172] [ 8] /home/jsquyres/bogus/lib/openmpi/mca_coll_tuned.so [0x2a987f160b]
[svbu-mpi004:17172] [ 9] /home/jsquyres/bogus/lib/openmpi/mca_coll_tuned.so [0x2a987e4c2e]
[svbu-mpi004:17172] [10] /home/jsquyres/bogus/lib/libmpi.so.0(PMPI_Barrier+0xd7) [0x2a9585987f]
[svbu-mpi004:17172] [11] src/MPI_Type_extent_types_c(main+0xa20) [0x402f88]
[svbu-mpi004:17172] [12] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x2a9618e3fb]
[svbu-mpi004:17172] [13] src/MPI_Type_extent_types_c [0x4024da]
[svbu-mpi004:17172] *** End of error message ***

Notice the "invalid permissions" message. I didn't notice that
before, but perhaps I wasn't looking.
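
For reference, on Linux a SIGSEGV signal code of 2 is SEGV_ACCERR (the
address is mapped, but the access isn't permitted), as opposed to
SEGV_MAPERR (1: the address isn't mapped at all). Here's a minimal
sketch of decoding it. Note that segv_code_name() is a made-up helper
for illustration, not an Open MPI function, and the strings are only
my assumption of what the backtrace printer emits:

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>

/* Hypothetical helper (not part of Open MPI): turn the si_code that
 * accompanies a SIGSEGV into a human-readable string. */
const char *segv_code_name(int si_code)
{
    switch (si_code) {
    case SEGV_MAPERR:   /* address is not mapped to any object */
        return "Address not mapped";
    case SEGV_ACCERR:   /* address is mapped, but permissions forbid the access */
        return "Invalid permissions";
    default:
        return "Unknown";
    }
}
```

In a real handler installed with sigaction() and SA_SIGINFO, the value
to pass in would be the si_code field of the siginfo_t argument.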

I also see that this is under coll_tuned, not coll_hierarch (i.e.,
*not* during MPI_INIT -- it's in a barrier).

> > The central question is: does "first touch" mean both read and
> > write? I.e., is the first process that either reads *or* writes
> > to a given location considered "first touch"? Or is it only the
> > first write?
>
> So, maybe the strategy is to create the shared area, have each process
> initialize its portion (FIFOs and free lists), have all processes
> sync, and then move on. That way, you know all memory will be written
> by the appropriate owner before it's read by anyone else. First-touch
> ownership will be proper and we won't be dependent on zero-filled
> pages.
>

That was what George was getting at yesterday -- there's a section in
the btl sm startup where you set up your own FIFOs, and a section
later where you look at your peers' FIFOs. There's no synchronization
between these two points. When you look at a peer's FIFO, you can tell
whether it has been set up yet by checking whether that FIFO is NULL.
If it's NULL, you loop and try again (until it's not NULL). This is
what George thought might be "bad" from a maffinity standpoint -- but
perhaps this is moot if mpool sm is calling maffinity bind.
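
To make the two approaches concrete, here's a rough sketch, assuming
C11 atomics and a made-up layout. Every name in it (shared_area_t,
fifo_t, NPEERS, peer_init(), peer_fifo_poll(), wait_all_ready()) is
hypothetical; this is not the actual btl sm code, which would also
store offsets into the shared segment rather than raw pointers:

```c
#include <stdatomic.h>
#include <stddef.h>

#define NPEERS 4    /* hypothetical number of local processes */

/* Stand-in for a FIFO; the real sm FIFO is more involved. */
typedef struct { int head, tail; } fifo_t;

/* One slot per peer in the shared area; NULL means "not set up yet". */
typedef struct {
    _Atomic(fifo_t *) fifo[NPEERS];
    atomic_int ready;   /* number of peers that have finished init */
} shared_area_t;

/* Each peer writes (first-touches) only its own FIFO, then publishes it. */
void peer_init(shared_area_t *sh, int me, fifo_t *mine)
{
    mine->head = mine->tail = 0;
    atomic_store_explicit(&sh->fifo[me], mine, memory_order_release);
    atomic_fetch_add_explicit(&sh->ready, 1, memory_order_release);
}

/* The behavior described above: no sync point, so spin until the
 * peer's FIFO shows up as non-NULL. */
fifo_t *peer_fifo_poll(shared_area_t *sh, int peer)
{
    fifo_t *f;
    while ((f = atomic_load_explicit(&sh->fifo[peer],
                                     memory_order_acquire)) == NULL) {
        /* peer not set up yet; try again */
    }
    return f;
}

/* The "everyone initializes, then sync" alternative: wait until all
 * peers have published before anyone reads anyone else's FIFO. */
void wait_all_ready(shared_area_t *sh)
{
    while (atomic_load_explicit(&sh->ready, memory_order_acquire) < NPEERS) {
        /* spin until every owner has written its own portion */
    }
}
```

With the second approach, each process would call peer_init() on its
own slot, then wait_all_ready(), and only then touch its peers' FIFOs,
so every FIFO is written by its owner before anyone else reads it.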

> The big question in my mind remains that we don't seem to know how to
> reproduce the failure (segv) that we're trying to fix. I, personally,
> am reluctant to stick fixes into the code for problems I can't
> observe.
>

Well, we *can* observe them -- I can reproduce them at a very low rate
in my MTT runs. We just don't understand the problem well enough yet to
know how to reproduce them manually. To be clear, I'm violently
agreeing with you: I want to fix the problem, but it would be much mo'
betta to *know* that we fixed the problem rather than "well, it doesn't
seem to be happening anymore." :-)

-- 
Jeff Squyres
Cisco Systems