
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] MPI process hangs if OpenMPI is compiled with --enable-thread-multiple
From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2013-12-02 18:40:03


I'm joining this thread late, but I think I know what is going on:

- I am able to replicate the hang with 1.7.3 on Mavericks (with threading enabled, etc.)
- I notice that the hang has disappeared at the 1.7.x branch head (also on Mavericks)

In other words: can you try with the latest 1.7.x nightly tarball and verify that the problem disappears for you? See http://www.open-mpi.org/nightly/v1.7/

Ralph recently brought over a major ORTE control message change to the 1.7.x branch (after 1.7.3 was released) that -- skipping lots of details -- changes how the shared memory bootstrapping works. Based on the stack traces you sent and the ones I was also able to get, I'm thinking that Ralph's big ORTE change fixes this issue.
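For anyone wanting to try this, a minimal build-and-retest sketch follows. The tarball filename below is a placeholder (nightly snapshot names change daily; check the nightly page for the current one), and the install prefix and job count are arbitrary choices:

```shell
# Fetch a 1.7.x nightly snapshot. "openmpi-1.7.4a1rNNNNN" is a placeholder
# name; substitute the actual filename listed at
# http://www.open-mpi.org/nightly/v1.7/
wget http://www.open-mpi.org/nightly/v1.7/openmpi-1.7.4a1rNNNNN.tar.bz2
tar xjf openmpi-1.7.4a1rNNNNN.tar.bz2
cd openmpi-1.7.4a1rNNNNN

# Configure with the single threading flag under discussion; per Ralph's
# note below, it sets the other required threading options internally.
./configure --prefix=$HOME/ompi-nightly --enable-mpi-thread-multiple
make -j4 install

# Put the new build first on the PATH and re-run the test that hung.
export PATH=$HOME/ompi-nightly/bin:$PATH
which mpirun     # confirm the nightly build is being picked up
mpirun -n 2 ./ex5
```

Checking `which mpirun` first avoids the common pitfall of accidentally re-running against the previously installed 1.7.3 (e.g. the Homebrew copy in /usr/local).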

On Nov 25, 2013, at 10:52 PM, Dominique Orban <dominique.orban_at_[hidden]> wrote:

>
> On 2013-11-25, at 9:02 PM, Ralph Castain <rhc.openmpi_at_[hidden]> wrote:
>
>> On Nov 25, 2013, at 5:04 PM, Pierre Jolivet <jolivet_at_[hidden]> wrote:
>>
>>>
>>> On Nov 24, 2013, at 3:03 PM, Jed Brown <jedbrown_at_[hidden]> wrote:
>>>
>>>> Ralph Castain <rhc_at_[hidden]> writes:
>>>>
>>>>> Given that we have no idea what Homebrew uses, I don't know how we
>>>>> could clarify/respond.
>>>>
>>>
>>> Ralph, it is pretty easy to know what Homebrew uses, c.f. https://github.com/mxcl/homebrew/blob/master/Library/Formula/open-mpi.rb (sorry if you meant something else).
>>
>> Might be a surprise, but I don't track all these guys :-)
>>
>> Homebrew is new to me
>>
>>>
>>>> Pierre provided a link to MacPorts saying that all of the following
>>>> options were needed to properly enable threads.
>>>>
>>>> --enable-event-thread-support --enable-opal-multi-threads --enable-orte-progress-threads --enable-mpi-thread-multiple
>>>>
>>>> If that is indeed the case, and if passing some subset of these options
>>>> results in deadlock, it's not exactly user-friendly.
>>>>
>>>> Maybe --enable-mpi-thread-multiple is enough, in which case MacPorts is
>>>> doing something needlessly complicated and Pierre's link was a red
>>>> herring?
>>>
>>> That is very likely, though on the other hand, Homebrew is doing something pretty straightforward. I just wanted a quick and easy fix back when I had the same hanging issue, but there should be a better explanation if --enable-mpi-thread-multiple is indeed enough.
>>
>> It is enough - we set all required things internally
>
> Is that for sure? My original message originates from a hang in the PETSc tests and I get quite different results depending on whether I compile OpenMPI with --enable-mpi-thread-multiple only or not.
>
> I recompiled PETSc with debugging enabled against OpenMPI built with the "correct" flags mentioned by Pierre, and this is the stack trace I get:
>
> $ mpirun -n 2 xterm -e gdb ./ex5
>
> ^C
> Program received signal SIGINT, Interrupt.
> 0x00007fff991160fa in __psynch_cvwait ()
> from /usr/lib/system/libsystem_kernel.dylib
> (gdb) where
> #0 0x00007fff991160fa in __psynch_cvwait ()
> from /usr/lib/system/libsystem_kernel.dylib
> #1 0x00007fff98d6ffb9 in ?? () from /usr/lib/system/libsystem_c.dylib
>
>
> ^C
> Program received signal SIGINT, Interrupt.
> 0x00007fff991160fa in __psynch_cvwait ()
> from /usr/lib/system/libsystem_kernel.dylib
> (gdb) where
> #0 0x00007fff991160fa in __psynch_cvwait ()
> from /usr/lib/system/libsystem_kernel.dylib
> #1 0x00007fff98d6ffb9 in ?? () from /usr/lib/system/libsystem_c.dylib
>
>
> If I recompile PETSc against OpenMPI built with --enable-mpi-thread-multiple only (leaving out the other flags, which Pierre suggested is wrong), I get the following traces:
>
> ^C
> Program received signal SIGINT, Interrupt.
> 0x00007fff991160fa in __psynch_cvwait ()
> from /usr/lib/system/libsystem_kernel.dylib
> (gdb) where
> #0 0x00007fff991160fa in __psynch_cvwait ()
> from /usr/lib/system/libsystem_kernel.dylib
> #1 0x00007fff98d6ffb9 in ?? () from /usr/lib/system/libsystem_c.dylib
>
>
> ^C
> Program received signal SIGINT, Interrupt.
> 0x0000000101edca28 in mca_common_sm_init ()
> from /usr/local/Cellar/open-mpi/1.7.3/lib/libmca_common_sm.4.dylib
> (gdb) where
> #0 0x0000000101edca28 in mca_common_sm_init ()
> from /usr/local/Cellar/open-mpi/1.7.3/lib/libmca_common_sm.4.dylib
> #1 0x0000000101ed8a38 in mca_mpool_sm_init ()
> from /usr/local/Cellar/open-mpi/1.7.3/lib/openmpi/mca_mpool_sm.so
> #2 0x0000000101c383fa in mca_mpool_base_module_create ()
> from /usr/local/lib/libmpi.1.dylib
> #3 0x0000000102933b41 in mca_btl_sm_add_procs ()
> from /usr/local/Cellar/open-mpi/1.7.3/lib/openmpi/mca_btl_sm.so
> #4 0x0000000102929dfb in mca_bml_r2_add_procs ()
> from /usr/local/Cellar/open-mpi/1.7.3/lib/openmpi/mca_bml_r2.so
> #5 0x000000010290a59c in mca_pml_ob1_add_procs ()
> from /usr/local/Cellar/open-mpi/1.7.3/lib/openmpi/mca_pml_ob1.so
> #6 0x0000000101bd859b in ompi_mpi_init () from /usr/local/lib/libmpi.1.dylib
> #7 0x0000000101bf24da in MPI_Init_thread () from /usr/local/lib/libmpi.1.dylib
> #8 0x00000001000724db in PetscInitialize (argc=0x7fff5fbfed48,
> args=0x7fff5fbfed40, file=0x0,
> help=0x1000061c0 "Bratu nonlinear PDE in 2d.\nWe solve the Bratu (SFI - soid fuel ignition) problem in a 2D rectangular\ndomain, using distributed arrays(DMDAs) to partition the parallel grid.\nThe command line options"...)
> at /tmp/petsc-3.4.3/src/sys/objects/pinit.c:675
> #9 0x0000000100000d8c in main ()
>
>
> Line 675 of pinit.c is
>
> ierr = MPI_Init_thread(argc,args,MPI_THREAD_FUNNELED,&provided);CHKERRQ(ierr);
>
>
> Dominique
>
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/