Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] MPI process hangs if OpenMPI is compiled with --enable-thread-multiple
From: Dominique Orban (dominique.orban_at_[hidden])
Date: 2013-12-04 12:13:25


I built the 1.7.x nightly tarball on OS X 10.8 (Mountain Lion) and 10.9 (Mavericks), and it still hangs. I tried compiling with --enable-mpi-thread-multiple only, and also with the other options Pierre mentioned. The PETSc tests hang in both cases.

I'm curious to know whether the nightly tarball fixes the issue for other users.

On 2013-12-02, at 6:40 PM, Jeff Squyres (jsquyres) <jsquyres_at_[hidden]> wrote:

> I'm joining this thread late, but I think I know what is going on:
>
> - I am able to replicate the hang with 1.7.3 on Mavericks (with threading enabled, etc.)
> - I notice that the hang has disappeared at the 1.7.x branch head (also on Mavericks)
>
> Meaning: can you try with the latest 1.7.x nightly tarball and verify that the problem disappears for you? See http://www.open-mpi.org/nightly/v1.7/
>
> Ralph recently brought over a major ORTE control message change to the 1.7.x branch (after 1.7.3 was released) that -- skipping lots of details -- changes how the shared memory bootstrapping works. Based on the stack traces you sent and the ones I was also able to get, I'm thinking that Ralph's big ORTE change fixes this issue.
>
>
>
> On Nov 25, 2013, at 10:52 PM, Dominique Orban <dominique.orban_at_[hidden]> wrote:
>
>>
>> On 2013-11-25, at 9:02 PM, Ralph Castain <rhc.openmpi_at_[hidden]> wrote:
>>
>>> On Nov 25, 2013, at 5:04 PM, Pierre Jolivet <jolivet_at_[hidden]> wrote:
>>>
>>>>
>>>> On Nov 24, 2013, at 3:03 PM, Jed Brown <jedbrown_at_[hidden]> wrote:
>>>>
>>>>> Ralph Castain <rhc_at_[hidden]> writes:
>>>>>
>>>>>> Given that we have no idea what Homebrew uses, I don't know how we
>>>>>> could clarify/respond.
>>>>>
>>>>
>>>> Ralph, it is pretty easy to know what Homebrew uses, cf. https://github.com/mxcl/homebrew/blob/master/Library/Formula/open-mpi.rb (sorry if you meant something else).
>>>
>>> Might be a surprise, but I don't track all these guys :-)
>>>
>>> Homebrew is new to me
>>>
>>>>
>>>>> Pierre provided a link to MacPorts saying that all of the following
>>>>> options were needed to properly enable threads.
>>>>>
>>>>> --enable-event-thread-support --enable-opal-multi-threads --enable-orte-progress-threads --enable-mpi-thread-multiple
>>>>>
>>>>> If that is indeed the case, and if passing some subset of these options
>>>>> results in deadlock, it's not exactly user-friendly.
>>>>>
>>>>> Maybe --enable-mpi-thread-multiple is enough, in which case MacPorts is
>>>>> doing something needlessly complicated and Pierre's link was a red
>>>>> herring?
>>>>
>>>> That is very likely, though on the other hand, Homebrew is doing something pretty straightforward. I just wanted a quick and easy fix back when I had the same hanging issue, but there should be a better explanation if --enable-mpi-thread-multiple is indeed enough.
>>>
>>> It is enough - we set all required things internally
>>
>> Is that for sure? My original message originates from a hang in the PETSc tests and I get quite different results depending on whether I compile OpenMPI with --enable-mpi-thread-multiple only or not.
>>
>> I recompiled PETSc with debugging enabled against OpenMPI built with the "correct" flags mentioned by Pierre, and this is the stack trace I get:
>>
>> $ mpirun -n 2 xterm -e gdb ./ex5
>>
>> ^C
>> Program received signal SIGINT, Interrupt.
>> 0x00007fff991160fa in __psynch_cvwait ()
>> from /usr/lib/system/libsystem_kernel.dylib
>> (gdb) where
>> #0 0x00007fff991160fa in __psynch_cvwait ()
>> from /usr/lib/system/libsystem_kernel.dylib
>> #1 0x00007fff98d6ffb9 in ?? () from /usr/lib/system/libsystem_c.dylib
>>
>>
>> ^C
>> Program received signal SIGINT, Interrupt.
>> 0x00007fff991160fa in __psynch_cvwait ()
>> from /usr/lib/system/libsystem_kernel.dylib
>> (gdb) where
>> #0 0x00007fff991160fa in __psynch_cvwait ()
>> from /usr/lib/system/libsystem_kernel.dylib
>> #1 0x00007fff98d6ffb9 in ?? () from /usr/lib/system/libsystem_c.dylib
>>
>>
>> If I recompile PETSc against OpenMPI built with --enable-mpi-thread-multiple only (leaving out the other flags, which Pierre suggested is wrong), I get the following traces:
>>
>> ^C
>> Program received signal SIGINT, Interrupt.
>> 0x00007fff991160fa in __psynch_cvwait ()
>> from /usr/lib/system/libsystem_kernel.dylib
>> (gdb) where
>> #0 0x00007fff991160fa in __psynch_cvwait ()
>> from /usr/lib/system/libsystem_kernel.dylib
>> #1 0x00007fff98d6ffb9 in ?? () from /usr/lib/system/libsystem_c.dylib
>>
>>
>> ^C
>> Program received signal SIGINT, Interrupt.
>> 0x0000000101edca28 in mca_common_sm_init ()
>> from /usr/local/Cellar/open-mpi/1.7.3/lib/libmca_common_sm.4.dylib
>> (gdb) where
>> #0 0x0000000101edca28 in mca_common_sm_init ()
>> from /usr/local/Cellar/open-mpi/1.7.3/lib/libmca_common_sm.4.dylib
>> #1 0x0000000101ed8a38 in mca_mpool_sm_init ()
>> from /usr/local/Cellar/open-mpi/1.7.3/lib/openmpi/mca_mpool_sm.so
>> #2 0x0000000101c383fa in mca_mpool_base_module_create ()
>> from /usr/local/lib/libmpi.1.dylib
>> #3 0x0000000102933b41 in mca_btl_sm_add_procs ()
>> from /usr/local/Cellar/open-mpi/1.7.3/lib/openmpi/mca_btl_sm.so
>> #4 0x0000000102929dfb in mca_bml_r2_add_procs ()
>> from /usr/local/Cellar/open-mpi/1.7.3/lib/openmpi/mca_bml_r2.so
>> #5 0x000000010290a59c in mca_pml_ob1_add_procs ()
>> from /usr/local/Cellar/open-mpi/1.7.3/lib/openmpi/mca_pml_ob1.so
>> #6 0x0000000101bd859b in ompi_mpi_init () from /usr/local/lib/libmpi.1.dylib
>> #7 0x0000000101bf24da in MPI_Init_thread () from /usr/local/lib/libmpi.1.dylib
>> #8 0x00000001000724db in PetscInitialize (argc=0x7fff5fbfed48,
>> args=0x7fff5fbfed40, file=0x0,
>> help=0x1000061c0 "Bratu nonlinear PDE in 2d.\nWe solve the Bratu (SFI - soid fuel ignition) problem in a 2D rectangular\ndomain, using distributed arrays(DMDAs) to partition the parallel grid.\nThe command line options"...)
>> at /tmp/petsc-3.4.3/src/sys/objects/pinit.c:675
>> #9 0x0000000100000d8c in main ()
>>
>>
>> Line 675 of pinit.c is
>>
>> ierr = MPI_Init_thread(argc,args,MPI_THREAD_FUNNELED,&provided);CHKERRQ(ierr);
>>
>>
>> Dominique
>>
>>
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
>

Dominique