Subject: Re: [OMPI users] [EXTERNAL] Re: open-mpi on Mac OS 10.9 (Mavericks)
From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2013-12-03 09:54:25


Ok, I think we're chasing the same thing in multiple threads -- this looks like a similar result to the one you sent Ralph.

Let's keep the other thread (with Ralph) going; this looks like some kind of networking issue that we haven't seen before (e.g., being unable to open ports to the local host), which is a little odd. Let's run it down over in that thread.
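
If you want a quick way to test that theory independent of Open MPI, a standalone loopback check along these lines would do it -- this is just a hedged diagnostic sketch, not an Open MPI tool. It listens on an ephemeral TCP port on 127.0.0.1 and then connects back to it, which is roughly what the runtime needs to be able to do at startup:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void) {
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);

    /* Listen on an ephemeral port on the loopback interface. */
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;  /* let the kernel pick a free port */
    if (srv < 0 ||
        bind(srv, (struct sockaddr *) &addr, sizeof(addr)) != 0 ||
        listen(srv, 1) != 0 ||
        getsockname(srv, (struct sockaddr *) &addr, &len) != 0) {
        perror("loopback listen");
        return 1;
    }

    /* Connect back to ourselves over loopback. */
    int cli = socket(AF_INET, SOCK_STREAM, 0);
    if (cli < 0 || connect(cli, (struct sockaddr *) &addr, sizeof(addr)) != 0) {
        perror("loopback connect");
        return 1;
    }
    printf("loopback TCP connect to port %d worked\n", ntohs(addr.sin_port));
    close(cli);
    close(srv);
    return 0;
}

If the connect fails here too, the problem is in the OS networking configuration (e.g., a firewall blocking loopback) rather than in Open MPI.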

On Dec 3, 2013, at 7:44 AM, "Meredith, Karl" <karl.meredith_at_[hidden]> wrote:

> Using the latest nightly snapshot (1.7.4) and only Apple compilers/tools (no MacPorts), I configure/build with the following:
>
> ./configure --prefix=/opt/trunk/apple-only-1.7.4 --enable-shared --disable-static --enable-debug --disable-io-romio --enable-contrib-no-build=vt,libtrace --enable-mpirun-prefix-by-default
> make all
> make install
> export PATH=/opt/trunk/apple-only-1.7.4/bin/:$PATH
> export LD_LIBRARY_PATH=/opt/trunk/apple-only-1.7.4/lib:$LD_LIBRARY_PATH
> export DYLD_LIBRARY_PATH=/opt/trunk/apple-only-1.7.4/lib:$DYLD_LIBRARY_PATH
> cd examples
> make all
> mpirun -v -np 2 ./hello_cxx
>
> Here’s the stack trace for one of the hanging processes:
>
> (lldb) bt
> * thread #1: tid = 0x57052, 0x00007fff8c991a3a libsystem_kernel.dylib`__semwait_signal + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
> frame #0: 0x00007fff8c991a3a libsystem_kernel.dylib`__semwait_signal + 10
> frame #1: 0x00007fff8ade4e60 libsystem_c.dylib`nanosleep + 200
> frame #2: 0x0000000100be98e3 libopen-rte.6.dylib`orte_routed_base_register_sync(setup=true) + 2435 at routed_base_fns.c:344
> frame #3: 0x0000000100ecc3a7 mca_routed_binomial.so`init_routes(job=1305542657, ndat=0x0000000000000000) + 2759 at routed_binomial.c:708
> frame #4: 0x0000000100b9e84d libopen-rte.6.dylib`orte_ess_base_app_setup(db_restrict_local=true) + 2109 at ess_base_std_app.c:233
> frame #5: 0x0000000100e3a442 mca_ess_env.so`rte_init + 418 at ess_env_module.c:146
> frame #6: 0x0000000100b59cfe libopen-rte.6.dylib`orte_init(pargc=0x0000000000000000, pargv=0x0000000000000000, flags=32) + 718 at orte_init.c:158
> frame #7: 0x00000001008bd3c8 libmpi.1.dylib`ompi_mpi_init(argc=0, argv=0x0000000000000000, requested=0, provided=0x00007fff5f3cd370) + 616 at ompi_mpi_init.c:451
> frame #8: 0x000000010090b5c3 libmpi.1.dylib`MPI_Init(argc=0x0000000000000000, argv=0x0000000000000000) + 515 at init.c:86
> frame #9: 0x0000000100833a1d hello_cxx`MPI::Init() + 29 at functions_inln.h:128
> frame #10: 0x00000001008332ac hello_cxx`main(argc=1, argv=0x00007fff5f3cd550) + 44 at hello_cxx.cc:18
> frame #11: 0x00007fff8d5df5fd libdyld.dylib`start + 1
>
> Karl
>
>
> On Dec 2, 2013, at 2:33 PM, Jeff Squyres (jsquyres) <jsquyres_at_[hidden]> wrote:
>
>> Ah -- sorry, I missed this mail before I replied to the other thread (OS X Mail threaded them separately somehow...).
>>
>> Sorry to ask you to dive deeper, but can you find out where in orte_ess.init() it's failing? orte_ess.init is actually a function pointer; it's a jump-off point into a dlopen'ed plugin.
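>>
>> For context, the mechanism is roughly the following -- a sketch with illustrative names (the module symbol here is made up; it is not Open MPI's actual declaration): the framework dlopen's a component, looks up the module struct it exports, and later jumps through the struct's init function pointer.
>>
>> #include <stdio.h>
>> #include <dlfcn.h>
>>
>> /* Illustrative module type; the real ESS module has more entries. */
>> typedef struct {
>>     int (*init)(void);
>> } ess_module_t;
>>
>> int main(void) {
>>     /* Load a component the way a plugin framework would. */
>>     void *handle = dlopen("mca_ess_env.so", RTLD_NOW);
>>     if (NULL == handle) {
>>         fprintf(stderr, "dlopen failed: %s\n", dlerror());
>>         return 1;
>>     }
>>     /* Look up the module struct the plugin exports (symbol name made up). */
>>     ess_module_t *module = (ess_module_t *) dlsym(handle, "ess_module");
>>     if (NULL == module || NULL == module->init) {
>>         fprintf(stderr, "dlsym failed: %s\n", dlerror());
>>         return 1;
>>     }
>>     /* This call is the jump-off point into the plugin. */
>>     return module->init();
>> }
>>
>> So "where in orte_ess.init() it's failing" means: which plugin was selected, and where inside that plugin's init routine it stops.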
>>
>>
>> On Nov 25, 2013, at 11:53 AM, "Meredith, Karl" <karl.meredith_at_[hidden]> wrote:
>>
>>> Digging a little deeper by running the code in the lldb debugger, I found that the stall occurs in a call to orte_init from ompi_mpi_init.c:
>>> 356     /* Setup ORTE - note that we are an MPI process */
>>> 357     if (ORTE_SUCCESS != (ret = orte_init(NULL, NULL, ORTE_PROC_MPI))) {
>>> 358         error = "ompi_mpi_init: orte_init failed";
>>> 359         goto error;
>>> 360     }
>>>
>>> The code never returns from orte_init.
>>>
>>> It gets stuck in orte_ess.init() called from orte_init.c:
>>> 126     /* initialize the RTE for this environment */
>>> 127     if (ORTE_SUCCESS != (ret = orte_ess.init())) {
>>>
>>> When I step through orte_ess.init() in the lldb debugger, I actually get some output from the code (there is no output when running normally, outside the debugger):
>>> --------------------------------------------------------------------------
>>> It looks like MPI_INIT failed for some reason; your parallel process is
>>> likely to abort. There are many reasons that a parallel process can
>>> fail during MPI_INIT; some of which are due to configuration or environment
>>> problems. This failure appears to be an internal failure; here's some
>>> additional information (which may only be relevant to an Open MPI
>>> developer):
>>>
>>> ompi_mpi_init: orte_init failed
>>> --> Returned "Unable to start a daemon on the local node" (-128) instead of "Success" (0)
>>>
>>>
>>>
>>> Karl
>>>
>>>
>>>
>>> On Nov 25, 2013, at 9:20 AM, Meredith, Karl <karl.meredith_at_[hidden]> wrote:
>>>
>>>> Here are the hanging processes and the backtrace from lldb:
>>>> $ )ps -elf | grep hello
>>>> 1042653210 45231 45230 4006 0 31 0 2448976 2148 - S+ 0 ttys002 0:00.01 hello_cxx 9:07AM
>>>> 1042653210 45232 45230 4006 0 31 0 2457168 2156 - S+ 0 ttys002 0:00.04 hello_cxx 9:07AM
>>>>
>>>> (meredithk_at_meredithk-mac)-(09:15 AM Mon Nov 25)-(~/tools/openmpi-1.6.5/examples)
>>>> $ )lldb -p 45231
>>>> Attaching to process with:
>>>> process attach -p 45231
>>>> Process 45231 stopped
>>>> Executable module set to "/Users/meredithk/tools/openmpi-1.6.5/examples/hello_cxx".
>>>> Architecture set to: x86_64-apple-macosx.
>>>> (lldb) bt
>>>> * thread #1: tid = 0x168535, 0x00007fff8c1859aa libsystem_kernel.dylib`select$DARWIN_EXTSN + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
>>>> frame #0: 0x00007fff8c1859aa libsystem_kernel.dylib`select$DARWIN_EXTSN + 10
>>>> frame #1: 0x0000000106b73ea0 libmpi.1.dylib`select_dispatch(base=0x00007f84c3c0b430, arg=0x00007f84c3c0b3e0, tv=0x00007fff5924ca70) + 80 at select.c:174
>>>> frame #2: 0x0000000106b3eb0f libmpi.1.dylib`opal_event_base_loop(base=0x00007f84c3c0b430, flags=5) + 415 at event.c:838
>>>>
>>>> Both processes are in this state.
>>>>
>>>> Here’s the output from otool -L ./hello_cxx:
>>>>
>>>> $ )otool -L ./hello_cxx
>>>> ./hello_cxx:
>>>> /Users/meredithk/tools/openmpi/lib/libmpi_cxx.1.dylib (compatibility version 2.0.0, current version 2.2.0)
>>>> /Users/meredithk/tools/openmpi/lib/libmpi.1.dylib (compatibility version 2.0.0, current version 2.8.0)
>>>> /opt/local/lib/libgcc/libstdc++.6.dylib (compatibility version 7.0.0, current version 7.18.0)
>>>> /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1197.1.1)
>>>> /opt/local/lib/libgcc/libgcc_s.1.dylib (compatibility version 1.0.0, current version 1.0.0)
>>>>
>>>>
>>>> On Nov 25, 2013, at 9:14 AM, George Bosilca <bosilca_at_[hidden]> wrote:
>>>>
>>>>> Mac OS X 10.9 dropped support for gdb. Please report the output of lldb instead.
>>>>>
>>>>> Also, can you run “otool -L ./hello_cxx” and report the output?
>>>>>
>>>>> Thanks,
>>>>> George.
>>>>>
>>>>>
>>>>> On Nov 25, 2013, at 15:09, Meredith, Karl <karl.meredith_at_[hidden]> wrote:
>>>>>
>>>>>> I do have DYLD_LIBRARY_PATH set to the same paths as LD_LIBRARY_PATH. This does not resolve the problem. The code still hangs on MPI::Init().
>>>>>>
>>>>>> Another thing I tried: I recompiled Open MPI with the debug flags enabled:
>>>>>> ./configure --prefix=$HOME/tools/openmpi --enable-debug
>>>>>> make
>>>>>> make install
>>>>>>
>>>>>> Then I attached to the running process using gdb and tried a backtrace to see where it was hanging, but all I got was this:
>>>>>> Attaching to process 45231
>>>>>> Reading symbols from /Users/meredithk/tools/openmpi-1.6.5/examples/hello_cxx...Reading symbols from /Users/meredithk/tools/openmpi-1.6.5/examples/hello_cxx.dSYM/Contents/Resources/DWARF/hello_cxx...done.
>>>>>> done.
>>>>>> 0x00007fff8c1859aa in ?? ()
>>>>>> (gdb) bt
>>>>>> #0 0x00007fff8c1859aa in ?? ()
>>>>>> #1 0x0000000106b73ea0 in ?? ()
>>>>>> #2 0x706d6e65706f2f2f in ?? ()
>>>>>> #3 0x0000000000000001 in ?? ()
>>>>>> #4 0x0000000000000000 in ?? ()
>>>>>>
>>>>>> This output from gdb was not terribly helpful to me.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Nov 25, 2013, at 8:30 AM, Hammond, Simon David (-EXP) <sdhammo_at_[hidden]> wrote:
>>>>>>
>>>>>> We have occasionally had a problem like this when we set LD_LIBRARY_PATH only. On OS X you may need to set DYLD_LIBRARY_PATH instead (set it to the same lib directory).
>>>>>>
>>>>>> Can you try that and see if it resolves the problem?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Si Hammond
>>>>>> Sandia National Laboratories
>>>>>> Remote Connection
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Meredith, Karl [karl.meredith_at_[hidden]]
>>>>>> Sent: Monday, November 25, 2013 06:25 AM Mountain Standard Time
>>>>>> To: Open MPI Users
>>>>>> Subject: [EXTERNAL] Re: [OMPI users] open-mpi on Mac OS 10.9 (Mavericks)
>>>>>>
>>>>>>
>>>>>> I do have these two environment variables set:
>>>>>>
>>>>>> LD_LIBRARY_PATH=/Users/meredithk/tools/openmpi/lib
>>>>>> PATH=/Users/meredithk/tools/openmpi/bin
>>>>>>
>>>>>> Running mpirun seems to work fine with a simple command, like hostname:
>>>>>>
>>>>>> $ )mpirun -n 2 hostname
>>>>>> meredithk-mac.corp.fmglobal.com
>>>>>> meredithk-mac.corp.fmglobal.com
>>>>>>
>>>>>> I am trying to run the simple hello_cxx example from the Open MPI distribution, compiled as follows:
>>>>>> mpic++ -g hello_cxx.cc -o hello_cxx
>>>>>>
>>>>>> It compiles fine, without warning or error. However, when I run the example, it stalls on the MPI::Init() call:
>>>>>> mpirun -np 1 hello_cxx
>>>>>> It never errors out or crashes. It simply hangs.
>>>>>>
>>>>>> I am using mpic++ and mpirun from the same installation:
>>>>>> $ )which mpirun
>>>>>> /Users/meredithk/tools/openmpi/bin/mpirun
>>>>>>
>>>>>> $ )which mpic++
>>>>>> /Users/meredithk/tools/openmpi/bin/mpic++
>>>>>>
>>>>>> Not quite sure what else to check.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Nov 23, 2013, at 5:29 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>>
>>>>>>> Strange - I run on Mavericks now without problem. Can you run "mpirun -n 1 hostname"?
>>>>>>>
>>>>>>> You also might want to check your PATH and LD_LIBRARY_PATH to ensure you have the prefix where you installed OMPI 1.6.5 at the front. Apple distributes a very old version of OMPI with its software and you don't want to pick it up by mistake.
>>>>>>>
>>>>>>>
>>>>>>> On Nov 22, 2013, at 1:45 PM, Meredith, Karl <karl.meredith_at_[hidden]> wrote:
>>>>>>>
>>>>>>>> I recently upgraded my 2013 MacBook Pro (Retina display) from 10.8 to 10.9. I downloaded and installed openmpi-1.6.5 and compiled it with gcc 4.8 (gcc installed from MacPorts).
>>>>>>>> Open MPI compiled and installed without error.
>>>>>>>>
>>>>>>>> However, when I try to run any of the example test cases, the code gets stuck inside the first MPI::Init() call and never returns.
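>>>>>>>>
>>>>>>>> For reference, a minimal C equivalent of the test (a sketch, not the exact example shipped in the distribution) takes the C++ bindings out of the picture; if it also hangs in MPI_Init, the problem is below the bindings:
>>>>>>>>
>>>>>>>> #include <stdio.h>
>>>>>>>> #include <mpi.h>
>>>>>>>>
>>>>>>>> int main(int argc, char *argv[]) {
>>>>>>>>     int rank, size;
>>>>>>>>     MPI_Init(&argc, &argv);  /* the call that never returns */
>>>>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>>>>>     printf("Hello, world! I am %d of %d\n", rank, size);
>>>>>>>>     MPI_Finalize();
>>>>>>>>     return 0;
>>>>>>>> }
>>>>>>>>
>>>>>>>> (Compiled with mpicc -g and run the same way as the C++ example.)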
>>>>>>>>
>>>>>>>> Any thoughts on what might be going wrong?
>>>>>>>>
>>>>>>>> The same install on OS 10.8 works fine and the example test cases run without error.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/