
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] [EXTERNAL] 1.7.4rc2r30031 - OpenBSD-5 mpirun hangs
From: Paul Hargrove (phhargrove_at_[hidden])
Date: 2013-12-20 19:28:21


FYI: My Solaris-10/SPARC build finally finished and *does* appear to be
showing this same behavior.

-Paul

On Fri, Dec 20, 2013 at 4:15 PM, Ralph Castain <rhc_at_[hidden]> wrote:

> This is the same problem Jeff and I are looking at on Solaris - it
> requires a slow machine to make it appear. I'm investigating and think I
> know where the issue might lie (a timer firing to indicate a failed
> connection attempt, causing a race condition).
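>
> Roughly the pattern I suspect (purely an illustrative sketch with made-up
> names, not the actual ORTE code): the connect-timeout path and the
> connection-completion path both touch the same state without any
> synchronization, so on a slow machine the timer can win:
>
>   /* hypothetical sketch of the suspected race; compile with -lpthread */
>   #include <pthread.h>
>   #include <stdio.h>
>   #include <unistd.h>
>
>   typedef enum { CONNECTING, CONNECTED, FAILED } conn_state_t;
>   static conn_state_t state = CONNECTING;   /* shared, unsynchronized */
>
>   static void *connection_thread(void *arg)
>   {
>       (void)arg;
>       sleep(2);                  /* slow machine: handshake takes a while */
>       if (state == CONNECTING)   /* timer may fire between check and set */
>           state = CONNECTED;
>       return NULL;
>   }
>
>   static void *timeout_timer(void *arg)
>   {
>       (void)arg;
>       sleep(2);                  /* timeout tuned for a fast machine */
>       if (state != CONNECTED)    /* may observe stale state */
>           state = FAILED;        /* declares the attempt failed even though
>                                     it actually succeeded */
>       return NULL;
>   }
>
>   int main(void)
>   {
>       pthread_t conn, timer;
>       pthread_create(&conn, NULL, connection_thread, NULL);
>       pthread_create(&timer, NULL, timeout_timer, NULL);
>       pthread_join(conn, NULL);
>       pthread_join(timer, NULL);
>       printf("final state: %d\n", (int)state);
>       return 0;
>   }
>
> On a fast box the connection usually completes before the timer and you never
> see the problem; under load the timer fires first, the attempt is treated as
> failed, and the app just sits there - which would match what you are seeing.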
>
>
> On Dec 20, 2013, at 4:02 PM, Paul Hargrove <phhargrove_at_[hidden]> wrote:
>
> FWIW:
> I've confirmed that this is a REGRESSION relative to 1.7.2, which works
> fine on OpenBSD-5.
>
> I could not build 1.7.3 due to some of the issues fixed for 1.7.4rc in the
> past 24 hours.
> I am going to try back-porting the fix(es) to see whether 1.7.3 works or not.
>
> -Paul
>
>
> On Fri, Dec 20, 2013 at 3:16 PM, Paul Hargrove <phhargrove_at_[hidden]> wrote:
>
>> Below is the backtrace again, this time configured w/ --enable-debug and
>> for all threads.
>> -Paul
>>
>> Thread 2 (thread 1021110):
>> #0 0x00001bc0ef6c5e3a in nanosleep () at <stdin>:2
>> #1 0x00001bc0f317c2d4 in nanosleep (rqtp=0x7f7ffffbc900, rmtp=0x0)
>> at /usr/src/lib/librthread/rthread_cancel.c:274
>> #2 0x00001bc0f2cd4621 in orte_routed_base_register_sync (setup=true)
>> at
>> /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/openmpi-1.7-latest/orte/mca/routed/base/routed_base_fns.c:344
>> #3 0x00001bc0efc5d602 in init_routes (job=3563782145, ndat=0x0)
>> at
>> /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/openmpi-1.7-latest/orte/mca/routed/binomial/routed_binomial.c:705
>> #4 0x00001bc0f2c9c832 in orte_ess_base_app_setup (db_restrict_local=true)
>> at
>> /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/openmpi-1.7-latest/orte/mca/ess/base/ess_base_std_app.c:233
>> #5 0x00001bc0f39ea9ec in rte_init ()
>> at
>> /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/openmpi-1.7-latest/orte/mca/ess/env/ess_env_module.c:146
>> #6 0x00001bc0f2c68764 in orte_init (pargc=0x0, pargv=0x0, flags=32)
>> at
>> /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/openmpi-1.7-latest/orte/runtime/orte_init.c:158
>> #7 0x00001bc0f75061c5 in ompi_mpi_init (argc=1, argv=0x7f7ffffbced0,
>> requested=0, provided=0x7f7ffffbce38)
>> at
>> /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/openmpi-1.7-latest/ompi/runtime/ompi_mpi_init.c:451
>> #8 0x00001bc0f7544b96 in PMPI_Init (argc=0x7f7ffffbce6c,
>> argv=0x7f7ffffbce60) at pinit.c:84
>> #9 0x00001bbeec701093 in main (argc=1, argv=0x7f7ffffbced0) at
>> ring_c.c:19
>> Current language: auto; currently asm
>>
>> Thread 1 (thread 1023703):
>> #0 0x00001bc0ef6d68fa in poll () at <stdin>:2
>> #1 0x00001bc0f317c0fd in poll (fds=0x1bc0f9482d00, nfds=2, timeout=-1)
>> at /usr/src/lib/librthread/rthread_cancel.c:331
>> #2 0x00001bc0eebf47a8 in poll_dispatch (base=0x1bc0f5987400, tv=0x0)
>> at
>> /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/openmpi-1.7-latest/opal/mca/event/libevent2021/libevent/poll.c:165
>> #3 0x00001bc0eebe8314 in opal_libevent2021_event_base_loop
>> (base=0x1bc0f5987400, flags=1)
>> at
>> /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/openmpi-1.7-latest/opal/mca/event/libevent2021/libevent/event.c:1631
>> #4 0x00001bc0f2c68855 in orte_progress_thread_engine (obj=0x1bc0f310e160)
>> at
>> /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/openmpi-1.7-latest/orte/runtime/orte_init.c:180
>> #5 0x00001bc0f317911e in _rthread_start (v=Variable "v" is not available.
>> ) at /usr/src/lib/librthread/rthread.c:122
>> #6 0x00001bc0ef6c003b in __tfork_thread () at
>> /usr/src/lib/libc/arch/amd64/sys/tfork_thread.S:75
>> Cannot access memory at address 0x1bc0f857c000
>>
>>
>>
>> On Fri, Dec 20, 2013 at 2:48 PM, Paul Hargrove <phhargrove_at_[hidden]> wrote:
>>
>>> Brian,
>>>
>>> Of course, I should have thought of that myself.
>>> See below for backtrace from a singleton run.
>>>
>>> I'm starting an --enable-debug build to maybe get some line number info
>>> too.
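>>>
>>> (For the record, that rebuild is just the original configure line with
>>> --enable-debug added, roughly:
>>>
>>>   $ ./configure --prefix=/home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/INST --enable-debug
>>>   $ make all install
>>>
>>> followed by re-running the same test.)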
>>>
>>> -Paul
>>>
>>> (gdb) where
>>> #0 0x00000406457a9e3a in nanosleep () at <stdin>:2
>>> #1 0x000004063947e2d4 in nanosleep (rqtp=0x7f7ffffeca30, rmtp=0x0)
>>> at /usr/src/lib/librthread/rthread_cancel.c:274
>>> #2 0x0000040644a5a89b in orte_routed_base_register_sync ()
>>> from
>>> /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/INST/lib/libopen-rte.so.7.0
>>> #3 0x00000406490d943c in init_routes ()
>>> from
>>> /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/INST/lib/openmpi/mca_routed_binomial.so
>>> #4 0x0000040644a3c37f in orte_ess_base_app_setup ()
>>> from
>>> /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/INST/lib/libopen-rte.so.7.0
>>> #5 0x000004063eb1797d in rte_init ()
>>> from
>>> /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/INST/lib/openmpi/mca_ess_env.so
>>> #6 0x0000040644a1a3fe in orte_init ()
>>> from
>>> /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/INST/lib/libopen-rte.so.7.0
>>> #7 0x00000406482c7976 in ompi_mpi_init ()
>>> from
>>> /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/INST/lib/libmpi.so.4.0
>>> #8 0x00000406482eac92 in PMPI_Init ()
>>> from
>>> /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/INST/lib/libmpi.so.4.0
>>> #9 0x0000040438c01093 in main (argc=1, argv=0x7f7ffffece60) at
>>> ring_c.c:19
>>> Current language: auto; currently asm
>>>
>>>
>>>
>>> On Fri, Dec 20, 2013 at 2:38 PM, Barrett, Brian W <bwbarre_at_[hidden]> wrote:
>>>
>>>> Paul -
>>>>
>>>> Any chance you could grab a stack trace from the MPI app? That's
>>>> probably the fastest next step.
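>>>>
>>>> For example, something along these lines against the hung rank (PID 31841
>>>> for ring_c in your top output) should dump every thread, assuming gdb
>>>> cooperates on OpenBSD:
>>>>
>>>>   $ gdb --pid 31841 -batch -ex 'thread apply all bt'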
>>>>
>>>> Brian
>>>>
>>>> -----Original Message-----
>>>> From: Paul Hargrove [phhargrove_at_[hidden]]
>>>> Sent: Friday, December 20, 2013 03:33 PM Mountain Standard Time
>>>> To: Open MPI Developers
>>>> Subject: [EXTERNAL] [OMPI devel] 1.7.4rc2r30031 - OpenBSD-5 mpirun hangs
>>>>
>>>> With plenty of help from Jeff's and Ralph's bug fixes in the past 24
>>>> hours, I can now build OMPI for OpenBSD-5. However, running even a simple
>>>> example fails:
>>>>
>>>> Having set PATH and LD_LIBRARY_PATH:
>>>> $ mpirun -np 1 examples/ring_c
>>>> just hangs
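>>>>
>>>> (Concretely, the environment setup was along the lines of
>>>>
>>>>   $ export PATH=/home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/INST/bin:$PATH
>>>>   $ export LD_LIBRARY_PATH=/home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/INST/lib:$LD_LIBRARY_PATH
>>>>
>>>> using the install prefix shown below.)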
>>>>
>>>> Output from "top" shows idle procs:
>>>>   PID USERNAME PRI NICE  SIZE   RES STATE   WAIT     TIME    CPU COMMAND
>>>> 31841 phargrov  10    0 2140K 3960K sleep/1 nanosle  0:00  0.00% ring_c
>>>> 13490 phargrov   2    0 2540K 4892K sleep/1 poll     0:00  0.00% orterun
>>>>
>>>> Distrusting the env vars and relying instead on the auto-prefix
>>>> behavior:
>>>> $ /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/INST/bin/mpirun
>>>> -np 1 examples/ring_c
>>>> also hangs
>>>>
>>>> Not sure exactly what to infer from this, but a "bogus" btl doesn't
>>>> produce any complaint, which may indicate how far startup got:
>>>> $ mpirun -mca btl bogus -np 1 examples/ring_c
>>>> Still hangs, and no complaint about the btl selection.
>>>>
>>>> All three cases above are singleton (-np 1) runs, but the behavior
>>>> with "-np 2" is the same.
>>>>
>>>> This does NOT appear to be an ORTE problem:
>>>> -bash-4.2$ orterun -np 1 date
>>>> Fri Dec 20 14:11:42 PST 2013
>>>> -bash-4.2$ orterun -np 2 date
>>>> Fri Dec 20 14:11:45 PST 2013
>>>> Fri Dec 20 14:11:45 PST 2013
>>>>
>>>> Let me know what sort of verbose mca parameters to set and I'll
>>>> collect the info.
>>>> Compressed output of "ompi_info --all" is attached.
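>>>>
>>>> For instance, assuming the usual <framework>_base_verbose parameters are
>>>> what's wanted, I could re-run with something like:
>>>>
>>>>   $ mpirun -mca oob_base_verbose 10 -mca rml_base_verbose 10 \
>>>>            -mca routed_base_verbose 10 -np 1 examples/ring_c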
>>>>
>>>> -Paul
>>>>
>>>> --
>>>> Paul H. Hargrove PHHargrove_at_[hidden]
>>>> Future Technologies Group
>>>> Computer and Data Sciences Department Tel: +1-510-495-2352
>>>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Paul H. Hargrove PHHargrove_at_[hidden]
>>> Future Technologies Group
>>> Computer and Data Sciences Department Tel: +1-510-495-2352
>>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>>>
>>
>>
>>
>> --
>> Paul H. Hargrove PHHargrove_at_[hidden]
>> Future Technologies Group
>> Computer and Data Sciences Department Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>>
>
>
>
> --
> Paul H. Hargrove PHHargrove_at_[hidden]
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>

-- 
Paul H. Hargrove                          PHHargrove_at_[hidden]
Future Technologies Group
Computer and Data Sciences Department     Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900