Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] [EXTERNAL] Re: 1.7.4rc2r30031 - FreeBSD-9 mpirun hangs
From: Barrett, Brian W (bwbarre_at_[hidden])
Date: 2013-12-20 19:03:01


I'm guessing that this is related to the threading changes that came with
some ORTE changes between 1.7.3 and 1.7.4. I'm building a FreeBSD VM to
see if I can make some progress on that, but I live in the land of slow
bandwidth, so it might not be for a couple days.
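
In case it helps anyone comparing the two processes thread by thread, here is a minimal Python sketch (a hypothetical helper, not part of Open MPI) that takes the text of a gdb "thread apply all where" dump and reports, for each thread, the innermost frame whose function name starts with opal_, orte_, or ompi_. The name innermost_frames and the prefix list are assumptions made for illustration only.

```python
import re

# "Thread 2 (Thread 802007400 (LWP 182916/ring_c)):"
THREAD_RE = re.compile(r"^Thread (\d+) \(")
# "#1 0x00000008013c7a5a in opal_progress ()"; the address part is optional
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+ in )?(\S+)")


def innermost_frames(gdb_output, prefixes=("opal_", "orte_", "ompi_")):
    """Map gdb thread number -> innermost frame whose function name
    starts with one of `prefixes` (an assumed list, adjust as needed)."""
    result = {}
    current = None
    for raw in gdb_output.splitlines():
        # Tolerate mail quoting: drop leading '>' markers and spaces.
        line = raw.lstrip("> ").strip()
        thread = THREAD_RE.match(line)
        if thread:
            current = int(thread.group(1))
            continue
        frame = FRAME_RE.match(line)
        if frame and current is not None and current not in result:
            func = frame.group(2)
            if func.startswith(prefixes):
                result[current] = func
    return result


if __name__ == "__main__":
    import sys
    # Usage: python frames.py < gdb_dump.txt
    for thread_id, func in sorted(innermost_frames(sys.stdin.read()).items()):
        print(f"thread {thread_id}: {func}")
```

Fed the dump from the 100%-cpu process quoted below, this should pair the main thread with opal_progress and the progress thread with opal_libevent2021_event_base_loop.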

Brian

On 12/20/13 5:00 PM, "Paul Hargrove" <phhargrove_at_[hidden]> wrote:

>FWIW:
>I've confirmed that this is a REGRESSION relative to 1.7.3, which works
>fine on FreeBSD-9.
>
>
>-Paul
>
>
>
>On Fri, Dec 20, 2013 at 3:30 PM, Paul Hargrove
><phhargrove_at_[hidden]> wrote:
>
>And the FreeBSD backtraces again, this time configured with
>--enable-debug and for all threads:
>
>
>The 100%-cpu ring_c process:
>
>
>(gdb) thread apply all where
>
>
>Thread 2 (Thread 802007400 (LWP 182916/ring_c)):
>#0 0x0000000800de7aac in sched_yield () from /lib/libc.so.7
>#1 0x00000008013c7a5a in opal_progress ()
> at /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7-latest/opal/runtime/opal_progress.c:199
>#2 0x00000008008670ec in ompi_mpi_init (argc=1, argv=0x7fffffffd3e0, requested=0, provided=0x7fffffffd328)
> at /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7-latest/ompi/runtime/ompi_mpi_init.c:618
>#3 0x000000080089aefe in PMPI_Init (argc=0x7fffffffd36c, argv=0x7fffffffd360) at pinit.c:84
>#4 0x0000000000400963 in main (argc=1, argv=0x7fffffffd3e0) at ring_c.c:19
>
>
>Thread 1 (Thread 802007800 (LWP 186415/ring_c)):
>#0 0x0000000800e2711c in poll () from /lib/libc.so.7
>#1 0x0000000800b727fe in poll () from /lib/libthr.so.3
>#2 0x000000080142edc1 in poll_dispatch (base=0x8020cd900, tv=0x0)
> at /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7-latest/opal/mca/event/libevent2021/libevent/poll.c:165
>#3 0x0000000801422ca1 in opal_libevent2021_event_base_loop (base=0x8020cd900, flags=1)
> at /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7-latest/opal/mca/event/libevent2021/libevent/event.c:1631
>#4 0x00000008010f2c22 in orte_progress_thread_engine (obj=0x80139b160)
> at /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7-latest/orte/runtime/orte_init.c:180
>#5 0x0000000800b700a4 in pthread_getprio () from /lib/libthr.so.3
>#6 0x0000000000000000 in ?? ()
>Error accessing memory address 0x7fffffbfe000: Bad address.
>
>
>
>
>
>The idle ring_c process:
>
>
>(gdb) thread apply all where
>
>
>Thread 2 (Thread 802007400 (LWP 183983/ring_c)):
>#0 0x0000000800e6c44c in nanosleep () from /lib/libc.so.7
>#1 0x0000000800b729d5 in nanosleep () from /lib/libthr.so.3
>#2 0x0000000801161618 in orte_routed_base_register_sync (setup=true)
> at /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7-latest/orte/mca/routed/base/routed_base_fns.c:344
>#3 0x0000000802a0a0a2 in init_routes (job=2628321281, ndat=0x0)
> at /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7-latest/orte/mca/routed/binomial/routed_binomial.c:705
>#4 0x00000008011272ce in orte_ess_base_app_setup (db_restrict_local=true)
> at /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7-latest/orte/mca/ess/base/ess_base_std_app.c:233
>#5 0x0000000802401408 in rte_init ()
> at /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7-latest/orte/mca/ess/env/ess_env_module.c:146
>#6 0x00000008010f2b28 in orte_init (pargc=0x0, pargv=0x0, flags=32)
> at /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7-latest/orte/runtime/orte_init.c:158
>#7 0x0000000800866bde in ompi_mpi_init (argc=1, argv=0x7fffffffd3e0, requested=0, provided=0x7fffffffd328)
> at /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7-latest/ompi/runtime/ompi_mpi_init.c:451
>#8 0x000000080089aefe in PMPI_Init (argc=0x7fffffffd36c, argv=0x7fffffffd360) at pinit.c:84
>#9 0x0000000000400963 in main (argc=1, argv=0x7fffffffd3e0) at ring_c.c:19
>
>
>Thread 1 (Thread 802007800 (LWP 186412/ring_c)):
>#0 0x0000000800e2711c in poll () from /lib/libc.so.7
>#1 0x0000000800b727fe in poll () from /lib/libthr.so.3
>#2 0x000000080142edc1 in poll_dispatch (base=0x8020cd900, tv=0x0)
> at /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7-latest/opal/mca/event/libevent2021/libevent/poll.c:165
>#3 0x0000000801422ca1 in opal_libevent2021_event_base_loop (base=0x8020cd900, flags=1)
> at /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7-latest/opal/mca/event/libevent2021/libevent/event.c:1631
>#4 0x00000008010f2c22 in orte_progress_thread_engine (obj=0x80139b160)
> at /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7-latest/orte/runtime/orte_init.c:180
>#5 0x0000000800b700a4 in pthread_getprio () from /lib/libthr.so.3
>#6 0x0000000000000000 in ?? ()
>Error accessing memory address 0x7fffffbfe000: Bad address.
>
>
>
>
>
>-Paul
>
>
>On Fri, Dec 20, 2013 at 2:59 PM, Paul Hargrove
><phhargrove_at_[hidden]> wrote:
>
>This case is not quite like my OpenBSD-5 report.
>On FreeBSD-9 I *can* run singletons, but "-np 2" hangs.
>
>
>The following hangs:
>$ mpirun -np 2 examples/ring_c
>
>
>
>The following complains about the "bogus" btl selection.
>So this is not the same as my problem with OpenBSD-5:
>$ mpirun -mca btl bogus -np 2 examples/ring_c
>[freebsd9-amd64.qemu:05926] mca: base: components_open: component pml /
>bfo open function failed
>[freebsd9-amd64.qemu:05926] mca: base: components_open: component pml /
>ob1 open function failed
>[freebsd9-amd64.qemu:05926] PML ob1 cannot be selected
>--------------------------------------------------------------------------
>A requested component was not found, or was unable to be opened. This
>means that this component is either not installed or is unable to be
>used on your system (e.g., sometimes this means that shared libraries
>that the component requires are unable to be found/loaded). Note that
>Open MPI stopped checking at the first component that it did not find.
>
>
>Host: freebsd9-amd64.qemu
>Framework: btl
>Component: bogus
>--------------------------------------------------------------------------
>--------------------------------------------------------------------------
>No available pml components were found!
>
>
>This means that there are no components of this type installed on your
>system or all the components reported that they could not be used.
>
>
>This is a fatal error; your MPI process is likely to abort. Check the
>output of the "ompi_info" command and ensure that components of this
>type are available on your system. You may also wish to check the
>value of the "component_path" MCA parameter and ensure that it has at
>least one directory that contains valid MCA components.
>--------------------------------------------------------------------------
>
>
>
>
>
>For the non-bogus case, "top" shows one idle and one active ring_c process:
> PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
> 5933 phargrov 2 29 0 98M 6384K select 1 0:32 100.00% ring_c
> 5931 phargrov 2 20 0 77844K 4856K select 0 0:00 0.00% orterun
> 5932 phargrov 2 24 0 51652K 4960K select 0 0:00 0.00% ring_c
>
>
>
>A backtrace for the 100%-cpu ring_c process:
>(gdb) where
>#0 0x0000000800d9811c in poll () from /lib/libc.so.7
>#1 0x0000000800ae37fe in poll () from /lib/libthr.so.3
>#2 0x00000008013259aa in poll_dispatch ()
> from /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/INST/lib/libopen-pal.so.7
>#3 0x000000080131eb50 in opal_libevent2021_event_base_loop ()
> from /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/INST/lib/libopen-pal.so.7
>#4 0x000000080106395d in orte_progress_thread_engine ()
> from /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/INST/lib/libopen-rte.so.7
>#5 0x0000000800ae10a4 in pthread_getprio () from /lib/libthr.so.3
>#6 0x0000000000000000 in ?? ()
>Error accessing memory address 0x7fffffbfe000: Bad address.
>
>
>
>
>
>And for the idle ring_c process:
>(gdb) where
>#0 0x0000000800d9811c in poll () from /lib/libc.so.7
>#1 0x0000000800ae37fe in poll () from /lib/libthr.so.3
>#2 0x00000008013259aa in poll_dispatch ()
> from /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/INST/lib/libopen-pal.so.7
>#3 0x000000080131eb50 in opal_libevent2021_event_base_loop ()
> from /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/INST/lib/libopen-pal.so.7
>#4 0x000000080106395d in orte_progress_thread_engine ()
> from /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/INST/lib/libopen-rte.so.7
>#5 0x0000000800ae10a4 in pthread_getprio () from /lib/libthr.so.3
>#6 0x0000000000000000 in ?? ()
>Error accessing memory address 0x7fffffbfe000: Bad address.
>
>
>
>
>
>They look to be the same, but I double checked that these are correct.
>
>
>-Paul
>
>
>
>--
>Paul H. Hargrove PHHargrove_at_[hidden]
>Future Technologies Group
>Computer and Data Sciences Department Tel: +1-510-495-2352
>Lawrence Berkeley National Laboratory Fax: +1-510-486-6900

--
  Brian W. Barrett
  Scalable System Software Group
  Sandia National Laboratories