
Subject: Re: [OMPI devel] [EXTERNAL] Re: 1.7.4rc2r30031 - FreeBSD-9 mpirun hangs
From: Barrett, Brian W (bwbarre_at_[hidden])
Date: 2013-12-20 19:03:01


I'm guessing this is related to the threading changes that went into ORTE
between 1.7.3 and 1.7.4. I'm building a FreeBSD VM to see if I can make
some progress on this, but I live in the land of slow bandwidth, so it
might not be for a couple of days.

Brian

On 12/20/13 5:00 PM, "Paul Hargrove" <phhargrove_at_[hidden]> wrote:

>FWIW:
>I've confirmed that this is a REGRESSION relative to 1.7.3, which works
>fine on FreeBSD-9.
>
>
>-Paul
>
>
>
>On Fri, Dec 20, 2013 at 3:30 PM, Paul Hargrove
><phhargrove_at_[hidden]> wrote:
>
>And the FreeBSD backtraces again, this time configured with
>--enable-debug and for all threads:
>
>
>The 100%-cpu ring_c process:
>
>
>(gdb) thread apply all where
>
>Thread 2 (Thread 802007400 (LWP 182916/ring_c)):
>#0 0x0000000800de7aac in sched_yield () from /lib/libc.so.7
>#1 0x00000008013c7a5a in opal_progress ()
>   at /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7-latest/opal/runtime/opal_progress.c:199
>#2 0x00000008008670ec in ompi_mpi_init (argc=1, argv=0x7fffffffd3e0, requested=0, provided=0x7fffffffd328)
>   at /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7-latest/ompi/runtime/ompi_mpi_init.c:618
>#3 0x000000080089aefe in PMPI_Init (argc=0x7fffffffd36c, argv=0x7fffffffd360) at pinit.c:84
>#4 0x0000000000400963 in main (argc=1, argv=0x7fffffffd3e0) at ring_c.c:19
>
>Thread 1 (Thread 802007800 (LWP 186415/ring_c)):
>#0 0x0000000800e2711c in poll () from /lib/libc.so.7
>#1 0x0000000800b727fe in poll () from /lib/libthr.so.3
>#2 0x000000080142edc1 in poll_dispatch (base=0x8020cd900, tv=0x0)
>   at /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7-latest/opal/mca/event/libevent2021/libevent/poll.c:165
>#3 0x0000000801422ca1 in opal_libevent2021_event_base_loop (base=0x8020cd900, flags=1)
>   at /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7-latest/opal/mca/event/libevent2021/libevent/event.c:1631
>#4 0x00000008010f2c22 in orte_progress_thread_engine (obj=0x80139b160)
>   at /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7-latest/orte/runtime/orte_init.c:180
>#5 0x0000000800b700a4 in pthread_getprio () from /lib/libthr.so.3
>#6 0x0000000000000000 in ?? ()
>Error accessing memory address 0x7fffffbfe000: Bad address.
>
>The idle ring_c process:
>
>
>(gdb) thread apply all where
>
>Thread 2 (Thread 802007400 (LWP 183983/ring_c)):
>#0 0x0000000800e6c44c in nanosleep () from /lib/libc.so.7
>#1 0x0000000800b729d5 in nanosleep () from /lib/libthr.so.3
>#2 0x0000000801161618 in orte_routed_base_register_sync (setup=true)
>   at /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7-latest/orte/mca/routed/base/routed_base_fns.c:344
>#3 0x0000000802a0a0a2 in init_routes (job=2628321281, ndat=0x0)
>   at /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7-latest/orte/mca/routed/binomial/routed_binomial.c:705
>#4 0x00000008011272ce in orte_ess_base_app_setup (db_restrict_local=true)
>   at /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7-latest/orte/mca/ess/base/ess_base_std_app.c:233
>#5 0x0000000802401408 in rte_init ()
>   at /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7-latest/orte/mca/ess/env/ess_env_module.c:146
>#6 0x00000008010f2b28 in orte_init (pargc=0x0, pargv=0x0, flags=32)
>   at /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7-latest/orte/runtime/orte_init.c:158
>#7 0x0000000800866bde in ompi_mpi_init (argc=1, argv=0x7fffffffd3e0, requested=0, provided=0x7fffffffd328)
>   at /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7-latest/ompi/runtime/ompi_mpi_init.c:451
>#8 0x000000080089aefe in PMPI_Init (argc=0x7fffffffd36c, argv=0x7fffffffd360) at pinit.c:84
>#9 0x0000000000400963 in main (argc=1, argv=0x7fffffffd3e0) at ring_c.c:19
>
>Thread 1 (Thread 802007800 (LWP 186412/ring_c)):
>#0 0x0000000800e2711c in poll () from /lib/libc.so.7
>#1 0x0000000800b727fe in poll () from /lib/libthr.so.3
>#2 0x000000080142edc1 in poll_dispatch (base=0x8020cd900, tv=0x0)
>   at /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7-latest/opal/mca/event/libevent2021/libevent/poll.c:165
>#3 0x0000000801422ca1 in opal_libevent2021_event_base_loop (base=0x8020cd900, flags=1)
>   at /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7-latest/opal/mca/event/libevent2021/libevent/event.c:1631
>#4 0x00000008010f2c22 in orte_progress_thread_engine (obj=0x80139b160)
>   at /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7-latest/orte/runtime/orte_init.c:180
>#5 0x0000000800b700a4 in pthread_getprio () from /lib/libthr.so.3
>#6 0x0000000000000000 in ?? ()
>Error accessing memory address 0x7fffffbfe000: Bad address.
>
>-Paul
>
>
>On Fri, Dec 20, 2013 at 2:59 PM, Paul Hargrove
><phhargrove_at_[hidden]> wrote:
>
>This case is not quite like my OpenBSD-5 report.
>On FreeBSD-9 I *can* run singletons, but "-np 2" hangs.
>
>
>The following hangs:
>$ mpirun -np 2 examples/ring_c
>
>
>
>The following complains about the "bogus" btl selection.
>So this is not the same as my problem with OpenBSD-5:
>$ mpirun -mca btl bogus -np 2 examples/ring_c
>[freebsd9-amd64.qemu:05926] mca: base: components_open: component pml / bfo open function failed
>[freebsd9-amd64.qemu:05926] mca: base: components_open: component pml / ob1 open function failed
>[freebsd9-amd64.qemu:05926] PML ob1 cannot be selected
>--------------------------------------------------------------------------
>A requested component was not found, or was unable to be opened. This
>means that this component is either not installed or is unable to be
>used on your system (e.g., sometimes this means that shared libraries
>that the component requires are unable to be found/loaded). Note that
>Open MPI stopped checking at the first component that it did not find.
>
>
>Host: freebsd9-amd64.qemu
>Framework: btl
>Component: bogus
>--------------------------------------------------------------------------
>--------------------------------------------------------------------------
>No available pml components were found!
>
>
>This means that there are no components of this type installed on your
>system or all the components reported that they could not be used.
>
>
>This is a fatal error; your MPI process is likely to abort. Check the
>output of the "ompi_info" command and ensure that components of this
>type are available on your system. You may also wish to check the
>value of the "component_path" MCA parameter and ensure that it has at
>least one directory that contains valid MCA components.
>--------------------------------------------------------------------------
>
>For the non-bogus case, "top" shows one idle and one active ring_c process:
> PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
> 5933 phargrov 2 29 0 98M 6384K select 1 0:32 100.00% ring_c
> 5931 phargrov 2 20 0 77844K 4856K select 0 0:00 0.00% orterun
> 5932 phargrov 2 24 0 51652K 4960K select 0 0:00 0.00% ring_c
>
>
>
>A backtrace for the 100%-cpu ring_c process:
>(gdb) where
>#0 0x0000000800d9811c in poll () from /lib/libc.so.7
>#1 0x0000000800ae37fe in poll () from /lib/libthr.so.3
>#2 0x00000008013259aa in poll_dispatch ()
>   from /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/INST/lib/libopen-pal.so.7
>#3 0x000000080131eb50 in opal_libevent2021_event_base_loop ()
>   from /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/INST/lib/libopen-pal.so.7
>#4 0x000000080106395d in orte_progress_thread_engine ()
>   from /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/INST/lib/libopen-rte.so.7
>#5 0x0000000800ae10a4 in pthread_getprio () from /lib/libthr.so.3
>#6 0x0000000000000000 in ?? ()
>Error accessing memory address 0x7fffffbfe000: Bad address.
>
>And for the idle ring_c process:
>(gdb) where
>#0 0x0000000800d9811c in poll () from /lib/libc.so.7
>#1 0x0000000800ae37fe in poll () from /lib/libthr.so.3
>#2 0x00000008013259aa in poll_dispatch ()
>   from /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/INST/lib/libopen-pal.so.7
>#3 0x000000080131eb50 in opal_libevent2021_event_base_loop ()
>   from /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/INST/lib/libopen-pal.so.7
>#4 0x000000080106395d in orte_progress_thread_engine ()
>   from /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/INST/lib/libopen-rte.so.7
>#5 0x0000000800ae10a4 in pthread_getprio () from /lib/libthr.so.3
>#6 0x0000000000000000 in ?? ()
>Error accessing memory address 0x7fffffbfe000: Bad address.
>
>They look to be the same, but I double-checked that these are correct.
>
>
>-Paul
>
>--
>Paul H. Hargrove PHHargrove_at_[hidden]
>Future Technologies Group
>Computer and Data Sciences Department Tel: +1-510-495-2352
>Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>
>
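
One more observation from Paul's mails above: singletons work, so
whatever is wrong is specific to launched jobs. The idle rank is sleeping
in orte_routed_base_register_sync(), which (if I remember the code right)
means it is still waiting for the sync ack from its local daemon, while
the busy rank got past that point and is spinning in opal_progress()
further along in ompi_mpi_init(). If someone on FreeBSD wants to poke at
this before my VM is up, the standard MCA verbosity knobs (nothing
FreeBSD-specific here) should show whether the daemon ever answers the
sync, e.g.:

$ mpirun -mca routed_base_verbose 10 -mca oob_base_verbose 10 -np 2 examples/ring_c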

--
  Brian W. Barrett
  Scalable System Software Group
  Sandia National Laboratories