Open MPI Development Mailing List Archives

This page is part of a frozen web archive of this mailing list.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.

Subject: Re: [OMPI devel] [EXTERNAL] 1.7.4rc2r30031 - OpenBSD-5 mpirun hangs
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-12-20 19:15:59


This is the same problem Jeff and I are looking at on Solaris; it only shows up on a slow machine. I'm investigating and think I know where the issue might lie: a timer that fires to indicate a failed connection attempt and causes a race condition.
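
The kind of race suspected here can be sketched in a few lines of C. This is only an illustration under stated assumptions, not code from ORTE or libevent: the peer_t type, the state names, and every function below are hypothetical. It replays one ordering, where the timeout handler runs after the connection has already completed (as can happen on a slow machine when the expiry was queued before the timer was disarmed), and compares a handler that blindly marks the peer failed with one that first checks the peer's state.

```c
#include <stdbool.h>

/* Hypothetical connection states; the names are illustrative only. */
typedef enum { PEER_CONNECTING, PEER_CONNECTED, PEER_FAILED } peer_state_t;

typedef struct {
    peer_state_t state;
    bool timer_armed;   /* true while a connect-timeout timer is pending */
} peer_t;

/* Start a connection attempt and arm its timeout timer. */
void peer_start_connect(peer_t *p)
{
    p->state = PEER_CONNECTING;
    p->timer_armed = true;
}

/* The connection completed: record success and disarm the timer. */
void peer_on_connected(peer_t *p)
{
    p->state = PEER_CONNECTED;
    p->timer_armed = false;
}

/* Unguarded handler: assumes that if it runs, the attempt has failed. */
void peer_on_timeout_unguarded(peer_t *p)
{
    p->state = PEER_FAILED;
}

/* Guarded handler: acts only if the attempt is still in progress. */
void peer_on_timeout_guarded(peer_t *p)
{
    if (p->timer_armed && p->state == PEER_CONNECTING) {
        p->state = PEER_FAILED;
    }
    p->timer_armed = false;
}

/* Replay the problematic ordering: the timer expiry was already queued
 * when the connection completed, so its handler still runs afterwards. */
peer_state_t replay_late_timeout(bool guarded)
{
    peer_t p;

    peer_start_connect(&p);
    peer_on_connected(&p);
    if (guarded) {
        peer_on_timeout_guarded(&p);
    } else {
        peer_on_timeout_unguarded(&p);
    }
    return p.state;
}
```

With the unguarded handler, replay_late_timeout(false) leaves a healthy connection marked PEER_FAILED; with the guarded handler, replay_late_timeout(true) leaves it at PEER_CONNECTED. Whether the actual fix takes this form is not established in this thread.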

On Dec 20, 2013, at 4:02 PM, Paul Hargrove <phhargrove_at_[hidden]> wrote:

> FWIW:
> I've confirmed that this is a REGRESSION relative to 1.7.2, which works fine on OpenBSD-5.
>
> I could not build 1.7.3 due to some of the issues fixed for 1.7.4rc in the past 24 hours.
> I am going to try back-porting the fix(es) to see whether 1.7.3 works or not.
>
> -Paul
>
>
> On Fri, Dec 20, 2013 at 3:16 PM, Paul Hargrove <phhargrove_at_[hidden]> wrote:
> Below is the backtrace again, this time configured w/ --enable-debug and for all threads.
> -Paul
>
> Thread 2 (thread 1021110):
> #0 0x00001bc0ef6c5e3a in nanosleep () at <stdin>:2
> #1 0x00001bc0f317c2d4 in nanosleep (rqtp=0x7f7ffffbc900, rmtp=0x0)
> at /usr/src/lib/librthread/rthread_cancel.c:274
> #2 0x00001bc0f2cd4621 in orte_routed_base_register_sync (setup=true)
> at /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/openmpi-1.7-latest/orte/mca/routed/base/routed_base_fns.c:344
> #3 0x00001bc0efc5d602 in init_routes (job=3563782145, ndat=0x0)
> at /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/openmpi-1.7-latest/orte/mca/routed/binomial/routed_binomial.c:705
> #4 0x00001bc0f2c9c832 in orte_ess_base_app_setup (db_restrict_local=true)
> at /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/openmpi-1.7-latest/orte/mca/ess/base/ess_base_std_app.c:233
> #5 0x00001bc0f39ea9ec in rte_init ()
> at /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/openmpi-1.7-latest/orte/mca/ess/env/ess_env_module.c:146
> #6 0x00001bc0f2c68764 in orte_init (pargc=0x0, pargv=0x0, flags=32)
> at /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/openmpi-1.7-latest/orte/runtime/orte_init.c:158
> #7 0x00001bc0f75061c5 in ompi_mpi_init (argc=1, argv=0x7f7ffffbced0, requested=0, provided=0x7f7ffffbce38)
> at /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/openmpi-1.7-latest/ompi/runtime/ompi_mpi_init.c:451
> #8 0x00001bc0f7544b96 in PMPI_Init (argc=0x7f7ffffbce6c, argv=0x7f7ffffbce60) at pinit.c:84
> #9 0x00001bbeec701093 in main (argc=1, argv=0x7f7ffffbced0) at ring_c.c:19
> Current language: auto; currently asm
>
> Thread 1 (thread 1023703):
> #0 0x00001bc0ef6d68fa in poll () at <stdin>:2
> #1 0x00001bc0f317c0fd in poll (fds=0x1bc0f9482d00, nfds=2, timeout=-1)
> at /usr/src/lib/librthread/rthread_cancel.c:331
> #2 0x00001bc0eebf47a8 in poll_dispatch (base=0x1bc0f5987400, tv=0x0)
> at /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/openmpi-1.7-latest/opal/mca/event/libevent2021/libevent/poll.c:165
> #3 0x00001bc0eebe8314 in opal_libevent2021_event_base_loop (base=0x1bc0f5987400, flags=1)
> at /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/openmpi-1.7-latest/opal/mca/event/libevent2021/libevent/event.c:1631
> #4 0x00001bc0f2c68855 in orte_progress_thread_engine (obj=0x1bc0f310e160)
> at /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/openmpi-1.7-latest/orte/runtime/orte_init.c:180
> #5 0x00001bc0f317911e in _rthread_start (v=Variable "v" is not available.
> ) at /usr/src/lib/librthread/rthread.c:122
> #6 0x00001bc0ef6c003b in __tfork_thread () at /usr/src/lib/libc/arch/amd64/sys/tfork_thread.S:75
> Cannot access memory at address 0x1bc0f857c000
>
>
>
> On Fri, Dec 20, 2013 at 2:48 PM, Paul Hargrove <phhargrove_at_[hidden]> wrote:
> Brian,
>
> Of course, I should have thought of that myself.
> See below for backtrace from a singleton run.
>
> I'm starting an --enable-debug build to maybe get some line number info too.
>
> -Paul
>
> (gdb) where
> #0 0x00000406457a9e3a in nanosleep () at <stdin>:2
> #1 0x000004063947e2d4 in nanosleep (rqtp=0x7f7ffffeca30, rmtp=0x0)
> at /usr/src/lib/librthread/rthread_cancel.c:274
> #2 0x0000040644a5a89b in orte_routed_base_register_sync ()
> from /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/INST/lib/libopen-rte.so.7.0
> #3 0x00000406490d943c in init_routes ()
> from /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/INST/lib/openmpi/mca_routed_binomial.so
> #4 0x0000040644a3c37f in orte_ess_base_app_setup ()
> from /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/INST/lib/libopen-rte.so.7.0
> #5 0x000004063eb1797d in rte_init ()
> from /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/INST/lib/openmpi/mca_ess_env.so
> #6 0x0000040644a1a3fe in orte_init ()
> from /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/INST/lib/libopen-rte.so.7.0
> #7 0x00000406482c7976 in ompi_mpi_init ()
> from /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/INST/lib/libmpi.so.4.0
> #8 0x00000406482eac92 in PMPI_Init ()
> from /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/INST/lib/libmpi.so.4.0
> #9 0x0000040438c01093 in main (argc=1, argv=0x7f7ffffece60) at ring_c.c:19
> Current language: auto; currently asm
>
>
>
> On Fri, Dec 20, 2013 at 2:38 PM, Barrett, Brian W <bwbarre_at_[hidden]> wrote:
> Paul -
>
> Any chance you could grab a stack trace from the MPI app? That's probably the fastest next step.
>
> Brian
>
>
>
> -----Original Message-----
> From: Paul Hargrove [phhargrove_at_[hidden]]
> Sent: Friday, December 20, 2013 03:33 PM Mountain Standard Time
> To: Open MPI Developers
> Subject: [EXTERNAL] [OMPI devel] 1.7.4rc2r30031 - OpenBSD-5 mpirun hangs
>
> With plenty of help from Jeff and Ralph's bug fixes in the past 24 hours, I can now build OMPI for OpenBSD. However, running even a simple example fails:
>
> Having set PATH and LD_LIBRARY_PATH:
> $ mpirun -np 1 examples/ring_c
> just hangs
>
> Output from "top" shows idle procs:
>   PID USERNAME PRI NICE  SIZE   RES STATE   WAIT    TIME   CPU COMMAND
> 31841 phargrov  10    0 2140K 3960K sleep/1 nanosle 0:00 0.00% ring_c
> 13490 phargrov   2    0 2540K 4892K sleep/1 poll    0:00 0.00% orterun
>
> Distrusting the env vars and relying instead on the auto-prefix behavior:
> $ /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/INST/bin/mpirun -np 1 examples/ring_c
> also hangs
>
> Not sure exactly what to infer from this, but a "bogus" btl doesn't produce any complaint, which may indicate how far startup got:
> $ mpirun -mca btl bogus -np 1 examples/ring_c
> Still hangs, and no complaint about the btl selection
>
> All three cases above are singleton (-np 1) runs, but the behavior with "-np 2" is the same.
>
> This does NOT appear to be an ORTE problem:
> -bash-4.2$ orterun -np 1 date
> Fri Dec 20 14:11:42 PST 2013
> -bash-4.2$ orterun -np 2 date
> Fri Dec 20 14:11:45 PST 2013
> Fri Dec 20 14:11:45 PST 2013
>
> Let me know what sort of verbose mca parameters to set and I'll collect the info.
> Compressed output of "ompi_info --all" is attached.
>
> -Paul
>
> --
> Paul H. Hargrove PHHargrove_at_[hidden]
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>