Open MPI User's Mailing List Archives
Subject: Re: [OMPI users] After upgrading to 1.3.2 some nodes hang on MPI-Applications
From: jody (jody.xha_at_[hidden])
Date: 2009-06-11 06:56:38


More info:
I checked and found that not all nodes are equal:
the ones that don't work have mpi-threads *and* progress-threads enabled,
whereas the ones that work have only mpi-threads enabled.

Is there a problem when both thread types are enabled?
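(For reference: one quick way to compare how each node was built is ompi_info. On a 1.3.x install, something along the lines of

  $ ompi_info | grep -i thread
            Thread support: posix (mpi: yes, progress: yes)

shows whether MPI threads and progress threads were compiled in; the exact wording may differ between versions, and a node built without --enable-progress-threads should report "progress: no".)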

Jody

On Thu, Jun 11, 2009 at 12:19 PM, jody<jody.xha_at_[hidden]> wrote:
> Hi
>
> After updating all my nodes to Open MPI 1.3.2 (with
> --enable-mpi-threads), some of them fail to execute a simple MPI test
> program - they seem to hang.
> With --debug-daemons the application seems to execute (two lines of
> output appear), but it hangs before returning:
>
> [jody_at_aplankton neander]$ mpirun -np 2 --host nano_06 --debug-daemons ./MPITest
> Daemon was launched on nano_06 - beginning to initialize
> Daemon [[44301,0],1] checking in as pid 5166 on host nano_06
> Daemon [[44301,0],1] not using static ports
> [nano_06:05166] [[44301,0],1] orted: up and running - waiting for commands!
> [plankton:23859] [[44301,0],0] node[0].name plankton daemon 0 arch ffca0200
> [plankton:23859] [[44301,0],0] node[1].name nano_06 daemon 1 arch ffca0200
> [plankton:23859] [[44301,0],0] orted_cmd: received add_local_procs
> [nano_06:05166] [[44301,0],1] node[0].name plankton daemon 0 arch ffca0200
> [nano_06:05166] [[44301,0],1] node[1].name nano_06 daemon 1 arch ffca0200
> [nano_06:05166] [[44301,0],1] orted_cmd: received add_local_procs
> [nano_06:05166] [[44301,0],1] orted_recv: received sync+nidmap from
> local proc [[44301,1],0]
> [nano_06:05166] [[44301,0],1] orted_recv: received sync+nidmap from
> local proc [[44301,1],1]
> [nano_06:05166] [[44301,0],1] orted_cmd: received collective data cmd
> [plankton:23859] [[44301,0],0] orted_cmd: received collective data cmd
> [plankton:23859] [[44301,0],0] orted_cmd: received message_local_procs
> [plankton:23859] [[44301,0],0] orted_cmd: received collective data cmd
> [plankton:23859] [[44301,0],0] orted_cmd: received message_local_procs
> [nano_06:05166] [[44301,0],1] orted_cmd: received collective data cmd
> [nano_06:05166] [[44301,0],1] orted_cmd: received message_local_procs
> [nano_06:05166] [[44301,0],1] orted_cmd: received collective data cmd
> [nano_06:05166] [[44301,0],1] orted_cmd: received collective data cmd
> [nano_06:05166] [[44301,0],1] orted_cmd: received message_local_procs
> [nano_06]I am #0/2
> [nano_06:05166] [[44301,0],1] orted_cmd: received collective data cmd
> [nano_06]I am #1/2
> [plankton:23859] [[44301,0],0] orted_cmd: received collective data cmd
> [plankton:23859] [[44301,0],0] orted_cmd: received message_local_procs
> [nano_06:05166] [[44301,0],1] orted_cmd: received collective data cmd
> [nano_06:05166] [[44301,0],1] orted_cmd: received message_local_procs
> [nano_06:05166] [[44301,0],1] orted_recv: received sync from local
> proc [[44301,1],1]
> [nano_06:05166] [[44301,0],1] orted_recv: received sync from local
> proc [[44301,1],0]
>  (Here it hangs)
>
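(The source of ./MPITest is not shown in the thread; judging from the "I am #0/2" lines in the log above, it is presumably something close to the following minimal sketch. This is only a guess at the actual program:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    /* would produce lines like "[nano_06]I am #0/2" as seen in the log */
    printf("[%s]I am #%d/%d\n", host, rank, size);

    MPI_Finalize();
    return 0;
}

Any program of this shape that reaches MPI_Finalize cleanly should be enough to reproduce the hang described here.)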
> Some don't even get to execute:
> [jody_at_plankton neander]$ mpirun -np 2 --host nano_01 --debug-daemons ./MPITest
> Daemon was launched on nano_01 - beginning to initialize
> Daemon [[44293,0],1] checking in as pid 5044 on host nano_01
> Daemon [[44293,0],1] not using static ports
> [nano_01:05044] [[44293,0],1] orted: up and running - waiting for commands!
> [plankton:23867] [[44293,0],0] node[0].name plankton daemon 0 arch ffca0200
> [plankton:23867] [[44293,0],0] node[1].name nano_01 daemon 1 arch ffca0200
> [plankton:23867] [[44293,0],0] orted_cmd: received add_local_procs
> [nano_01:05044] [[44293,0],1] node[0].name plankton daemon 0 arch ffca0200
> [nano_01:05044] [[44293,0],1] node[1].name nano_01 daemon 1 arch ffca0200
> [nano_01:05044] [[44293,0],1] orted_cmd: received add_local_procs
> [nano_01:05044] [[44293,0],1] orted_recv: received sync+nidmap from
> local proc [[44293,1],0]
> [nano_01:05044] [[44293,0],1] orted_cmd: received collective data cmd
>  (Here it hangs)
>
> When I run on one of the bad nodes with only 1 process and --debug-daemons,
> it works fine (output and clean completion), but without --debug-daemons it hangs.
> Sometimes, though, there is a crash (not always reproducible):
>
> [jody_at_plankton neander]$ mpirun -np 1 --host nano_04 --debug-daemons ./MPITest
> Daemon was launched on nano_04 - beginning to initialize
> Daemon [[44431,0],1] checking in as pid 5333 on host nano_04
> Daemon [[44431,0],1] not using static ports
> [plankton:23985] [[44431,0],0] node[0].name plankton daemon 0 arch ffca0200
> [plankton:23985] [[44431,0],0] node[1].name nano_04 daemon 1 arch ffca0200
> [plankton:23985] [[44431,0],0] orted_cmd: received add_local_procs
> [nano_04:05333] [[44431,0],1] orted: up and running - waiting for commands!
> [nano_04:05333] [[44431,0],1] node[0].name plankton daemon 0 arch ffca0200
> [nano_04:05333] [[44431,0],1] node[1].name nano_04 daemon 1 arch ffca0200
> [nano_04:05333] [[44431,0],1] orted_cmd: received add_local_procs
> [nano_04:05333] [[44431,0],1] orted_recv: received sync+nidmap from
> local proc [[44431,1],0]
> [nano_04:05333] [[44431,0],1] orted_cmd: received collective data cmd
> [plankton:23985] [[44431,0],0] orted_cmd: received collective data cmd
> [plankton:23985] [[44431,0],0] orted_cmd: received message_local_procs
> [nano_04:05333] [[44431,0],1] orted_cmd: received message_local_procs
> [nano_04:05333] [[44431,0],1] orted_cmd: received collective data cmd
> [plankton:23985] [[44431,0],0] orted_cmd: received collective data cmd
> [plankton:23985] [[44431,0],0] orted_cmd: received message_local_procs
> [nano_04:05333] [[44431,0],1] orted_cmd: received message_local_procs
> [nano_04:05333] [[44431,0],1] orted_cmd: received collective data cmd
> [nano_04]I am #0/1
> [plankton:23985] [[44431,0],0] orted_cmd: received collective data cmd
> [plankton:23985] [[44431,0],0] orted_cmd: received message_local_procs
> [nano_04:05333] [[44431,0],1] orted_cmd: received message_local_procs
> [nano_04:05333] [[44431,0],1] orted_recv: received sync from local
> proc [[44431,1],0]
> [nano_04:05333] [[44431,0],1] orted_cmd: received iof_complete cmd
> [nano_04:05333] [[44431,0],1] orted_cmd: received waitpid_fired cmd
> [plankton:23985] [[44431,0],0] orted_cmd: received exit
> [nano_04:05333] [[44431,0],1] orted_cmd: received exit
> [nano_04:05333] [[44431,0],1] orted: finalizing
> [nano_04:05333] *** Process received signal ***
> [nano_04:05333] Signal: Segmentation fault (11)
> [nano_04:05333] Signal code: Address not mapped (1)
> [nano_04:05333] Failing at address: 0xb7493e20
> [nano_04:05333] [ 0] [0xffffe40c]
> [nano_04:05333] [ 1]
> /opt/openmpi/lib/libopen-pal.so.0(opal_event_loop+0x27) [0xb7e65417]
> [nano_04:05333] [ 2]
> /opt/openmpi/lib/libopen-pal.so.0(opal_event_dispatch+0x1e)
> [0xb7e6543e]
> [nano_04:05333] [ 3]
> /opt/openmpi/lib/libopen-rte.so.0(orte_daemon+0x761) [0xb7ed3d71]
> [nano_04:05333] [ 4] orted [0x80487b4]
> [nano_04:05333] [ 5] /lib/libc.so.6(__libc_start_main+0xdc) [0xb7cc060c]
> [nano_04:05333] [ 6] orted [0x8048691]
> [nano_04:05333] *** End of error message ***
>
> Is that perhaps a consequence of configuring with --enable-mpi-threads
> and --enable-progress-threads?
>
> Thank You
>  Jody
>
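(For reference, the configure line behind the builds discussed here would have looked roughly like this; the /opt/openmpi prefix is taken from the backtrace above, and everything else is an assumption:

  $ ./configure --prefix=/opt/openmpi --enable-mpi-threads --enable-progress-threads

with the nodes that work apparently built from the same line minus --enable-progress-threads.)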