
Open MPI User's Mailing List Archives


Subject: [OMPI users] After upgrading to 1.3.2 some nodes hang on MPI-Applications
From: jody (jody.xha_at_[hidden])
Date: 2009-06-11 06:19:00


Hi

After updating all my nodes to Open MPI 1.3.2 (built with
--enable-mpi-threads), some of them fail to execute a simple MPI test
program - they seem to hang.
With --debug-daemons the application seems to execute (two lines of
output) but hangs before returning:

[jody_at_aplankton neander]$ mpirun -np 2 --host nano_06 --debug-daemons ./MPITest
Daemon was launched on nano_06 - beginning to initialize
Daemon [[44301,0],1] checking in as pid 5166 on host nano_06
Daemon [[44301,0],1] not using static ports
[nano_06:05166] [[44301,0],1] orted: up and running - waiting for commands!
[plankton:23859] [[44301,0],0] node[0].name plankton daemon 0 arch ffca0200
[plankton:23859] [[44301,0],0] node[1].name nano_06 daemon 1 arch ffca0200
[plankton:23859] [[44301,0],0] orted_cmd: received add_local_procs
[nano_06:05166] [[44301,0],1] node[0].name plankton daemon 0 arch ffca0200
[nano_06:05166] [[44301,0],1] node[1].name nano_06 daemon 1 arch ffca0200
[nano_06:05166] [[44301,0],1] orted_cmd: received add_local_procs
[nano_06:05166] [[44301,0],1] orted_recv: received sync+nidmap from local proc [[44301,1],0]
[nano_06:05166] [[44301,0],1] orted_recv: received sync+nidmap from local proc [[44301,1],1]
[nano_06:05166] [[44301,0],1] orted_cmd: received collective data cmd
[plankton:23859] [[44301,0],0] orted_cmd: received collective data cmd
[plankton:23859] [[44301,0],0] orted_cmd: received message_local_procs
[plankton:23859] [[44301,0],0] orted_cmd: received collective data cmd
[plankton:23859] [[44301,0],0] orted_cmd: received message_local_procs
[nano_06:05166] [[44301,0],1] orted_cmd: received collective data cmd
[nano_06:05166] [[44301,0],1] orted_cmd: received message_local_procs
[nano_06:05166] [[44301,0],1] orted_cmd: received collective data cmd
[nano_06:05166] [[44301,0],1] orted_cmd: received collective data cmd
[nano_06:05166] [[44301,0],1] orted_cmd: received message_local_procs
[nano_06]I am #0/2
[nano_06:05166] [[44301,0],1] orted_cmd: received collective data cmd
[nano_06]I am #1/2
[plankton:23859] [[44301,0],0] orted_cmd: received collective data cmd
[plankton:23859] [[44301,0],0] orted_cmd: received message_local_procs
[nano_06:05166] [[44301,0],1] orted_cmd: received collective data cmd
[nano_06:05166] [[44301,0],1] orted_cmd: received message_local_procs
[nano_06:05166] [[44301,0],1] orted_recv: received sync from local proc [[44301,1],1]
[nano_06:05166] [[44301,0],1] orted_recv: received sync from local proc [[44301,1],0]
 (Here it hangs)
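For reference, the source of ./MPITest is not shown here; a minimal MPI program producing the "[nano_06]I am #0/2" style lines above might look like the following sketch (the actual test program may differ):

```c
/* Hypothetical reconstruction of the minimal test program (./MPITest).
 * It only prints its rank and the communicator size, matching the
 * "[nano_06]I am #0/2" lines in the output above. Note that the hang
 * occurs during shutdown, i.e. after MPI_Finalize() has been reached. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    printf("[%s]I am #%d/%d\n", host, rank, size);

    MPI_Finalize();
    return 0;
}
```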

Some don't even get to execute:
[jody_at_plankton neander]$ mpirun -np 2 --host nano_01 --debug-daemons ./MPITest
Daemon was launched on nano_01 - beginning to initialize
Daemon [[44293,0],1] checking in as pid 5044 on host nano_01
Daemon [[44293,0],1] not using static ports
[nano_01:05044] [[44293,0],1] orted: up and running - waiting for commands!
[plankton:23867] [[44293,0],0] node[0].name plankton daemon 0 arch ffca0200
[plankton:23867] [[44293,0],0] node[1].name nano_01 daemon 1 arch ffca0200
[plankton:23867] [[44293,0],0] orted_cmd: received add_local_procs
[nano_01:05044] [[44293,0],1] node[0].name plankton daemon 0 arch ffca0200
[nano_01:05044] [[44293,0],1] node[1].name nano_01 daemon 1 arch ffca0200
[nano_01:05044] [[44293,0],1] orted_cmd: received add_local_procs
[nano_01:05044] [[44293,0],1] orted_recv: received sync+nidmap from local proc [[44293,1],0]
[nano_01:05044] [[44293,0],1] orted_cmd: received collective data cmd
 (Here it hangs)

When I call one of the bad nodes with only 1 processor and --debug-daemons,
it works fine (output and clean completion), but without --debug-daemons it hangs.
But sometimes there is a crash (not always reproducible):

[jody_at_plankton neander]$ mpirun -np 1 --host nano_04 --debug-daemons ./MPITest
Daemon was launched on nano_04 - beginning to initialize
Daemon [[44431,0],1] checking in as pid 5333 on host nano_04
Daemon [[44431,0],1] not using static ports
[plankton:23985] [[44431,0],0] node[0].name plankton daemon 0 arch ffca0200
[plankton:23985] [[44431,0],0] node[1].name nano_04 daemon 1 arch ffca0200
[plankton:23985] [[44431,0],0] orted_cmd: received add_local_procs
[nano_04:05333] [[44431,0],1] orted: up and running - waiting for commands!
[nano_04:05333] [[44431,0],1] node[0].name plankton daemon 0 arch ffca0200
[nano_04:05333] [[44431,0],1] node[1].name nano_04 daemon 1 arch ffca0200
[nano_04:05333] [[44431,0],1] orted_cmd: received add_local_procs
[nano_04:05333] [[44431,0],1] orted_recv: received sync+nidmap from local proc [[44431,1],0]
[nano_04:05333] [[44431,0],1] orted_cmd: received collective data cmd
[plankton:23985] [[44431,0],0] orted_cmd: received collective data cmd
[plankton:23985] [[44431,0],0] orted_cmd: received message_local_procs
[nano_04:05333] [[44431,0],1] orted_cmd: received message_local_procs
[nano_04:05333] [[44431,0],1] orted_cmd: received collective data cmd
[plankton:23985] [[44431,0],0] orted_cmd: received collective data cmd
[plankton:23985] [[44431,0],0] orted_cmd: received message_local_procs
[nano_04:05333] [[44431,0],1] orted_cmd: received message_local_procs
[nano_04:05333] [[44431,0],1] orted_cmd: received collective data cmd
[nano_04]I am #0/1
[plankton:23985] [[44431,0],0] orted_cmd: received collective data cmd
[plankton:23985] [[44431,0],0] orted_cmd: received message_local_procs
[nano_04:05333] [[44431,0],1] orted_cmd: received message_local_procs
[nano_04:05333] [[44431,0],1] orted_recv: received sync from local proc [[44431,1],0]
[nano_04:05333] [[44431,0],1] orted_cmd: received iof_complete cmd
[nano_04:05333] [[44431,0],1] orted_cmd: received waitpid_fired cmd
[plankton:23985] [[44431,0],0] orted_cmd: received exit
[nano_04:05333] [[44431,0],1] orted_cmd: received exit
[nano_04:05333] [[44431,0],1] orted: finalizing
[nano_04:05333] *** Process received signal ***
[nano_04:05333] Signal: Segmentation fault (11)
[nano_04:05333] Signal code: Address not mapped (1)
[nano_04:05333] Failing at address: 0xb7493e20
[nano_04:05333] [ 0] [0xffffe40c]
[nano_04:05333] [ 1] /opt/openmpi/lib/libopen-pal.so.0(opal_event_loop+0x27) [0xb7e65417]
[nano_04:05333] [ 2] /opt/openmpi/lib/libopen-pal.so.0(opal_event_dispatch+0x1e) [0xb7e6543e]
[nano_04:05333] [ 3] /opt/openmpi/lib/libopen-rte.so.0(orte_daemon+0x761) [0xb7ed3d71]
[nano_04:05333] [ 4] orted [0x80487b4]
[nano_04:05333] [ 5] /lib/libc.so.6(__libc_start_main+0xdc) [0xb7cc060c]
[nano_04:05333] [ 6] orted [0x8048691]
[nano_04:05333] *** End of error message ***

Is that perhaps a consequence of configuring with --enable-mpi-threads
and --enable-progress-threads?
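In case it helps, this is how I check which thread options each node's installation was actually built with (assuming the /opt/openmpi installation that appears in the backtrace above):

```shell
# Show the thread-support settings of the installed Open MPI build;
# run this on each node to make sure they all report the same thing.
/opt/openmpi/bin/ompi_info | grep -i thread
```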

Thank You
  Jody