Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] After upgrading to 1.3.2 some nodes hang on MPI-Applications
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-06-11 09:22:15


It's the --enable-progress-threads flag that causes the problem - we
don't really support that yet. Maybe someday.

Take that out and you should be okay, with the caveats expressed on
the OMPI web site (i.e., not everything works with threads yet).
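
A quick way to see which nodes were built with progress threads is to look at the thread-support line that ompi_info prints. The snippet below is only a sketch: it assumes the line looks roughly like "Thread support: posix (mpi: yes, progress: no)", which may differ between versions, and the helper name progress_threads_enabled is made up for illustration.

```shell
#!/bin/sh
# Sketch: detect whether an Open MPI build has progress threads enabled.
# Assumption: ompi_info prints a line similar to
#   Thread support: posix (mpi: yes, progress: no)
# The exact wording may vary by version, so check the raw output as well.

# Hypothetical helper: reads ompi_info output on stdin and
# returns 0 if progress threads appear to be enabled, non-zero otherwise.
progress_threads_enabled() {
    grep -i 'thread support' | grep -qi 'progress: *yes'
}

# Example use on a node where ompi_info is installed:
#   ompi_info | progress_threads_enabled && echo "built with progress threads"
```

Any node that reports progress threads would then need Open MPI reconfigured with --enable-mpi-threads only (that is, without --enable-progress-threads) and rebuilt.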

On Jun 11, 2009, at 4:56 AM, jody wrote:

> More info:
> I checked and found that not all nodes are equal:
> the ones that don't work have mpi-threads *and* progress-threads
> enabled,
> whereas the ones that work have only mpi-threads enabled
>
> Is there a problem when both thread-types are enabled?
>
> Jody
>
> On Thu, Jun 11, 2009 at 12:19 PM, jody<jody.xha_at_[hidden]> wrote:
>> Hi
>>
>> After updating all my nodes to Open-MPI 1.3.2 (with
>> --enable-mpi-threads), some of them fail to execute a simple MPI test
>> program - they seem to hang.
>> With --debug-daemons the application seems to execute (two lines of
>> output) but hangs before returning:
>>
>> [jody_at_aplankton neander]$ mpirun -np 2 --host nano_06 --debug-daemons ./MPITest
>> Daemon was launched on nano_06 - beginning to initialize
>> Daemon [[44301,0],1] checking in as pid 5166 on host nano_06
>> Daemon [[44301,0],1] not using static ports
>> [nano_06:05166] [[44301,0],1] orted: up and running - waiting for commands!
>> [plankton:23859] [[44301,0],0] node[0].name plankton daemon 0 arch ffca0200
>> [plankton:23859] [[44301,0],0] node[1].name nano_06 daemon 1 arch ffca0200
>> [plankton:23859] [[44301,0],0] orted_cmd: received add_local_procs
>> [nano_06:05166] [[44301,0],1] node[0].name plankton daemon 0 arch ffca0200
>> [nano_06:05166] [[44301,0],1] node[1].name nano_06 daemon 1 arch ffca0200
>> [nano_06:05166] [[44301,0],1] orted_cmd: received add_local_procs
>> [nano_06:05166] [[44301,0],1] orted_recv: received sync+nidmap from local proc [[44301,1],0]
>> [nano_06:05166] [[44301,0],1] orted_recv: received sync+nidmap from local proc [[44301,1],1]
>> [nano_06:05166] [[44301,0],1] orted_cmd: received collective data cmd
>> [plankton:23859] [[44301,0],0] orted_cmd: received collective data cmd
>> [plankton:23859] [[44301,0],0] orted_cmd: received message_local_procs
>> [plankton:23859] [[44301,0],0] orted_cmd: received collective data cmd
>> [plankton:23859] [[44301,0],0] orted_cmd: received message_local_procs
>> [nano_06:05166] [[44301,0],1] orted_cmd: received collective data cmd
>> [nano_06:05166] [[44301,0],1] orted_cmd: received message_local_procs
>> [nano_06:05166] [[44301,0],1] orted_cmd: received collective data cmd
>> [nano_06:05166] [[44301,0],1] orted_cmd: received collective data cmd
>> [nano_06:05166] [[44301,0],1] orted_cmd: received message_local_procs
>> [nano_06]I am #0/2
>> [nano_06:05166] [[44301,0],1] orted_cmd: received collective data cmd
>> [nano_06]I am #1/2
>> [plankton:23859] [[44301,0],0] orted_cmd: received collective data cmd
>> [plankton:23859] [[44301,0],0] orted_cmd: received message_local_procs
>> [nano_06:05166] [[44301,0],1] orted_cmd: received collective data cmd
>> [nano_06:05166] [[44301,0],1] orted_cmd: received message_local_procs
>> [nano_06:05166] [[44301,0],1] orted_recv: received sync from local proc [[44301,1],1]
>> [nano_06:05166] [[44301,0],1] orted_recv: received sync from local proc [[44301,1],0]
>> (Here it hangs)
>>
>> Some don't even get to execute:
>> [jody_at_plankton neander]$ mpirun -np 2 --host nano_01 --debug-daemons ./MPITest
>> Daemon was launched on nano_01 - beginning to initialize
>> Daemon [[44293,0],1] checking in as pid 5044 on host nano_01
>> Daemon [[44293,0],1] not using static ports
>> [nano_01:05044] [[44293,0],1] orted: up and running - waiting for commands!
>> [plankton:23867] [[44293,0],0] node[0].name plankton daemon 0 arch ffca0200
>> [plankton:23867] [[44293,0],0] node[1].name nano_01 daemon 1 arch ffca0200
>> [plankton:23867] [[44293,0],0] orted_cmd: received add_local_procs
>> [nano_01:05044] [[44293,0],1] node[0].name plankton daemon 0 arch ffca0200
>> [nano_01:05044] [[44293,0],1] node[1].name nano_01 daemon 1 arch ffca0200
>> [nano_01:05044] [[44293,0],1] orted_cmd: received add_local_procs
>> [nano_01:05044] [[44293,0],1] orted_recv: received sync+nidmap from local proc [[44293,1],0]
>> [nano_01:05044] [[44293,0],1] orted_cmd: received collective data cmd
>> (Here it hangs)
>>
>> When I run on one of the bad nodes with only 1 process and
>> --debug-daemons, it works fine (output & clean completion), but
>> without --debug-daemons it hangs.
>> But sometimes there is a crash (not always reproducible):
>>
>> [jody_at_plankton neander]$ mpirun -np 1 --host nano_04 --debug-daemons ./MPITest
>> Daemon was launched on nano_04 - beginning to initialize
>> Daemon [[44431,0],1] checking in as pid 5333 on host nano_04
>> Daemon [[44431,0],1] not using static ports
>> [plankton:23985] [[44431,0],0] node[0].name plankton daemon 0 arch ffca0200
>> [plankton:23985] [[44431,0],0] node[1].name nano_04 daemon 1 arch ffca0200
>> [plankton:23985] [[44431,0],0] orted_cmd: received add_local_procs
>> [nano_04:05333] [[44431,0],1] orted: up and running - waiting for commands!
>> [nano_04:05333] [[44431,0],1] node[0].name plankton daemon 0 arch ffca0200
>> [nano_04:05333] [[44431,0],1] node[1].name nano_04 daemon 1 arch ffca0200
>> [nano_04:05333] [[44431,0],1] orted_cmd: received add_local_procs
>> [nano_04:05333] [[44431,0],1] orted_recv: received sync+nidmap from local proc [[44431,1],0]
>> [nano_04:05333] [[44431,0],1] orted_cmd: received collective data cmd
>> [plankton:23985] [[44431,0],0] orted_cmd: received collective data cmd
>> [plankton:23985] [[44431,0],0] orted_cmd: received message_local_procs
>> [nano_04:05333] [[44431,0],1] orted_cmd: received message_local_procs
>> [nano_04:05333] [[44431,0],1] orted_cmd: received collective data cmd
>> [plankton:23985] [[44431,0],0] orted_cmd: received collective data cmd
>> [plankton:23985] [[44431,0],0] orted_cmd: received message_local_procs
>> [nano_04:05333] [[44431,0],1] orted_cmd: received message_local_procs
>> [nano_04:05333] [[44431,0],1] orted_cmd: received collective data cmd
>> [nano_04]I am #0/1
>> [plankton:23985] [[44431,0],0] orted_cmd: received collective data cmd
>> [plankton:23985] [[44431,0],0] orted_cmd: received message_local_procs
>> [nano_04:05333] [[44431,0],1] orted_cmd: received message_local_procs
>> [nano_04:05333] [[44431,0],1] orted_recv: received sync from local proc [[44431,1],0]
>> [nano_04:05333] [[44431,0],1] orted_cmd: received iof_complete cmd
>> [nano_04:05333] [[44431,0],1] orted_cmd: received waitpid_fired cmd
>> [plankton:23985] [[44431,0],0] orted_cmd: received exit
>> [nano_04:05333] [[44431,0],1] orted_cmd: received exit
>> [nano_04:05333] [[44431,0],1] orted: finalizing
>> [nano_04:05333] *** Process received signal ***
>> [nano_04:05333] Signal: Segmentation fault (11)
>> [nano_04:05333] Signal code: Address not mapped (1)
>> [nano_04:05333] Failing at address: 0xb7493e20
>> [nano_04:05333] [ 0] [0xffffe40c]
>> [nano_04:05333] [ 1] /opt/openmpi/lib/libopen-pal.so.0(opal_event_loop+0x27) [0xb7e65417]
>> [nano_04:05333] [ 2] /opt/openmpi/lib/libopen-pal.so.0(opal_event_dispatch+0x1e) [0xb7e6543e]
>> [nano_04:05333] [ 3] /opt/openmpi/lib/libopen-rte.so.0(orte_daemon+0x761) [0xb7ed3d71]
>> [nano_04:05333] [ 4] orted [0x80487b4]
>> [nano_04:05333] [ 5] /lib/libc.so.6(__libc_start_main+0xdc) [0xb7cc060c]
>> [nano_04:05333] [ 6] orted [0x8048691]
>> [nano_04:05333] *** End of error message ***
>>
>>
>>
>>
>> Is that perhaps a consequence of configuring with
>> --enable-mpi-threads and --enable-progress-threads?
>>
>> Thank You
>> Jody
>>