Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] hang with launch including remote nodes
From: Ralph Castain (rhc_at_[hidden])
Date: 2012-06-21 07:18:18


Got it! Will take a little thinking to fix - it's basically a conflict between rollup and tree spawn. For now, you can run with:

-mca orte_use_common_port 0 -mca plm_rsh_no_tree_spawn 1
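
Applied to the command from the original report, the workaround would presumably look like the sketch below. The hosts and the `hostname` test program come from Eugene's message; the `WORKAROUND` variable name is only for illustration. The leading `echo` just prints the assembled command, so remove it to actually launch.

```shell
# Workaround parameters suggested above: disable the common port
# and disable tree spawn in the rsh launcher.
WORKAROUND="-mca orte_use_common_port 0 -mca plm_rsh_no_tree_spawn 1"

# Print the full command line; drop 'echo' to run it for real.
echo mpirun $WORKAROUND -H remote1,remote2 -n 2 hostname
```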

Sorry about that - thanks for letting me know!
Ralph

On Jun 20, 2012, at 9:48 PM, Eugene Loh wrote:

> On 06/19/12 23:11, Ralph Castain wrote:
>> Also, how did you configure this version?
> --enable-heterogeneous
> --enable-cxx-exceptions
> --enable-shared
> --enable-orterun-prefix-by-default
> --with-sge
> --enable-mpi-f90
> --with-mpi-f90-size=small
> --disable-peruse
> --disable-mpi-thread-multiple
> --disable-debug
> --disable-mem-debug
> --disable-mem-profile
> --enable-contrib-no-build=vt
>
>> If you had --disable-static, then there was a bug that would indeed cause a hang. Just committing that fix now.
> I still get a hang even with r26623.
>> On Jun 19, 2012, at 9:01 PM, Ralph Castain wrote:
>>> See if it works with -mca orte_use_common_port 0
>
> I get a segfault:
>
> [remote1:01409] *** Process received signal ***
> [remote1:01409] Signal: Segmentation Fault (11)
> [remote1:01409] Signal code: Address not mapped (1)
> [remote1:01409] Failing at address: 2c
> /home/eugene/r26609/lib/libopen-rte.so.0.0.0'show_stackframe+0x7d0
> /lib/amd64/libc.so.1'__sighndlr+0x6
> /lib/amd64/libc.so.1'call_user_handler+0x2c5
> /home/eugene/r26609/lib/libopen-rte.so.0.0.0'orte_grpcomm_base_rollup_recv+0x73 [Signal 11 (SEGV)]
> /home/eugene/r26609/lib/openmpi/mca_rml_oob.so'orte_rml_recv_msg_callback+0x9c
> /home/eugene/r26609/lib/openmpi/mca_oob_tcp.so'mca_oob_tcp_msg_data+0x283
> /home/eugene/r26609/lib/libopen-rte.so.0.0.0'event_process_active_single_queue+0x54c
> /home/eugene/r26609/lib/libopen-rte.so.0.0.0'event_process_active+0x41
> /home/eugene/r26609/lib/libopen-rte.so.0.0.0'opal_libevent2019_event_base_loop+0x606
> /home/eugene/r26609/lib/libopen-rte.so.0.0.0'orte_daemon+0xd6d
> /home/eugene/r26609/bin/orted'0xd4b
> [remote1:01409] *** End of error message ***
> Segmentation Fault (core dumped)
>
>>>
>>> On Jun 19, 2012, at 8:31 PM, Eugene Loh wrote:
>>>> I'm having bad luck with the trunk starting with r26609. Basically, things hang if I run
>>>>
>>>> mpirun -H remote1,remote2 -n 2 hostname
>>>>
>>>> where remote1 and remote2 are remote nodes.
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel