Open MPI Development Mailing List Archives

Subject: [OMPI devel] trunk hang (when remote orted has to spawn another orted?)
From: Eugene Loh (eugene.loh_at_[hidden])
Date: 2012-05-08 00:35:59


Here is another trunk hang. I get it if I use at least three remote
nodes. E.g., with r26385:

% mpirun -H remoteA,remoteB,remoteC -n 2 hostname
[remoteA:20508] [[54625,0],1] ORTE_ERROR_LOG: Not found in file base/ess_base_fns.c at line 135
[remoteA:20508] [[54625,0],1] unable to get hostname for daemon 3
[remoteA:20508] [[54625,0],1] ORTE_ERROR_LOG: Not found in file orted/orted_comm.c at line 345
[hang]

I think the problem first appeared with r26359.

I guess if a remote orted has to spawn another orted, it gets here:

   opal_pointer_array_get_item(table = 0x7e410, element_index = 3), line 136 in "opal_pointer_array.h"
   find_proc(proc = 0xffbff264), line 51 in "ess_base_fns.c"
   orte_ess_base_proc_get_hostname(proc = 0xffbff264), line 134 in "ess_base_fns.c"
   remote_spawn(launch = 0x85f30), line 812 in "plm_rsh_module.c"
   orte_daemon_recv(status = 0, sender = 0x85f54, buffer = 0x85f30, tag = 1U, cbdata = (nil)), line 344 in "orted_comm.c"
   orte_rml_recv_msg_callback(status = 0, peer = 0x69014, iov = 0x7d7e0, count = 2, tag = 1U, cbdata = 0x85ec0), line 68 in "rml_oob_recv.c"
   mca_oob_tcp_msg_data(msg = 0x85310, peer = 0x69000), line 436 in "oob_tcp_msg.c"
   mca_oob_tcp_msg_recv_complete(msg = 0x85310, peer = 0x69000), line 322 in "oob_tcp_msg.c"
   mca_oob_tcp_peer_recv_handler(sd = 13, flags = 2, user = 0x69000), line 942 in "oob_tcp_peer.c"
   event_persist_closure(base = 0x3c600, ev = 0x647a8), line 1280 in "event.c"
   event_process_active_single_queue(base = 0x3c600, activeq = 0x3c4f0), line 1324 in "event.c"
   event_process_active(base = 0x3c600), line 1396 in "event.c"
   opal_libevent2013_event_base_loop(base = 0x3c600, flags = 1), line 1593 in "event.c"
   orte_daemon(argc = 19, argv = 0xffbff97c), line 729 in "orted_main.c"
   main(argc = 19, argv = 0xffbff97c), line 62 in "orted.c"

So, in my case, I'm trying to look up item 3 while only item 1 in the
array appears to be initialized.
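
To illustrate the failure mode I think I'm seeing, here is a minimal stand-alone sketch. It is not the actual ORTE/OPAL code: the sketch_* types and functions are hypothetical stand-ins for the pointer array and the hostname lookup, and the table size of 4 is an assumption. The point is only that a lookup on a slot that was never filled in comes back NULL, which the caller then reports as "Not found".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a sparse pointer table
 * (NOT the real opal_pointer_array_t). */
typedef struct {
    void **slots;   /* unset entries are NULL */
    int    size;    /* number of slots allocated */
} sketch_table_t;

/* Hypothetical stand-in for a per-daemon record. */
typedef struct {
    const char *hostname;
} sketch_daemon_t;

/* Return the item stored at index, or NULL if the index is out of
 * range or the slot was never populated. */
void *sketch_table_get(const sketch_table_t *t, int index)
{
    assert(t != NULL);
    if (index < 0 || index >= t->size) {
        return NULL;
    }
    return t->slots[index];
}

/* Return the hostname recorded for daemon vpid, or NULL when the
 * table has no entry for it (the "Not found" case). */
const char *sketch_get_hostname(const sketch_table_t *daemons, int vpid)
{
    const sketch_daemon_t *d = sketch_table_get(daemons, vpid);
    return (d != NULL) ? d->hostname : NULL;
}
```

With a table in which only slot 1 is populated (the daemon's own entry), asking for the hostname of daemon 1 works, while asking for daemon 3 returns NULL. That would match the "unable to get hostname for daemon 3" message above, assuming the real lookup behaves like this sketch.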