Open MPI Development Mailing List Archives

Subject: [OMPI devel] Routed 'unity' broken on trunk
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2008-03-31 09:57:35


Ralph,

I've just noticed that the 'unity' routed component appears to be
broken when using more than one machine. I'm using Odin and r18028 of
the trunk, and have confirmed that the problem occurs with both SLURM
and rsh. I think the break came in on Friday, since that is when some
of my MTT tests started to hang and fail, but I cannot point to a
specific revision yet. The backtraces of the hung processes (enclosed)
point to the grpcomm allgather routine.

The 'noop' program calls MPI_Init, sleeps, then calls MPI_Finalize.
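
For reference, a minimal sketch of what such a program looks like. This
is a hypothetical stand-in rather than the actual noop.c: the sleep
duration is an assumption, and the handling of the '-v 1' argument is
left out because it is not described here.

```c
/* Hypothetical stand-in for noop.c; the real source is not shown in this report. */
#include <mpi.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    /* In the failing case the processes never return from this call:
     * the backtraces show them waiting in the modex allgather. */
    MPI_Init(&argc, &argv);

    /* Assumed duration; the original sleep time is not stated. */
    sleep(5);

    MPI_Finalize();
    return 0;
}
```

Built with mpicc, a program like this exercises only startup and
shutdown, so a hang here points at the runtime rather than at any
application-level communication.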

RSH example, launched from odin023 (so no SLURM variables are set):
These work:
  shell$ mpirun -np 2 -host odin023 noop -v 1
  shell$ mpirun -np 2 -host odin023,odin024 noop -v 1
  shell$ mpirun -np 2 -mca routed unity -host odin023 noop -v 1

This hangs:
  shell$ mpirun -np 2 -mca routed unity -host odin023,odin024 noop -v 1

If I attach to the 'noop' process on odin023 I get the following
backtrace:
------------------------------------------------
(gdb) bt
#0  0x0000002a96226b39 in syscall () from /lib64/tls/libc.so.6
#1  0x0000002a95a1e485 in epoll_wait (epfd=3, events=0x50b330,
    maxevents=1023, timeout=1000) at epoll_sub.c:61
#2  0x0000002a95a1e7f7 in epoll_dispatch (base=0x506c30, arg=0x506910,
    tv=0x7fbfffe840) at epoll.c:210
#3  0x0000002a95a1c057 in opal_event_base_loop (base=0x506c30,
    flags=5) at event.c:779
#4  0x0000002a95a1be8f in opal_event_loop (flags=5) at event.c:702
#5  0x0000002a95a0bef8 in opal_progress ()
    at runtime/opal_progress.c:169
#6  0x0000002a958b9e48 in orte_grpcomm_base_allgather
    (sbuf=0x7fbfffeae0, rbuf=0x7fbfffea80)
    at base/grpcomm_base_allgather.c:238
#7  0x0000002a958bd37c in orte_grpcomm_base_modex (procs=0x0)
    at base/grpcomm_base_modex.c:413
#8  0x0000002a956b8416 in ompi_mpi_init (argc=3, argv=0x7fbfffed58,
    requested=0, provided=0x7fbfffec38) at runtime/ompi_mpi_init.c:510
#9  0x0000002a956f2109 in PMPI_Init (argc=0x7fbfffec7c,
    argv=0x7fbfffec70) at pinit.c:88
#10 0x0000000000400bf4 in main (argc=3, argv=0x7fbfffed58) at noop.c:39
------------------------------------------------

The 'noop' process on odin024 has a similar backtrace:
------------------------------------------------
(gdb) bt
#0  0x0000002a96226b39 in syscall () from /lib64/tls/libc.so.6
#1  0x0000002a95a1e485 in epoll_wait (epfd=3, events=0x50b390,
    maxevents=1023, timeout=1000) at epoll_sub.c:61
#2  0x0000002a95a1e7f7 in epoll_dispatch (base=0x506cc0, arg=0x506c20,
    tv=0x7fbfffe9d0) at epoll.c:210
#3  0x0000002a95a1c057 in opal_event_base_loop (base=0x506cc0,
    flags=5) at event.c:779
#4  0x0000002a95a1be8f in opal_event_loop (flags=5) at event.c:702
#5  0x0000002a95a0bef8 in opal_progress ()
    at runtime/opal_progress.c:169
#6  0x0000002a958b97c5 in orte_grpcomm_base_allgather
    (sbuf=0x7fbfffec70, rbuf=0x7fbfffec10)
    at base/grpcomm_base_allgather.c:163
#7  0x0000002a958bd37c in orte_grpcomm_base_modex (procs=0x0)
    at base/grpcomm_base_modex.c:413
#8  0x0000002a956b8416 in ompi_mpi_init (argc=3, argv=0x7fbfffeee8,
    requested=0, provided=0x7fbfffedc8) at runtime/ompi_mpi_init.c:510
#9  0x0000002a956f2109 in PMPI_Init (argc=0x7fbfffee0c,
    argv=0x7fbfffee00) at pinit.c:88
#10 0x0000000000400bf4 in main (argc=3, argv=0x7fbfffeee8) at noop.c:39
------------------------------------------------

Cheers,
Josh