
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Random hangs using btl sm with OpenMPI 1.3.2/1.3.3 + gcc4.4?
From: Jonathan Dursi (ljdursi_at_[hidden])
Date: 2009-09-23 08:46:15


Hi, Eugene:

If it continues to be a problem for people to reproduce this, I'll see
what can be done about having an account made here for someone to poke
around. Alternatively, any suggestions for tests I can run to help
diagnose or verify the problem, or to figure out what's different about
this setup, would be greatly appreciated.

As for the btl_sm_num_fifos thing, it could be a bit of a red herring;
it's just something I started using following one of the previous bug
reports. However, it changes the behaviour pretty markedly. With the
sample program I submitted (i.e. the Sendrecvs looping around), and
with OpenMPI 1.3.2 (the version where I see the most extreme
problems, e.g. things fail every run), this always works:

mpirun -np 6 -mca btl_sm_num_fifos 5 -mca btl sm,self ./diffusion-mpi

and other, larger values of num_fifos also seem to work reliably, but
4 or fewer

mpirun -np 6 -mca btl_sm_num_fifos 4 -mca btl sm,self ./diffusion-mpi

always hangs as before: after some number of iterations, sometimes
fewer, sometimes more, always somewhere in the MPI_Sendrecv:
(gdb) where
#0 0x00002b9b0a661e80 in opal_progress@plt () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi.so.0
#1 0x00002b9b0a67e345 in ompi_request_default_wait () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi.so.0
#2 0x00002b9b0a6a42c0 in PMPI_Sendrecv () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi.so.0
#3 0x00002b9b0a43c540 in pmpi_sendrecv__ () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi_f77.so.0
#4 0x0000000000400eab in MAIN__ ()
#5 0x0000000000400fda in main (argc=1, argv=0x7fffb92cc078) at ../../../gcc-4.4.0/libgfortran/fmain.c:21
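For anyone trying to reproduce this: the failing pattern is a periodic ring, where each of the N ranks does an MPI_Sendrecv with both neighbours and rank 0's left neighbour wraps around to rank N-1. The actual diffusion-mpi source isn't shown here, so this is just an illustrative sketch of the neighbour arithmetic involved (the function name is hypothetical):

```python
def ring_neighbours(rank, nprocs):
    """Return (left, right) neighbour ranks with periodic wrap-around,
    as in the periodic variant of the test program described above."""
    left = (rank - 1) % nprocs   # rank 0's left neighbour wraps to nprocs - 1
    right = (rank + 1) % nprocs  # rank nprocs-1's right neighbour wraps to 0
    return left, right

# With 6 ranks, as in the mpirun -np 6 runs above:
print(ring_neighbours(0, 6))  # (5, 1)
print(ring_neighbours(5, 6))  # (4, 0)
```

In this pattern every rank has two live partners, which is what makes the shared-memory FIFO count matter for all ranks.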

On the other hand, if I set the leftmost and rightmost neighbours to
MPI_PROC_NULL as Jeff requested, the behaviour changes: any num_fifos
value greater than two works

mpirun -np 6 -mca btl_sm_num_fifos 3 -mca btl sm,self ./diffusion-mpi

But btl_sm_num_fifos 2 always hangs, either in the Sendrecv or in
the Finalize

mpirun -np 6 -mca btl_sm_num_fifos 2 -mca btl sm,self ./diffusion-mpi

And the default always hangs, usually in the Finalize but sometimes in
the Sendrecv:

mpirun -np 6 -mca btl sm,self ./diffusion-mpi
(gdb) where
#0 0x00002ad54846d51f in poll () from /lib64/libc.so.6
#1 0x00002ad54717a7c1 in poll_dispatch () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libopen-pal.so.0
#2 0x00002ad547179659 in opal_event_base_loop () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libopen-pal.so.0
#3 0x00002ad54716e189 in opal_progress () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libopen-pal.so.0
#4 0x00002ad54931ef15 in barrier () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/openmpi/mca_grpcomm_bad.so
#5 0x00002ad546ca358b in ompi_mpi_finalize () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi.so.0
#6 0x00002ad546a5d529 in pmpi_finalize__ () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi_f77.so.0
#7 0x0000000000400f99 in MAIN__ ()
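For contrast, the non-periodic variant turns the ring into an open chain: the end ranks address one side of their MPI_Sendrecv to MPI_PROC_NULL, and per the MPI standard a send or receive to MPI_PROC_NULL completes immediately as a no-op. Another illustrative sketch (function name hypothetical; the -2 sentinel matches Open MPI's mpi.h, but the value is implementation-defined):

```python
# Sentinel standing in for MPI_PROC_NULL (-2 in Open MPI's mpi.h;
# implementation-defined in general, so treat this value as illustrative).
MPI_PROC_NULL = -2

def chain_neighbours(rank, nprocs):
    """Return (left, right) neighbour ranks for an open chain: the end
    ranks get MPI_PROC_NULL instead of wrapping around."""
    left = rank - 1 if rank > 0 else MPI_PROC_NULL
    right = rank + 1 if rank < nprocs - 1 else MPI_PROC_NULL
    return left, right

# With 6 ranks, as in the mpirun -np 6 runs above:
print(chain_neighbours(0, 6))  # (-2, 1): no left neighbour
print(chain_neighbours(5, 6))  # (4, -2): no right neighbour
```

With the wrap-around link removed, the end ranks each have only one live partner, which may be why the hang threshold for btl_sm_num_fifos shifts between the two variants.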

So to summarize:

OpenMPI 1.3.2 + gcc4.4.0

Test problem with periodic (left neighbour of proc 0 is proc N-1) Sendrecv()s:
  Default always hangs in Sendrecv after a random number of iterations
  Turning off sm (-mca btl self,tcp): not observed to hang
  Using -mca btl_sm_num_fifos 5 (for a 6-task job): not observed to hang
  Using fewer than 5 fifos: hangs in Sendrecv after a random number of iterations, or in Finalize

Test problem with non-periodic (left neighbour of proc 0 is MPI_PROC_NULL) Sendrecv()s:
  Default always hangs, in Sendrecv after a random number of iterations or at Finalize
  Turning off sm (-mca btl self,tcp): not observed to hang
  Using -mca btl_sm_num_fifos 5 (for a 6-task job): not observed to hang
  Using fewer than 5 but more than 2 fifos: not observed to hang
  Using 2 fifos: hangs in Finalize, or in Sendrecv after a random number of iterations

OpenMPI 1.3.3 + gcc4.4.0

Test problem with periodic (left neighbour of proc 0 is proc N-1) Sendrecv()s:
  Default sometimes (~20% of the time) hangs in Sendrecv after a random number of iterations
  Turning off sm (-mca btl self,tcp): not observed to hang
  Using -mca btl_sm_num_fifos 5 (for a 6-task job): not observed to hang
  Using fewer than 5 but more than 2 fifos: not observed to hang
  Using 2 fifos: sometimes (~20% of the time) hangs in Finalize or in Sendrecv after a random number of iterations, but sometimes completes

Test problem with non-periodic (left neighbour of proc 0 is MPI_PROC_NULL) Sendrecv()s:
  Default usually (~75% of the time) hangs, in Finalize or in Sendrecv after a random number of iterations
  Turning off sm (-mca btl self,tcp): not observed to hang
  Using -mca btl_sm_num_fifos 5 (for a 6-task job): not observed to hang
  Using fewer than 5 but more than 2 fifos: not observed to hang
  Using 2 fifos: usually (~75% of the time) hangs in Finalize or Sendrecv after a random number of iterations, but sometimes completes

OpenMPI 1.3.2 + intel 11.0 compilers

We are seeing a problem that we believe is related: roughly 1% of
certain single-node jobs hang, and turning off sm or setting
btl_sm_num_fifos to NP-1 eliminates this.

    - Jonathan

-- 
Jonathan Dursi <ljdursi_at_[hidden]>