Subject: Re: [OMPI users] Random hangs using btl sm with OpenMPI 1.3.2/1.3.3 + gcc4.4?
From: Jonathan Dursi (ljdursi_at_[hidden])
Date: 2009-09-23 08:46:15


Hi, Eugene:

If it continues to be a problem for people to reproduce this, I'll see
what can be done about having an account made here for someone to poke
around. Alternatively, any suggestions for tests I can run to help
diagnose or verify the problem, or to figure out what's different about
this setup, would be greatly appreciated.

As for the btl_sm_num_fifos thing, it could be a bit of a red herring;
it's just something I started using after one of the previous bug
reports. However, it changes the behaviour pretty markedly. With
the sample program I submitted (i.e., the Sendrecvs looping around),
and with OpenMPI 1.3.2 (the version where I see the most extreme
problems, i.e. things fail every run), this always works:

mpirun -np 6 -mca btl_sm_num_fifos 5 -mca btl sm,self ./diffusion-mpi

and other, larger values of num_fifos also seem to work reliably,
but with 4 or fewer

mpirun -np 6 -mca btl_sm_num_fifos 4 -mca btl sm,self ./diffusion-mpi

it always hangs, as before: after some number of iterations (sometimes
fewer, sometimes more), always somewhere in the MPI_Sendrecv:
(gdb) where
#0 0x00002b9b0a661e80 in opal_progress@plt () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi.so.0
#1 0x00002b9b0a67e345 in ompi_request_default_wait () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi.so.0
#2 0x00002b9b0a6a42c0 in PMPI_Sendrecv () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi.so.0
#3 0x00002b9b0a43c540 in pmpi_sendrecv__ () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi_f77.so.0
#4 0x0000000000400eab in MAIN__ ()
#5 0x0000000000400fda in main (argc=1, argv=0x7fffb92cc078) at ../../../gcc-4.4.0/libgfortran/fmain.c:21
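
For context, the communication at the heart of the test program is just a
1D periodic halo exchange inside the iteration loop. The following is a
from-memory sketch (guardcell arrays reduced to single values, tags and
iteration count arbitrary), not the exact source I attached:

! Sketch (from memory) of the periodic exchange in diffusion-mpi;
! names and sizes are approximate, not the attached source.
program diffusion_sketch
  use mpi
  implicit none
  integer :: rank, nprocs, left, right, step, ierr
  integer :: status(MPI_STATUS_SIZE)
  double precision :: sendl, sendr, recvl, recvr

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! periodic neighbours: rank 0's left neighbour is rank nprocs-1
  left  = mod(rank - 1 + nprocs, nprocs)
  right = mod(rank + 1, nprocs)

  sendl = rank
  sendr = rank

  do step = 1, 100000
     ! pass guardcell data right and left every iteration
     call MPI_Sendrecv(sendr, 1, MPI_DOUBLE_PRECISION, right, 1, &
                       recvl, 1, MPI_DOUBLE_PRECISION, left,  1, &
                       MPI_COMM_WORLD, status, ierr)
     call MPI_Sendrecv(sendl, 1, MPI_DOUBLE_PRECISION, left,  2, &
                       recvr, 1, MPI_DOUBLE_PRECISION, right, 2, &
                       MPI_COMM_WORLD, status, ierr)
  end do

  call MPI_Finalize(ierr)
end program diffusion_sketch

The real code also updates the interior points between exchanges, but
the communication per iteration is essentially that pair of Sendrecvs.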

On the other hand, if I set the leftmost and rightmost neighbours to
MPI_PROC_NULL, as Jeff requested (sketched below), the behaviour changes;
any number of fifos greater than two works:

mpirun -np 6 -mca btl_sm_num_fifos 3 -mca btl sm,self ./diffusion-mpi
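
(For reference, the MPI_PROC_NULL change is just in the neighbour
assignment; roughly this, again sketched rather than the literal diff:)

! non-periodic variant: the ends of the chain talk to MPI_PROC_NULL,
! so those halves of each Sendrecv complete immediately
left  = rank - 1
right = rank + 1
if (rank == 0)          left  = MPI_PROC_NULL
if (rank == nprocs - 1) right = MPI_PROC_NULL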

But btl_sm_num_fifos 2 always hangs, either in the Sendrecv or in
the Finalize:

mpirun -np 6 -mca btl_sm_num_fifos 2 -mca btl sm,self ./diffusion-mpi

And the default always hangs, usually in the Finalize but sometimes in
the Sendrecv:

mpirun -np 6 -mca btl sm,self ./diffusion-mpi
(gdb) where
#0 0x00002ad54846d51f in poll () from /lib64/libc.so.6
#1 0x00002ad54717a7c1 in poll_dispatch () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libopen-pal.so.0
#2 0x00002ad547179659 in opal_event_base_loop () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libopen-pal.so.0
#3 0x00002ad54716e189 in opal_progress () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libopen-pal.so.0
#4 0x00002ad54931ef15 in barrier () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/openmpi/mca_grpcomm_bad.so
#5 0x00002ad546ca358b in ompi_mpi_finalize () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi.so.0
#6 0x00002ad546a5d529 in pmpi_finalize__ () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi_f77.so.0
#7 0x0000000000400f99 in MAIN__ ()

So to summarize:

OpenMPI 1.3.2 + gcc4.4.0

Test problem with periodic Sendrecv()s (left neighbour of proc 0 is proc N-1):
  Default: always hangs in Sendrecv after a random number of iterations
  Turning off sm (-mca btl self,tcp): not observed to hang
  Using -mca btl_sm_num_fifos 5 (for a 6-task job): not observed to hang
  Using fewer than 5 fifos: hangs in Sendrecv after a random number of iterations, or in Finalize

Test problem with non-periodic Sendrecv()s (left neighbour of proc 0 is MPI_PROC_NULL):
  Default: always hangs, in Sendrecv after a random number of iterations or at Finalize
  Turning off sm (-mca btl self,tcp): not observed to hang
  Using -mca btl_sm_num_fifos 5 (for a 6-task job): not observed to hang
  Using fewer than 5 fifos but more than 2: not observed to hang
  Using 2 fifos: hangs in Finalize, or in Sendrecv after a random number of iterations

OpenMPI 1.3.3 + gcc4.4.0

Test problem with periodic Sendrecv()s (left neighbour of proc 0 is proc N-1):
  Default: sometimes (~20% of the time) hangs in Sendrecv after a random number of iterations
  Turning off sm (-mca btl self,tcp): not observed to hang
  Using -mca btl_sm_num_fifos 5 (for a 6-task job): not observed to hang
  Using fewer than 5 fifos but more than 2: not observed to hang
  Using 2 fifos: sometimes (~20% of the time) hangs in Finalize or in Sendrecv after a random number of iterations, but sometimes completes

Test problem with non-periodic Sendrecv()s (left neighbour of proc 0 is MPI_PROC_NULL):
  Default: usually (~75% of the time) hangs, in Finalize or in Sendrecv after a random number of iterations
  Turning off sm (-mca btl self,tcp): not observed to hang
  Using -mca btl_sm_num_fifos 5 (for a 6-task job): not observed to hang
  Using fewer than 5 fifos but more than 2: not observed to hang
  Using 2 fifos: usually (~75% of the time) hangs in Finalize or in Sendrecv after a random number of iterations, but sometimes completes

OpenMPI 1.3.2 + Intel 11.0 compilers

We are seeing a problem that we believe to be related: ~1% of certain
single-node jobs hang; turning off sm or setting num_fifos to NP-1
eliminates this.

    - Jonathan

-- 
Jonathan Dursi <ljdursi_at_[hidden]>