Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Random hangs using btl sm with OpenMPI 1.3.2/1.3.3 + gcc4.4?
From: Eugene Loh (Eugene.Loh_at_[hidden])
Date: 2009-09-24 16:16:15


Jonathan Dursi wrote:

> So to summarize:
>
> OpenMPI 1.3.2 + gcc4.4.0
>
> Test problem with periodic (left neighbour of proc 0 is proc N-1)
> Sendrecv()s:
> Default always hangs in Sendrecv after random number of iterations
> Turning off sm (-mca btl self,tcp) not observed to hang
> Using -mca btl_sm_num_fifos 5 (for a 6 task job) not observed to hang
> Using fewer than 5 fifos hangs in Sendrecv after random number of
> iterations or Finalize
>
> OpenMPI 1.3.3 + gcc4.4.0
>
> Test problem with periodic (left neighbour of proc 0 is proc N-1)
> Sendrecv()s:
> Default sometimes (~20% of time) hangs in Sendrecv after random
> number of iterations
> Turning off sm (-mca btl self,tcp) not observed to hang
> Using -mca btl_sm_num_fifos 5 (for a 6 task job) not observed to hang
> Using fewer than 5 fifos but more than 2 not observed to hang
> Using 2 fifos sometimes (~20% of time) hangs in Finalize or
> Sendrecv after random number of iterations but sometimes completes
>
> OpenMPI 1.3.2 + intel 11.0 compilers
>
> We are seeing a problem which we believe to be related; ~1% of
> certain single-node jobs hang, turning off sm or setting num_fifos to
> NP-1 eliminates this.
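
For reference, the periodic neighbour pattern in the test problem can be sketched without MPI. This is only an illustration of the communication topology, not the actual test code: the function names ring_neighbours and exchange_right are made up for this sketch, and the list-based "exchange" merely stands in for what a real Sendrecv would do across processes.

```python
def ring_neighbours(rank, nprocs):
    """Return (left, right) neighbours of `rank` on a periodic ring.

    Hypothetical helper: the left neighbour of rank 0 is rank nprocs-1.
    """
    left = (rank - 1) % nprocs   # Python's % wraps -1 around to nprocs-1
    right = (rank + 1) % nprocs
    return left, right


def exchange_right(values):
    """Simulate one step where every rank sends its value to the right
    and receives from the left. values[r] is the value held by rank r."""
    n = len(values)
    return [values[ring_neighbours(r, n)[0]] for r in range(n)]


if __name__ == "__main__":
    for r in range(6):
        print(r, ring_neighbours(r, 6))
    print(exchange_right(list(range(6))))
```

With 6 tasks, each rank talks to exactly two peers, so very few messages are in flight at any time.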

I can reproduce this with just Barriers, which keeps the processes all
in sync. So, this has nothing to do with processes outrunning one
another (which wasn't likely in the first place given that you had
Sendrecv calls).

The problem is fickle. E.g., building OMPI with -g seems to make the
problem go away.

I did observe that the sm FIFO would fill up. That's weird since there
aren't ever a lot of in-flight messages. I tried adding a line of code
that would make a process pause if ever it tried to write to a FIFO that
seemed full. That pretty much made the problem go away. So, I guess
it's a memory coherency problem: the receiver clears the FIFO, but the
writer still thinks it's congested.
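
To make that guess concrete, here is a toy, single-threaded model of it. It is not the Open MPI sm BTL code and says nothing about how the real FIFO is implemented. The class ToyFifo, the FREE sentinel, and the `view` argument are all invented for illustration, with `view` standing in for a writer that is looking at an out-of-date picture of the slots.

```python
FREE = None  # hypothetical sentinel marking an empty slot


class ToyFifo:
    """Fixed-size circular queue; a slot is free when it holds FREE."""

    def __init__(self, size):
        self.slots = [FREE] * size
        self.head = 0  # next slot the writer fills
        self.tail = 0  # next slot the reader drains

    def write(self, item, view=None):
        # `view` models the writer's possibly stale picture of the slots.
        slots_seen = self.slots if view is None else view
        if slots_seen[self.head] is not FREE:
            return False  # writer believes the FIFO is full
        self.slots[self.head] = item
        self.head = (self.head + 1) % len(self.slots)
        return True

    def read(self):
        item = self.slots[self.tail]
        if item is FREE:
            return None  # nothing to read
        self.slots[self.tail] = FREE  # reader clears the slot
        self.tail = (self.tail + 1) % len(self.slots)
        return item


if __name__ == "__main__":
    fifo = ToyFifo(2)
    fifo.write("a")
    fifo.write("b")
    stale = list(fifo.slots)             # writer's snapshot while full
    fifo.read()                          # reader frees a slot
    print(fifo.write("c", view=stale))   # stale view: still looks full
    print(fifo.write("c"))               # fresh view: the write succeeds
```

In this model the writer keeps seeing a full FIFO for as long as it consults the stale snapshot, even though the reader has already freed a slot. Retrying against the current state succeeds, which would be consistent with a pause in the writer hiding the problem.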

I tried all sorts of GCC compilers. The problem seems to set in with
4.4.0. I don't know what's significant about that. GCC 4.4.0 requires
moving to the 2.18 assembler, but I tried the 2.18 assembler with 4.3.3
and that worked okay. I'd think this has to do with GCC 4.4.x, but you say
you see the problem with Intel compilers as well. Hmm. Maybe an OMPI
problem that's better exposed with GCC 4.4.x?