
Subject: [OMPI users] Looped Barrier/Sendrecv hangs with btl sm: OMPI 1.3.3, 1.3.2, gcc44, intel 11
From: Jonathan Dursi (ljdursi_at_[hidden])
Date: 2009-09-26 14:24:11


Hi, Eugene:

Thanks for your efforts in reproducing this problem; glad to know it's
not just us.

I think our solution for now is just to migrate our users to MVAPICH2
and Intel MPI; these MPICH-based implementations work extremely
reliably for us and our users, and it looks like OpenMPI just isn't
ready for real production use on our system.

        - Jonathan

On 2009-09-24, at 4:16PM, Eugene Loh wrote:

> Jonathan Dursi wrote:
>
>> So to summarize:
>>
>> OpenMPI 1.3.2 + gcc 4.4.0
>>
>> Test problem with periodic Sendrecv()s (left neighbour of proc 0 is
>> proc N-1; sketched below):
>> - Default: always hangs in Sendrecv after a random number of
>>   iterations.
>> - sm turned off (-mca btl self,tcp): not observed to hang.
>> - -mca btl_sm_num_fifos 5 (for a 6-task job): not observed to hang.
>> - Fewer than 5 fifos: hangs in Sendrecv after a random number of
>>   iterations, or in Finalize.
>>
>> OpenMPI 1.3.3 + gcc 4.4.0
>>
>> Same test problem:
>> - Default: sometimes (~20% of the time) hangs in Sendrecv after a
>>   random number of iterations.
>> - sm turned off (-mca btl self,tcp): not observed to hang.
>> - -mca btl_sm_num_fifos 5 (for a 6-task job): not observed to hang.
>> - Fewer than 5 but more than 2 fifos: not observed to hang.
>> - 2 fifos: sometimes (~20% of the time) hangs in Finalize or in
>>   Sendrecv after a random number of iterations, but sometimes
>>   completes.
>>
>> OpenMPI 1.3.2 + Intel 11.0 compilers
>>
>> - We are seeing a problem we believe to be related: ~1% of certain
>>   single-node jobs hang; turning off sm or setting num_fifos to
>>   NP-1 eliminates it.
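>>
>> The test loop is essentially the following (a sketch reconstructed
>> from the description above, not the exact test code; NITER, the
>> message tag, and the binary name are arbitrary placeholders):
>>
>>   #include <mpi.h>
>>
>>   #define NITER 100000  /* placeholder; hangs occur at random counts */
>>
>>   int main(int argc, char **argv)
>>   {
>>       int rank, size, left, right, i;
>>       int sbuf = 0, rbuf = 0;
>>
>>       MPI_Init(&argc, &argv);
>>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>       MPI_Comm_size(MPI_COMM_WORLD, &size);
>>
>>       /* periodic neighbours: left of rank 0 is rank size-1 */
>>       left  = (rank + size - 1) % size;
>>       right = (rank + 1) % size;
>>
>>       for (i = 0; i < NITER; i++) {
>>           /* send to the right neighbour, receive from the left */
>>           MPI_Sendrecv(&sbuf, 1, MPI_INT, right, 0,
>>                        &rbuf, 1, MPI_INT, left,  0,
>>                        MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>>       }
>>
>>       MPI_Finalize();
>>       return 0;
>>   }
>>
>> Run on a single node with, e.g., "mpirun -np 6 ./ring", adding the
>> -mca options above to vary the btl and fifo settings.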
>
> I can reproduce this with just Barriers, which keep the processes
> all in sync. So, this has nothing to do with processes outrunning
> one another (which wasn't likely in the first place given that you
> had Sendrecv calls).
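>
> In other words, just a loop of barriers (again only a sketch):
>
>     for (i = 0; i < NITER; i++)
>         MPI_Barrier(MPI_COMM_WORLD);
>
> in place of the Sendrecv above is enough to trigger the hang.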
>
> The problem is fickle. E.g., building OMPI with -g seems to make
> the problem go away.
>
> I did observe that the sm FIFO would fill up. That's weird since
> there aren't ever a lot of in-flight messages. I tried adding a
> line of code that would make a process pause if ever it tried to
> write to a FIFO that seemed full. That pretty much made the problem
> go away. So, I guess it's a memory-coherency problem: the receiver
> has cleared the FIFO, but the writer still thinks it's congested.
>
> I tried all sorts of GCC compilers. The problem seems to set in
> with 4.4.0. I don't know what's significant about that; 4.4.0
> requires moving to the 2.18 assembler, but I tried the 2.18
> assembler with 4.3.3 and that worked okay. I'd think this has to do
> with GCC 4.4.x, but you say you see the problem with Intel compilers
> as well. Hmm. Maybe an OMPI problem that's better exposed with GCC
> 4.4.x?

-- 
Jonathan Dursi <ljdursi_at_[hidden]>