Open MPI User's Mailing List Archives

Subject: [OMPI users] Looped Barrier/Sendrecv hangs with btl sm: OMPI 1.3.3, 1.3.2, gcc44, intel 11
From: Jonathan Dursi (ljdursi_at_[hidden])
Date: 2009-09-26 14:24:11


Hi, Eugene:

Thanks for your efforts in reproducing this problem; glad to know it's
not just us.

I think our solution for now is just to migrate our users to MVAPICH2
and Intel MPI; those MPICH-based implementations work extremely reliably
for us and our users, and it looks like Open MPI just isn't ready for
real production use on our system.

        - Jonathan

On 2009-09-24, at 4:16PM, Eugene Loh wrote:

> Jonathan Dursi wrote:
>
>> So to summarize:
>>
>> OpenMPI 1.3.2 + gcc 4.4.0
>>
>> Test problem with periodic Sendrecv()s (left neighbour of proc 0 is
>> proc N-1):
>>   - Default: always hangs in Sendrecv after a random number of
>>     iterations
>>   - Turning off sm (-mca btl self,tcp): not observed to hang
>>   - -mca btl_sm_num_fifos 5 (for a 6-task job): not observed to hang
>>   - Fewer than 5 fifos: hangs in Sendrecv (after a random number of
>>     iterations) or in Finalize
>>
>> OpenMPI 1.3.3 + gcc 4.4.0
>>
>> Same test problem with periodic Sendrecv()s:
>>   - Default: sometimes (~20% of the time) hangs in Sendrecv after a
>>     random number of iterations
>>   - Turning off sm (-mca btl self,tcp): not observed to hang
>>   - -mca btl_sm_num_fifos 5 (for a 6-task job): not observed to hang
>>   - Fewer than 5 but more than 2 fifos: not observed to hang
>>   - 2 fifos: sometimes (~20% of the time) hangs in Finalize or in
>>     Sendrecv after a random number of iterations, but sometimes
>>     completes
>>
>> OpenMPI 1.3.2 + Intel 11.0 compilers
>>
>> We are seeing a problem which we believe to be related; ~1% of
>> certain single-node jobs hang, and turning off sm or setting
>> num_fifos to NP-1 eliminates this.
>
> I can reproduce this with just Barriers, which keep the processes
> all in sync. So this has nothing to do with processes outrunning
> one another (which wasn't likely in the first place, given that you
> had Sendrecv calls).
>
> The problem is fickle. E.g., building OMPI with -g seems to make
> the problem go away.
>
> I did observe that the sm FIFO would fill up. That's weird, since
> there aren't ever a lot of in-flight messages. I tried adding a
> line of code that would make a process pause if it ever tried to
> write to a FIFO that seemed full. That pretty much made the problem
> go away. So I guess it's a memory coherency problem: the receiver
> clears the FIFO, but the writer thinks it's still congested.
>
> I tried all sorts of GCC compilers. The problem seems to set in
> with 4.4.0; I don't know what's significant about that. GCC 4.4.0
> requires moving to the 2.18 assembler, but I tried the 2.18
> assembler with 4.3.3 and that worked okay. I'd think this has to do
> with GCC 4.4.x, but you say you see the problem with the Intel
> compilers as well. Hmm. Maybe it's an OMPI problem that's better
> exposed by GCC 4.4.x?
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
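
For reference, the test described in the summary above is essentially a
tight loop of periodic ring exchanges. A minimal sketch of that pattern
(the buffer names, message size, iteration count, and the ./ring program
name are illustrative, not the original test code):

    /* Each rank repeatedly exchanges one double with its neighbours on a
     * periodic ring: the left neighbour of rank 0 is rank N-1. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, left, right, i;
        double sendbuf, recvbuf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        left  = (rank - 1 + size) % size;   /* periodic wrap-around */
        right = (rank + 1) % size;

        for (i = 0; i < 100000; i++) {
            sendbuf = (double)i;
            MPI_Sendrecv(&sendbuf, 1, MPI_DOUBLE, right, 0,
                         &recvbuf, 1, MPI_DOUBLE, left,  0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        if (rank == 0) printf("completed %d iterations\n", i);
        MPI_Finalize();
        return 0;
    }

The workarounds mentioned in the summary correspond to invocations along
these lines (again, ./ring is a placeholder):

    mpirun -np 6 ./ring                            # default: sm btl in use, may hang
    mpirun -np 6 -mca btl self,tcp ./ring          # sm disabled: no hangs observed
    mpirun -np 6 -mca btl_sm_num_fifos 5 ./ring    # 5 fifos (NP-1 for 6 tasks): no hangs observed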
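
Eugene's barrier-only reproduction amounts to a loop of nothing but
collective calls; something along these lines (iteration count arbitrary)
is reportedly enough to trigger the same hang over the sm btl, which rules
out one rank outrunning another:

    /* Repeated barriers keep all ranks in lockstep, yet the hang over the
     * sm btl still appears. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int i;
        MPI_Init(&argc, &argv);
        for (i = 0; i < 1000000; i++)
            MPI_Barrier(MPI_COMM_WORLD);
        MPI_Finalize();
        return 0;
    }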

-- 
Jonathan Dursi <ljdursi_at_[hidden]>