
Subject: Re: [OMPI users] Random hangs using btl sm with OpenMPI 1.3.2/1.3.3 + gcc4.4?
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-09-22 20:52:46


Jonathan --

Sorry for the delay in replying; thanks for posting again.

I'm actually unable to replicate your problem. :-( I have a new
Intel 8-core X5570 box; I'm running at np=6 and np=8 on both Open MPI
1.3.2 and 1.3.3 and am not seeing the problem you're seeing. I even
made your sample program worse -- I made a and b 100,000-element
real arrays (increasing the count args in MPI_SENDRECV to 100,000 as
well), and increased nsteps to 150,000,000. No hangs. :-\

The version of the compiler *usually* isn't significant, so gcc 4.x
should be fine.

Yes, the sm flow control issue was a significant fix, but the blocking
nature of MPI_SENDRECV means that you shouldn't have run into the
problems that were fixed (the main issues had to do with fast senders
exhausting resources of slow receivers -- but MPI_SENDRECV is
synchronous so the senders should always be matching the speed of the
receivers).

Just for giggles, what happens if you change

       if (leftneighbour .eq. -1) then
          leftneighbour = nprocs-1
       endif
       if (rightneighbour .eq. nprocs) then
          rightneighbour = 0
       endif

to

       if (leftneighbour .eq. -1) then
          leftneighbour = MPI_PROC_NULL
       endif
       if (rightneighbour .eq. nprocs) then
          rightneighbour = MPI_PROC_NULL
       endif
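
For reference, here's a minimal standalone version of the kind of
ring exchange I'm assuming diffusion-mpi does, with the MPI_PROC_NULL
change applied (the buffer size, tag, and step count below are just
placeholders, not your actual values):

       program ring_sendrecv
         implicit none
         include 'mpif.h'
         integer, parameter :: n = 100000
         real :: a(n), b(n)
         integer :: ierr, rank, nprocs, leftneighbour, rightneighbour
         integer :: step, nsteps
         integer :: status(MPI_STATUS_SIZE)

         call MPI_INIT(ierr)
         call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
         call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

         leftneighbour  = rank - 1
         rightneighbour = rank + 1
         ! With MPI_PROC_NULL, the end ranks simply skip the
         ! wraparound exchange instead of closing the ring.
         if (leftneighbour .eq. -1) leftneighbour = MPI_PROC_NULL
         if (rightneighbour .eq. nprocs) rightneighbour = MPI_PROC_NULL

         a = real(rank)
         nsteps = 1000

         do step = 1, nsteps
            ! Send a to the right neighbour and receive into b from
            ! the left neighbour in a single blocking call.
            call MPI_SENDRECV(a, n, MPI_REAL, rightneighbour, 1, &
                              b, n, MPI_REAL, leftneighbour, 1, &
                              MPI_COMM_WORLD, status, ierr)
            ! (the real code would presumably update a from b here)
         end do

         call MPI_FINALIZE(ierr)
       end program ring_sendrecv

Sends to and receives from MPI_PROC_NULL complete immediately as
no-ops, so if the hang only happens on the wraparound messages, this
test would tell us something.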

On Sep 21, 2009, at 5:09 PM, Jonathan Dursi wrote:

> Continuing the conversation with myself:
>
> Google pointed me to Trac ticket #1944, which spoke of deadlocks in
> looped collective operations; there is no collective operation
> anywhere in this sample code, but one of the suggested workarounds,
> setting btl_sm_num_fifos to at least (np-1), seems to make things
> work quite reliably for both OpenMPI 1.3.2 and 1.3.3. That is, while
> this
>
> mpirun -np 6 -mca btl sm,self ./diffusion-mpi
>
> invariably hangs (at random-seeming numbers of iterations) with
> OpenMPI 1.3.2 and sometimes hangs (maybe 10% of the time, again
> seemingly randomly) with 1.3.3,
>
> mpirun -np 6 -mca btl tcp,self ./diffusion-mpi
>
> or
>
> mpirun -np 6 -mca btl_sm_num_fifos 5 -mca btl sm,self ./diffusion-mpi
>
> always succeeds, with (as one might guess) the second being much
> faster...
>
> Jonathan
>
> --
> Jonathan Dursi <ljdursi_at_[hidden]>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
jsquyres_at_[hidden]