Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] freezing in mpi_allreduce operation
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2011-09-24 09:35:46


Holy criminy, I'm totally lost in your Fortran syntax. :-)

What you describe might be a bug in our MPI_IN_PLACE handling for MPI_ALLREDUCE.

Could you possibly make a small test case that a) we can run, and b) uses straightforward Fortran? (Avoid using terms like "assumed shape" and "assumed size" and any other Fortran stuff that confuses simple C programmers like us. :-) )
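
For reference, a test case of that kind might look roughly like the sketch below. It is not taken from Greg's code. The buffer length `n`, the iteration count `niter`, and the grouping of ranks into communicators of 4 via `mpi_comm_split` are all assumptions chosen to resemble the description in the quoted message (64 processes, 4 per communicator, a large contiguous real buffer reduced in place many times). It uses `include 'mpif.h'`, as the original code does.

```fortran
program allreduce_in_place
  implicit none
  include 'mpif.h'

  ! Hypothetical sizes: the original post does not state phim_size, im, jm,
  ! or kmloc, so n and niter are guesses meant only to give a large message
  ! and many repetitions.
  integer, parameter :: n = 1000000
  integer, parameter :: niter = 100
  integer :: ierr, rank, nprocs, color, sub_comm, sub_size, iter
  real, allocatable :: buf(:)

  call mpi_init(ierr)
  call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
  call mpi_comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! Assumption: groups of 4 consecutive ranks stand in for ang_com.
  color = rank / 4
  call mpi_comm_split(MPI_COMM_WORLD, color, rank, sub_comm, ierr)
  call mpi_comm_size(sub_comm, sub_size, ierr)

  allocate(buf(n))

  do iter = 1, niter
     buf = 1.0
     ! In-place reduction on a plain contiguous array: no array sections.
     call mpi_allreduce(MPI_IN_PLACE, buf, n, MPI_REAL, MPI_SUM, sub_comm, ierr)
     ! Every rank contributed 1.0, so each element should equal sub_size.
     if (abs(buf(1) - real(sub_size)) > 0.5) then
        print *, 'rank', rank, 'iteration', iter, 'unexpected sum', buf(1)
     end if
  end do

  if (rank == 0) print *, 'completed', niter, 'iterations'

  deallocate(buf)
  call mpi_comm_free(sub_comm, ierr)
  call mpi_finalize(ierr)
end program allreduce_in_place
```

Something like `mpif90 repro.f90 -o repro` followed by `mpirun -np 64 ./repro` would build and run it (the file name is arbitrary). If the run hangs before rank 0 prints its completion line, that would point at the in-place path rather than at the array-section arguments.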

What version of Open MPI is this?

On Sep 8, 2011, at 5:59 PM, Greg Fischer wrote:

> Note also that coding the mpi_allreduce as:
>
> call mpi_allreduce(MPI_IN_PLACE,phim(0,1,1,1,grp),phim_size*im*jm*kmloc(coords(2)+1),mpi_real,mpi_sum,ang_com,ierr)
>
> results in the same freezing behavior in the 60th iteration. (I don't recall why the arrays were being passed, possibly just a mistake.)
>
>
> On Thu, Sep 8, 2011 at 4:17 PM, Greg Fischer <greg.a.fischer_at_[hidden]> wrote:
> I am seeing mpi_allreduce operations freeze execution of my code on some moderately-sized problems. The freeze does not manifest itself in every problem, and it occurs in a portion of the code that is repeated many times. In the problem discussed below, the freeze appears in the 60th iteration.
>
> The current test case that I'm looking at is a 64-processor job. This particular mpi_allreduce call applies to all 64 processors, with each communicator in the call containing a total of 4 processors. When I add print statements before and after the offending line, I see that all 64 processors successfully make it to the mpi_allreduce call, but only 32 successfully exit. Stack traces on the other 32 yield something along the lines of the trace listed at the bottom of this message. The call itself looks like:
>
> call mpi_allreduce(MPI_IN_PLACE, phim(0:(phim_size-1),1:im,1:jm,1:kmloc(coords(2)+1),grp), &
> phim_size*im*jm*kmloc(coords(2)+1),mpi_real,mpi_sum,ang_com,ierr)
>
> These messages are sized to remain under the 32-bit integer size limitation for the "count" parameter. The intent is to perform the allreduce operation on a contiguous block of the array. Previously, I had been passing an assumed-shape array (i.e. phim(:,:,:,:,grp)), but found some documentation indicating that was potentially dangerous. Making the change from assumed- to explicit-shaped arrays doesn't solve the problem. However, if I declare an additional array and use separate send and receive buffers:
>
> call mpi_allreduce(phim_local,phim_global,phim_size*im*jm*kmloc(coords(2)+1),mpi_real,mpi_sum,ang_com,ierr)
> phim(:,:,:,:,grp) = phim_global
>
> then the problem goes away, and everything works normally. Does anyone have any insight as to what may be happening here? I'm using "include 'mpif.h'" rather than the f90 module. Does that potentially explain this?
>
> Thanks,
> Greg
>
> Stack trace(s) for thread: 1
> -----------------
> [0] (1 processes)
> -----------------
> main() at ?:?
> solver() at solver.f90:31
> solver_q_down() at solver_q_down.f90:52
> iter() at iter.f90:56
> mcalc() at mcalc.f90:38
> pmpi_allreduce__() at ?:?
> PMPI_Allreduce() at ?:?
> ompi_coll_tuned_allreduce_intra_dec_fixed() at ?:?
> ompi_coll_tuned_allreduce_intra_ring_segmented() at ?:?
> ompi_coll_tuned_sendrecv_actual() at ?:?
> ompi_request_default_wait_all() at ?:?
> opal_progress() at ?:?
> Stack trace(s) for thread: 2
> -----------------
> [0] (1 processes)
> -----------------
> start_thread() at ?:?
> btl_openib_async_thread() at ?:?
> poll() at ?:?
> Stack trace(s) for thread: 3
> -----------------
> [0] (1 processes)
> -----------------
> start_thread() at ?:?
> service_thread_start() at ?:?
> select() at ?:?
>

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/