Open MPI User's Mailing List Archives

Subject: [OMPI users] freezing in mpi_allreduce operation
From: Greg Fischer (greg.a.fischer_at_[hidden])
Date: 2011-09-08 16:17:53


I am seeing mpi_allreduce operations freeze execution of my code on some
moderately-sized problems. The freeze does not manifest itself in every
problem, and it occurs in a portion of the code that is repeated many
times. In the test case discussed below, the freeze appears in the 60th
iteration.

The current test case that I'm looking at is a 64-processor job. This
particular mpi_allreduce call is executed by all 64 processors, with each
communicator in the call containing a total of 4 processors. When I add
print statements before and after the offending line, I see that all 64
processors successfully make it to the mpi_allreduce call, but only 32
successfully exit. Stack traces on the other 32 yield something along the
lines of the trace listed at the bottom of this message. The call itself
looks like:

 call mpi_allreduce(MPI_IN_PLACE, &
      phim(0:(phim_size-1),1:im,1:jm,1:kmloc(coords(2)+1),grp), &
      phim_size*im*jm*kmloc(coords(2)+1),mpi_real,mpi_sum,ang_com,ierr)

These messages are sized to remain under the 32-bit integer size limitation
for the "count" parameter. The intent is to perform the allreduce operation
on a contiguous block of the array. Previously, I had been passing an
assumed-shape array (i.e. phim(:,:,:,:,grp)), but found some documentation
indicating that this was potentially dangerous. Making the change from
assumed-shape to explicit-shape arrays doesn't solve the problem. However, if
I declare an additional array and use separate send and receive buffers:

 call mpi_allreduce(phim_local,phim_global, &
      phim_size*im*jm*kmloc(coords(2)+1),mpi_real,mpi_sum,ang_com,ierr)
 phim(:,:,:,:,grp) = phim_global

Then the problem goes away, and everything works normally. Does anyone
have any insight as to what may be happening here? I'm using "include
'mpif.h'" rather than the f90 module; could that explain this?
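
The two assumptions stated above (that the section passed to mpi_allreduce
is a contiguous block, and that the count fits in a default 32-bit integer)
can be checked in isolation. The following is a minimal standalone sketch,
not taken from the code discussed here: it does not use MPI, all extents are
hypothetical placeholders, kloc stands in for kmloc(coords(2)+1), and the
is_contiguous intrinsic requires a Fortran 2008 compiler.

```fortran
program check_section
  use iso_fortran_env, only: int64
  implicit none

  ! Hypothetical extents; the real values are not given in this message.
  integer, parameter :: phim_size = 4, im = 8, jm = 8, km = 6, ngrp = 3
  ! kloc stands in for kmloc(coords(2)+1); grp is an arbitrary group index.
  integer, parameter :: kloc = 3, grp = 2

  real, allocatable :: phim(:,:,:,:,:)
  integer(int64) :: n

  allocate(phim(0:phim_size-1, im, jm, km, ngrp))
  phim = 0.0

  ! Compute the element count in 64-bit arithmetic so that a value too
  ! large for a default integer can be detected instead of overflowing.
  n = int(phim_size, int64) * im * jm * kloc
  print *, 'count = ', n
  print *, 'count fits in default integer: ', n <= huge(0)

  ! This section takes the full extent of the first three dimensions,
  ! a leading range of the fourth, and a single index of the fifth.
  print *, 'section is contiguous: ', &
       is_contiguous(phim(0:phim_size-1, 1:im, 1:jm, 1:kloc, grp))

  ! For comparison: narrowing an inner dimension leaves gaps in memory.
  print *, 'narrowed section is contiguous: ', &
       is_contiguous(phim(0:phim_size-1, 1:im-1, 1:jm, 1:kloc, grp))

  deallocate(phim)
end program check_section
```

This only tests the assumptions stated above for a given set of extents; it
says nothing about what happens inside mpi_allreduce itself.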

Thanks,
Greg

Stack trace(s) for thread: 1
-----------------
[0] (1 processes)
-----------------
main() at ?:?
  solver() at solver.f90:31
    solver_q_down() at solver_q_down.f90:52
      iter() at iter.f90:56
        mcalc() at mcalc.f90:38
          pmpi_allreduce__() at ?:?
            PMPI_Allreduce() at ?:?
              ompi_coll_tuned_allreduce_intra_dec_fixed() at ?:?
                ompi_coll_tuned_allreduce_intra_ring_segmented() at ?:?
                  ompi_coll_tuned_sendrecv_actual() at ?:?
                    ompi_request_default_wait_all() at ?:?
                      opal_progress() at ?:?
Stack trace(s) for thread: 2
-----------------
[0] (1 processes)
-----------------
start_thread() at ?:?
  btl_openib_async_thread() at ?:?
    poll() at ?:?
Stack trace(s) for thread: 3
-----------------
[0] (1 processes)
-----------------
start_thread() at ?:?
  service_thread_start() at ?:?
    select() at ?:?