Hi,

 

It looks like the rdma OSC component does not progress passive-target RMA operations at the target during calls to MPI_WIN_(UN)LOCK. As a sample case, take a master-worker program in which each worker writes to its own entry in an array exposed in the master's window:

 

MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

if (rank == 0)
{
   // Master code
   MPI_Alloc_mem(size * sizeof(int), MPI_INFO_NULL, &array);
   // MPI_Alloc_mem does not zero the memory, so clear it before exposing it
   memset(array, 0, size * sizeof(int));
   MPI_Win_create(array, size * sizeof(int), sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &win);
   do
   {
      // Lock the local window and count the workers that have written so far
      MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
      nonzeros = 0;
      for (i = 0; i < size; i++)
         if (array[i] != 0) nonzeros++;
      MPI_Win_unlock(0, win);
   } while (nonzeros < size-1);
   MPI_Win_free(&win);
   MPI_Free_mem(array);
}
else
{
   // Worker code
   int one = 1;
   MPI_Win_create(NULL, 0, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);
   // Postpone the RMA with a rank-specific time
   sleep(rank);
   MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
   // Write 1 into element [rank] of the master's array
   MPI_Put(&one, 1, MPI_INT, 0, rank, 1, MPI_INT, win);
   MPI_Win_unlock(0, win);
   MPI_Win_free(&win);
}

 

Attached is a complete sample program. The program hangs when run with the default MCA settings:

 

$ mpirun -n 3 ./rma.x
[1379003818.571960] 0 workers checked in
[1379003819.571317] Worker 1 acquired lock
[1379003819.571374] Worker 1 unlocking the window
[1379003820.571342] Worker 2 acquired lock
[1379003820.571384] Worker 2 unlocking the window
<hangs>

On the other hand, it works as expected if the pt2pt OSC component is forced:

 

$ mpirun --mca osc pt2pt -n 3 ./rma.x | sort
[1379003926.000442] 0 workers checked in
[1379003926.998981] Worker 1 acquired lock
[1379003926.998027] Worker 1 unlocking the window
[1379003926.999076] Worker 1 synched
[1379003926.999078] 1 workers checked in
[1379003927.998917] Worker 2 acquired lock
[1379003927.998940] Worker 2 unlocking the window
[1379003927.998962] Worker 2 synched
[1379003927.998964] 2 workers checked in
[1379003927.998973] All workers checked in
[1379003927.998996] Worker 1 done
[1379003927.998996] Worker 2 done
[1379003927.999099] Master finished
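
Each output line carries a wall-clock timestamp prefix, which is why piping through sort puts the interleaved output of the ranks back into chronological order. A minimal sketch of how such a prefix could be produced is given below. The helper names format_stamped and log_stamped are hypothetical and chosen for illustration only; they are not necessarily what the attached program uses.

```c
#include <stdio.h>
#include <sys/time.h>

/* Hypothetical helper: write "[seconds.microseconds] message" into buf.
   Microseconds are zero-padded to six digits so that a plain lexical
   sort of the lines matches their chronological order. */
int format_stamped(char *buf, size_t len, long sec, long usec, const char *msg)
{
    return snprintf(buf, len, "[%ld.%06ld] %s", sec, usec, msg);
}

/* Hypothetical helper: print msg prefixed with the current wall-clock time. */
void log_stamped(const char *msg)
{
    struct timeval tv;
    char line[256];

    gettimeofday(&tv, NULL);
    format_stamped(line, sizeof line, (long)tv.tv_sec, (long)tv.tv_usec, msg);
    puts(line);
    fflush(stdout);   /* flush so that mpirun forwards the line promptly */
}
```

The sorted order is only as reliable as the clocks involved: with all ranks on one host they share a clock, whereas on separate nodes clock skew could reorder lines.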

 

All processes are started on the same host. Open MPI is version 1.6.4, built without a progress thread. The output from ompi_info is attached. The same behaviour (hang with rdma, success with pt2pt) is observed when the tcp BTL is used, and also when all processes run on separate cluster nodes and communicate via the openib BTL.

 

Is this a bug in the rdma OSC component or does the sample program violate the MPI correctness requirements for RMA operations?

 

Kind regards,

Hristo

 

--

Hristo Iliev, PhD – High Performance Computing Team

RWTH Aachen University, Center for Computing and Communication

Rechen- und Kommunikationszentrum der RWTH Aachen

Seffenter Weg 23, D 52074 Aachen (Germany)