On Wed, Apr 13, 2011 at 2:49 PM, Barrett, Brian W <bwbarre@sandia.gov> wrote:
This is mostly an issue of how MPICH2 and Open MPI implement lock/unlock.
Some might call what I'm about to describe erroneous.  I wrote the
one-sided code in Open MPI and may be among those people.

In both implementations, one-sided communication is not necessarily truly
asynchronous.  That is, the target of an operation may have to enter the
MPI library (MPI_Wtime does not count as entering the library in this
case) to progress Lock/Unlock calls.  So rank 2 calls lock (which is a
no-op in both implementations), calls put, calls unlock, and waits for a
response.  Ranks 0 and 1 wait for a second and enter lock, get, and
unlock.  At this point, data actually starts to move.  Chances are, rank 0
is going to process its request first, hence the get from rank 0
returning 0.  Then rank 0 will perhaps process some other requests before
it leaves unlock (perhaps not), and enter barrier.  At this point, it will
progress everything until the other ranks enter barrier, meaning rank 2's
put and rank 1's and rank 2's gets will finally be processed.


Brian,

Ok, that helps explain what's going on.

I understand the difficulty of implementing truly asynchronous RMA, especially
when the remote process does not yield to the progress engine periodically.
Although the standard is incomplete and ambiguous on the details of passive
synchronization (ordering of RMA calls, behavior of Lock/Unlock), it gives the
impression that only the origin process is explicitly involved in the transfer, and
that passive target communication can therefore be used to emulate a shared memory
model via MPI calls.

But given the existing behavior, all bets are off and it renders passive synchronization
(MPI_Win_unlock) mostly similar to active synchronization (MPI_Win_fence).
In trying to emulate a distributed shared memory model, I was hoping to do things
like:

MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
MPI_Get(&out, 1, MPI_INT, 0, 0, 1, MPI_INT, win);
if (out % 2 == 0)
     out++;
MPI_Accumulate(&out, 1, MPI_INT, 0, 0, 1, MPI_INT, MPI_REPLACE, win);
MPI_Win_unlock(0, win);

but it is impossible to implement such atomic sections given no semantic guarantees
on ordering of the RMA calls.

Thanks,
Abhishek
 
In case you're wondering, the specification wasn't disobeyed in the
communication order; the lock description is very loose and is relative to
other MPI events.  So if you put the barrier before the lock/get/unlock,
you'd get the answer you wanted because rank 2's lock would have to occur
before rank 0's.  With no other MPI synchronization, there's no
requirement that be true, and the locking order could be 0, 1, 2, 2 if it
really wanted to be (i.e., it would be perfectly legal for rank 1 to also
return 0).

This is obviously not ideal, and it is one of the areas of focus for the MPI-3
standardization effort.  In Open MPI, adding true asynchronous behavior is
difficult.  The original design assumed that the lowest communication
layers would be able to provide asynchronous completion events to progress
the one-sided implementation.  Thus far, only the authors of the TCP stack
have provided such behavior and it's not as well tested as other modes of
operation.

Brian

On 4/13/11 12:31 PM, "Abhishek Kulkarni" <abbyzcool@gmail.com> wrote:

>Hello,
>
>I am trying to better understand the semantics of passive synchronization
>in one-sided communication calls. Doesn't MPI_Win_unlock()
>block to ensure that all the preceding RMA calls in that epoch have been
>synced?
>
>In that case, why is an undefined value returned when trying to load from
>a local window? (see below)
>
>    MPI_Alloc_mem(128, MPI_INFO_NULL, &ptr);
>    MPI_Win_create(ptr, 128, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);
>
>    // write to the target window of the head node
>    if (rank == (size - 1)) {
>        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
>        in = 10;
>        MPI_Put(&in, 1, MPI_INT, 0, 0, 1, MPI_INT, win);
>
>        MPI_Win_unlock(0, win);
>    } else {
>        // busy wait
>        start = MPI_Wtime();
>        end = MPI_Wtime();
>        while ((end - start) < 1)
>            end = MPI_Wtime();
>    }
>
>    // read from the head node's window
>    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
>    MPI_Get(&out, 1, MPI_INT, 0, 0, 1, MPI_INT, win);
>    MPI_Win_unlock(0, win);
>
>    MPI_Barrier(MPI_COMM_WORLD);
>
>    printf("R%d: %d\n", rank, out);
>
>The output of the above program with 1.5.3rc1 (and also with MPICH2
>1.4rc2) is:
>R2: 10
>R1: 10
>R0: 0
>
>whereas I expect to see:
>R2: 10
>R1: 10
>R0: 10
>
>Thanks,
>Abhishek
>
>_______________________________________________
>users mailing list
>users@open-mpi.org
>http://www.open-mpi.org/mailman/listinfo.cgi/users


--
 Brian W. Barrett
 Dept. 1423: Scalable System Software
 Sandia National Laboratories





