Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] How to reduce Isend & Irecv bandwidth?
From: Thomas Watson (exascale.system_at_[hidden])
Date: 2013-05-01 17:14:06

Hi Gus,

Thanks for your suggestion!

The problem with this two-phase data exchange is as follows. Each rank may
hold data blocks destined for potentially any other rank, so for every rank
to learn which blocks it will receive, phase one would require an
all-to-all collective communication (e.g., MPI_Allgatherv). Because such
collectives are blocking in the current stable Open MPI (MPI-2), this would
hurt the scalability of the application, especially with a large number of
MPI ranks, and the bandwidth saved would not compensate for that cost :-)

What I really need is something like this: the sender sets the Isend count
to 0 if a block is not dirty. On the receiving side, the zero-count message
still matches the posted Irecv, so MPI_Waitall completes the corresponding
request immediately (with an empty payload) and sets the request handle to
MPI_REQUEST_NULL, just as for a normal Irecv. Could someone confirm this
behavior? I could run an experiment on this too...



On Wed, May 1, 2013 at 3:46 PM, Gus Correa <gus_at_[hidden]> wrote:

> Maybe start the data exchange by sending a (presumably short)
> list/array/index-function of the dirty/not-dirty blocks status
> (say, 0=not-dirty,1=dirty),
> then putting if conditionals before the Isend/Irecv so that only
> dirty blocks are exchanged?
> I hope this helps,
> Gus Correa
> On 05/01/2013 01:28 PM, Thomas Watson wrote:
>> Hi,
>> I have a program where each MPI rank hosts a set of data blocks. After
>> doing computation over *some of* its local data blocks, each MPI rank
>> needs to exchange data with other ranks. Note that the computation may
>> involve only a subset of the data blocks on a MPI rank. The data
>> exchange is achieved at each MPI rank through Isend and Irecv and then
>> Waitall to complete the requests. Each pair of Isend and Irecv exchanges
>> a corresponding pair of data blocks at different ranks. Right now, we do
>> Isend/Irecv for EVERY block!
>> The idea is that because the computation at a rank may involve only a
>> subset of blocks, we could mark those blocks as dirty during the
>> computation, and to reduce data-exchange bandwidth, we could exchange
>> only those *dirty* pairs across ranks.
>> The problem is: if a rank does not compute on a block 'm', and hence
>> does not call Isend for 'm', then the receiving rank must somehow know
>> this and either a) not call Irecv for 'm' either, or b) let the Irecv
>> for 'm' fail gracefully.
>> My questions are:
>> 1. How will Irecv behave (actually, how will MPI_Waitall behave) if
>> the corresponding Isend is missing?
>> 2. If we still post Isend for 'm', but really have no data to send
>> for 'm', can I just set a "flag" in the Isend so that MPI_Waitall on
>> the receiving side will "cancel" the corresponding Irecv immediately?
>> For example, I could set the count in the Isend to 0, and on the
>> receiving side, when MPI_Waitall sees a message with an empty payload,
>> it reclaims the corresponding Irecv. In my code, the correspondence
>> between a pair of Isend and Irecv is established by a matching TAG.
>> Thanks!
>> Jacky
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]