Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] MPI piggyback mechanism
From: Aurélien Bouteiller (bouteill_at_[hidden])
Date: 2008-02-05 09:22:51


Oleg,

Is there an implementation of your techniques in Open MPI? Can we get
our greedy, nasty paws on it?

Thanks for the link, Josh.

Aurelien

On Feb 5, 2008, at 08:39, Josh Hursey wrote:

> Oleg,
>
> Interesting work. You mentioned late in your email that you believe
> that adding support for piggybacking to the MPI standard would be the
> best solution. As you may know, the MPI Forum has reconvened and there
> is a working group for Fault Tolerance. This working group is
> discussing a piggybacking interface proposal for the standard, amongst
> other things. If you are interested in contributing to this
> conversation you can find the mailing list here:
> http://lists.cs.uiuc.edu/mailman/listinfo/mpi3-ft
>
> Best,
> Josh
>
> On Feb 5, 2008, at 4:58 AM, Oleg Morajko wrote:
>
>> Hi,
>>
>> I've been working on MPI piggyback technique as a part of my PhD
>> work.
>>
>> Although MPI does not provide native support for it, there are several
>> different solutions to transmit piggyback data along with every MPI
>> communication. You may find a brief overview in papers [1, 2]. They
>> include copying the original message and the extra data into a bigger
>> buffer, sending an additional message, or changing the sendtype to a
>> dynamically created wrapper datatype that contains a pointer to the
>> original data and the piggyback data. I have tried all of these
>> mechanisms and they work, but considering the overhead, there is no
>> single best technique that outperforms the others in all scenarios.
>> Jeff Squyres had interesting comments on this subject before (in this
>> mailing list).
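>>
>> As an illustration of the first technique above (copying the original
>> message and the extra data into a bigger buffer), here is a minimal
>> sketch using MPI_Pack; the function name, the single extra integer and
>> the error handling are illustrative assumptions, not code from any
>> existing library:
>>
>>   /* Sketch: pack the user payload plus one piggyback integer into a
>>      larger buffer and send it as MPI_PACKED. */
>>   #include <mpi.h>
>>   #include <stdlib.h>
>>
>>   int piggyback_send_packed(const void *buf, int count,
>>                             MPI_Datatype dtype, int dest, int tag,
>>                             MPI_Comm comm, int extra)
>>   {
>>       int data_sz, extra_sz, pos = 0, rc;
>>       char *packed;
>>
>>       MPI_Pack_size(count, dtype, comm, &data_sz);
>>       MPI_Pack_size(1, MPI_INT, comm, &extra_sz);
>>
>>       packed = malloc(data_sz + extra_sz);
>>       if (packed == NULL) return MPI_ERR_NO_MEM;
>>
>>       /* The copy of the original message is exactly the overhead the
>>          other techniques try to avoid. */
>>       MPI_Pack((void *) buf, count, dtype, packed,
>>                data_sz + extra_sz, &pos, comm);
>>       MPI_Pack(&extra, 1, MPI_INT, packed,
>>                data_sz + extra_sz, &pos, comm);
>>
>>       rc = MPI_Send(packed, pos, MPI_PACKED, dest, tag, comm);
>>       free(packed);
>>       return rc;
>>   }
>>
>> The receiver would receive the message as MPI_PACKED (probing for its
>> size first) and MPI_Unpack the payload and the extra integer in the
>> same order.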
>>
>> Finally, after some benchmarking, I have implemented a hybrid technique
>> that combines the existing mechanisms. For small point-to-point
>> messages, datatype wrapping seems to be the least intrusive, at least
>> with Open MPI's implementation of derived datatypes. For large
>> point-to-point messages, experiments confirmed that sending an
>> additional message is much cheaper than wrapping (and besides, the
>> relative intrusion is small, as we are already sending a large
>> message). Moreover, the implementation may interleave the original
>> send with an asynchronous send of the piggyback data. This
>> optimization partially hides the latency of the additional send and
>> lowers the overall intrusion. The same criteria can be applied to
>> collective operations, except for barrier and reduce: as the former
>> does not transmit any data and the latter transforms the data, the
>> only solution there is to send additional messages.
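>>
>> To make that hybrid choice concrete, here is a minimal send-side
>> sketch; the threshold value, the reserved tag and the function name
>> are assumptions made for illustration, not the actual implementation
>> described above:
>>
>>   /* Hybrid piggyback send: wrap small messages in a derived datatype
>>      built with absolute addresses; ship the extra data of large
>>      messages in a separate, asynchronous message. */
>>   #include <mpi.h>
>>
>>   #define PIGGYBACK_TAG   32767   /* assumed unused by the application */
>>   #define WRAP_THRESHOLD  4096    /* bytes; a tuning value, not fixed  */
>>
>>   int piggyback_send(const void *buf, int count, MPI_Datatype dtype,
>>                      int dest, int tag, MPI_Comm comm,
>>                      int *extra, int extra_count)
>>   {
>>       int size, rc;
>>       MPI_Type_size(dtype, &size);
>>
>>       if ((long) size * count <= WRAP_THRESHOLD) {
>>           /* Small message: struct datatype with absolute addresses,
>>              so neither the payload nor the extra data is copied. */
>>           MPI_Aint     addrs[2];
>>           int          lens[2]  = { count, extra_count };
>>           MPI_Datatype types[2] = { dtype, MPI_INT };
>>           MPI_Datatype wrapped;
>>
>>           MPI_Get_address((void *) buf, &addrs[0]);
>>           MPI_Get_address(extra, &addrs[1]);
>>           MPI_Type_create_struct(2, lens, addrs, types, &wrapped);
>>           MPI_Type_commit(&wrapped);
>>           rc = MPI_Send(MPI_BOTTOM, 1, wrapped, dest, tag, comm);
>>           MPI_Type_free(&wrapped);
>>           return rc;
>>       }
>>
>>       /* Large message: the asynchronous extra send overlaps with the
>>          dominant cost of the original send. */
>>       MPI_Request req;
>>       MPI_Isend(extra, extra_count, MPI_INT, dest, PIGGYBACK_TAG,
>>                 comm, &req);
>>       rc = MPI_Send((void *) buf, count, dtype, dest, tag, comm);
>>       MPI_Wait(&req, MPI_STATUS_IGNORE);
>>       return rc;
>>   }
>>
>> The receiver has to make the matching choice, either posting a receive
>> on the equivalent wrapped datatype or a second receive on
>> PIGGYBACK_TAG, for example based on the same size threshold.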
>>
>> There is a penalty, of course. Especially for collective operations
>> with very small messages, the intrusion may reach 15%, and that's a
>> lot. It then decreases down to 0.1% for bigger messages, but it is
>> still there. I don't know what your requirements/expectations are on
>> that issue. The only work that reported lower overheads is [3], but
>> they added native piggyback support by changing the underlying MPI
>> implementation.
>>
>> I think the best possible option is to add piggyback support to MPI as
>> part of the standard. A growing number of runtime tools use this
>> functionality for multiple reasons, and PMPI itself is certainly not
>> enough.
>> References of interest:
>>
>> - [1] Shende, S., Malony, A., Morris, A., Wolf, F., "Performance
>>   Profiling Overhead Compensation for MPI Programs", 12th EuroPVM-MPI
>>   Conference, LNCS, vol. 3666, pp. 359-367, 2005. They review various
>>   techniques and come up with datatype wrapping.
>>
>> - [2] Schulz, M., "Extracting Critical Path Graphs from MPI
>>   Applications", Cluster Computing 2005, IEEE International, pp. 1-10,
>>   September 2005. They use datatype wrapping.
>>
>> - [3] Vetter, J., "Dynamic Statistical Profiling of Communication
>>   Activity in Distributed Applications". They add support for
>>   piggyback at the MPI implementation level and report very low
>>   overheads (no surprise).
>>
>> Regards,
>> Oleg Morajko
>>
>>
>> On Feb 1, 2008 5:08 PM, Aurélien Bouteiller <bouteill_at_[hidden]>
>> wrote:
>>
>>> I don't know of any work in that direction for now. Indeed, we plan
>>> to eventually integrate at least causal message logging in the pml-v,
>>> which also includes piggybacking. Therefore we are open to
>>> collaborating with you on this matter. Please let us know :)
>>>
>>> Aurelien
>>>
>>>
>>>
>>> On Feb 1, 2008, at 09:51, Thomas Ropars wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm currently working on optimistic message logging and I would like
>>>> to implement an optimistic message logging protocol in Open MPI.
>>>> Optimistic message logging protocols piggyback information about
>>>> dependencies between processes on the application messages, in order
>>>> to be able to find a consistent global state after a failure. That's
>>>> why I'm interested in the problem of piggybacking information on MPI
>>>> messages.
>>>>
>>>> Is there any work on this problem at the moment?
>>>> Has anyone already implemented a mechanism in Open MPI to piggyback
>>>> data on MPI messages?
>>>>
>>>> Regards,
>>>>
>>>> Thomas
>>>>
>>>> Oleg Morajko wrote:
>>>>> Hi,
>>>>>
>>>>> I'm developing a causality chain tracking library and need a
>>>>> mechanism to attach extra data to every MPI message, a so-called
>>>>> piggyback mechanism.
>>>>>
>>>>> As far as I know there are a few solutions to this problem, of which
>>>>> the two fundamental ones are the following:
>>>>>
>>>>> * Dynamic datatype wrapping - if a user calls MPI_Send with, let's
>>>>> say, 1024 doubles, the wrapped send implementation dynamically
>>>>> creates a derived datatype: a structure composed of a pointer to
>>>>> the 1024 doubles and the extra fields to be piggybacked. The
>>>>> datatype is constructed with absolute addresses to avoid copying
>>>>> the original buffer. The receiver side creates an equivalent
>>>>> datatype to receive the original data and the extra data (a
>>>>> receive-side sketch follows this list). The performance of this
>>>>> solution depends on how well derived datatypes are handled, but it
>>>>> seems to be lightweight.
>>>>>
>>>>> * Sending the extra data in a separate message -- this seems to
>>>>> have a much more significant overhead.
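>>>>>
>>>>> As a minimal sketch of the receive side of the wrapping approach,
>>>>> the receiver mirrors the sender's struct datatype built from
>>>>> absolute addresses; the function name and the array of piggyback
>>>>> integers are illustrative assumptions, not part of any existing
>>>>> library:
>>>>>
>>>>>   #include <mpi.h>
>>>>>
>>>>>   int piggyback_recv(void *buf, int count, MPI_Datatype dtype,
>>>>>                      int src, int tag, MPI_Comm comm,
>>>>>                      int *extra, int extra_count, MPI_Status *st)
>>>>>   {
>>>>>       MPI_Aint     addrs[2];
>>>>>       int          lens[2]  = { count, extra_count };
>>>>>       MPI_Datatype types[2] = { dtype, MPI_INT };
>>>>>       MPI_Datatype wrapped;
>>>>>       int          rc;
>>>>>
>>>>>       /* Absolute addresses: the payload lands in the user buffer
>>>>>          and the extra data in a side buffer, with no extra copy. */
>>>>>       MPI_Get_address(buf, &addrs[0]);
>>>>>       MPI_Get_address(extra, &addrs[1]);
>>>>>       MPI_Type_create_struct(2, lens, addrs, types, &wrapped);
>>>>>       MPI_Type_commit(&wrapped);
>>>>>       rc = MPI_Recv(MPI_BOTTOM, 1, wrapped, src, tag, comm, st);
>>>>>       MPI_Type_free(&wrapped);
>>>>>       return rc;
>>>>>   }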
>>>>>
>>>>> Do you know any other portable solution?
>>>>>
>>>>> I have implemented the first solution for P2P operations and it
>>>>> works pretty well. However, there are problems with collective
>>>>> operations. There are two classes of collective calls that are
>>>>> problematic:
>>>>>
>>>>> 1. Single-receiver calls, like MPI_Gather. The sender tasks in
>>>>> gather can be handled in the same way as a normal send: a data
>>>>> item is wrapped and the extra data is piggybacked with the
>>>>> message. The problem is on the receiver side, where the root
>>>>> gathers N data items that must be received into an array big
>>>>> enough to hold all items strided by the datatype extent.
>>>>>
>>>>> In particular, it seems impossible to construct a datatype that
>>>>> contains the data item and the extra data (i.e., a struct type
>>>>> with absolute addresses) AND make an array of these datatypes
>>>>> separated by a fixed extent. For example: the data item to receive
>>>>> from every process is a vector of 1024 doubles, and the extra data
>>>>> is a single integer. The user provides a receive buffer with room
>>>>> for N * 1024 doubles, and the library allocates an array of N
>>>>> integers to receive the piggybacked data. How can one construct a
>>>>> datatype that can be used to receive the data in MPI_Gather?
>>>>>
>>>>> 2. MPI_Reduce calls. There is no problem with datatypes here, as
>>>>> the receiver gets a single data item and not an array as in the
>>>>> previous case. The problem is the reduction operator itself
>>>>> (MPI_Op), because these operators do not work with wrapped
>>>>> datatypes. I could create a new operator that recognizes the
>>>>> wrapped datatype, extracts the original data (skipping the extra
>>>>> data) and performs the original reduction. The question is how to
>>>>> invoke the original reduction on an existing datatype. I have
>>>>> found that Open MPI internally calls ompi_op_reduce(op, inbuf,
>>>>> rbuf, count, dtype), which solves the problem, but it makes the
>>>>> code MPI-implementation dependent. Any idea on more portable
>>>>> options? (See the sketch after this list.)
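>>>>>
>>>>> Given that difficulty, here is a minimal sketch of the
>>>>> separate-message alternative for reduce (the option the reply
>>>>> above concludes is the only one for reduce); the function name and
>>>>> the single piggyback integer per rank are illustrative
>>>>> assumptions:
>>>>>
>>>>>   #include <mpi.h>
>>>>>
>>>>>   int piggyback_reduce(void *sendbuf, void *recvbuf, int count,
>>>>>                        MPI_Datatype dtype, MPI_Op op, int root,
>>>>>                        MPI_Comm comm, int extra,
>>>>>                        int *gathered /* comm size, root only */)
>>>>>   {
>>>>>       /* The original reduction runs untouched, so every MPI_Op,
>>>>>          including user-defined ones, keeps its semantics. */
>>>>>       int rc = MPI_Reduce(sendbuf, recvbuf, count, dtype, op,
>>>>>                           root, comm);
>>>>>       if (rc != MPI_SUCCESS) return rc;
>>>>>
>>>>>       /* The piggyback data travels in an additional collective;
>>>>>          the root ends up with one integer per process. */
>>>>>       return MPI_Gather(&extra, 1, MPI_INT, gathered, 1, MPI_INT,
>>>>>                         root, comm);
>>>>>   }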
>>>>>
>>>>>
>>>>> Thank you in advance for any comment.
>>>>>
>>>>> --Oleg
>>>>>
>>>>>
>>>>>
>>>
>>>
>>> --
>>> Dr. Aurélien Bouteiller
>>> Sr. Research Associate - Innovative Computing Laboratory
>>> Suite 350, 1122 Volunteer Boulevard
>>> Knoxville, TN 37996
>>> 865 974 6321
>>>
>>>
>>>
>>>
>>>
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users