Subject: Re: [OMPI users] MPI piggyback mechanism
From: Oleg Morajko (olegmorajko_at_[hidden])
Date: 2008-02-05 11:03:40


Thank you Josh, that's interesting. I'll have a look.
--Oleg

On Feb 5, 2008 2:39 PM, Josh Hursey <jjhursey_at_[hidden]> wrote:

> Oleg,
>
> Interesting work. You mentioned late in your email that you believe
> that adding support for piggybacking to the MPI standard would be the
> best solution. As you may know, the MPI Forum has reconvened and there
> is a working group for Fault Tolerance. This working group is
> discussing a piggybacking interface proposal for the standard, amongst
> other things. If you are interested in contributing to this
> conversation you can find the mailing list here:
> http://lists.cs.uiuc.edu/mailman/listinfo/mpi3-ft
>
> Best,
> Josh
>
> On Feb 5, 2008, at 4:58 AM, Oleg Morajko wrote:
>
> > Hi,
> >
> > I've been working on MPI piggyback technique as a part of my PhD work.
> >
> > Although MPI does not provide native support, there are several
> > different solutions for transmitting piggyback data with every MPI
> > communication. You may find a brief overview in papers [1, 2]. These
> > include copying the original message and the extra data into a bigger
> > buffer, sending an additional message, or changing the send datatype to
> > a dynamically created wrapper datatype that contains a pointer to the
> > original data plus the piggyback data. I have tried all of these
> > mechanisms and they work, but considering the overhead, there is no
> > single "best" technique that outperforms the others in all scenarios.
> > Jeff Squyres had interesting comments on this subject before (on this
> > mailing list).
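> >
> > As a rough illustration of the buffer-copy variant (buf, count, dtype,
> > dest, tag, comm and a piggyback int named piggy are assumed to be in
> > scope; all names are just for the sketch), the sender can pack the
> > payload and the piggyback field into one contiguous buffer:
> >
> >     /* pack the user payload and one piggyback int into one buffer */
> >     int pos = 0, sz_data, sz_pb;
> >     MPI_Pack_size(count, dtype, comm, &sz_data);
> >     MPI_Pack_size(1, MPI_INT, comm, &sz_pb);
> >     char *tmp = malloc(sz_data + sz_pb);
> >
> >     MPI_Pack(buf, count, dtype, tmp, sz_data + sz_pb, &pos, comm);
> >     MPI_Pack(&piggy, 1, MPI_INT, tmp, sz_data + sz_pb, &pos, comm);
> >     PMPI_Send(tmp, pos, MPI_PACKED, dest, tag, comm);
> >     free(tmp);
> >
> > The receiver does the reverse with MPI_Unpack; the obvious cost is the
> > extra copy of the whole payload.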
> >
> > Finally, after some benchmarking, I have implemented a hybrid technique
> > that combines the existing mechanisms. For small point-to-point
> > messages, datatype wrapping seems to be the least intrusive, at least
> > with Open MPI's implementation of derived datatypes. For large
> > point-to-point messages, experiments confirmed that sending an
> > additional message is much cheaper than wrapping (and besides, the
> > intrusion is small since we are already sending a large message).
> > Moreover, the implementation may interleave the original send with an
> > asynchronous send of the piggyback data. This optimization partially
> > hides the latency of the additional send and lowers the overall
> > intrusion. The same criteria can be applied to collective operations,
> > except for barrier and reduce operations. As the former does not
> > transmit any data and the latter transforms the data, the only solution
> > there is to send additional messages.
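> >
> > A minimal sketch of such a size-based switch might look as follows
> > (PB_SMALL_LIMIT, PB_TAG and the helper PB_Send_wrapped are purely
> > illustrative; the real cut-off would come from benchmarking):
> >
> >     #define PB_SMALL_LIMIT 4096    /* bytes; illustrative cut-off */
> >     #define PB_TAG         31001   /* illustrative piggyback tag */
> >
> >     int PB_Send(const void *buf, int count, MPI_Datatype dtype,
> >                 int dest, int tag, MPI_Comm comm, int piggy)
> >     {
> >         int size;
> >         MPI_Type_size(dtype, &size);
> >
> >         if ((long) count * size <= PB_SMALL_LIMIT) {
> >             /* small message: wrap payload + piggyback into one
> >                derived datatype (see the wrapping sketch further
> >                down the thread) */
> >             return PB_Send_wrapped(buf, count, dtype, dest, tag,
> >                                    comm, &piggy);
> >         }
> >
> >         /* large message: overlap an asynchronous send of the
> >            piggyback data with the original payload send */
> >         MPI_Request req;
> >         MPI_Isend(&piggy, 1, MPI_INT, dest, PB_TAG, comm, &req);
> >         int rc = PMPI_Send(buf, count, dtype, dest, tag, comm);
> >         MPI_Wait(&req, MPI_STATUS_IGNORE);
> >         return rc;
> >     }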
> >
> > There is a penalty, of course. Especially for collective operations
> > with very small messages, the intrusion may reach 15%, and that's a
> > lot. It then decreases down to 0.1% for bigger messages, but it is
> > still there. I don't know what your requirements/expectations are
> > regarding that issue. The only work that reported lower overheads is
> > [3], but they added native piggyback support by changing the underlying
> > MPI implementation.
> >
> > I think the best possible option is to add piggyback support to MPI as
> > part of the standard. A growing number of runtime tools use this
> > functionality for multiple reasons, and certainly PMPI itself is not
> > enough.
> > References of interest:
> >
> > [1] Shende, S., Malony, A., Morris, A., Wolf, F., "Performance
> >     Profiling Overhead Compensation for MPI Programs". 12th
> >     EuroPVM-MPI Conference, LNCS, vol. 3666, pp. 359-367, 2005.
> >     They review various techniques and come up with datatype wrapping.
> >
> > [2] Schulz, M., "Extracting Critical Path Graphs from MPI
> >     Applications". Cluster Computing 2005, IEEE International,
> >     pp. 1-10, September 2005. They use datatype wrapping.
> >
> > [3] Vetter, J., "Dynamic Statistical Profiling of Communication
> >     Activity in Distributed Applications". They add piggyback support
> >     at the MPI implementation level and report very low overheads
> >     (no surprise).
> >
> > Regards,
> > Oleg Morajko
> >
> >
> > On Feb 1, 2008 5:08 PM, Aurélien Bouteiller <bouteill_at_[hidden]>
> > wrote:
> >
> >> I don't know of any work in that direction for now. Indeed, we plan to
> >> eventually integrate at least causal message logging into the pml-v,
> >> which also includes piggybacking. Therefore we are open to
> >> collaborating with you on this matter. Please let us know :)
> >>
> >> Aurelien
> >>
> >>
> >>
> >> On Feb 1, 2008, at 09:51, Thomas Ropars wrote:
> >>
> >>> Hi,
> >>>
> >>> I'm currently working on optimistic message logging, and I would like
> >>> to implement an optimistic message logging protocol in Open MPI.
> >>> Optimistic message logging protocols piggyback information about
> >>> dependencies between processes on the application messages, in order
> >>> to be able to find a consistent global state after a failure. That's
> >>> why I'm interested in the problem of piggybacking information on MPI
> >>> messages.
> >>>
> >>> Is there any work on this problem at the moment?
> >>> Has anyone already implemented a mechanism in Open MPI to piggyback
> >>> data on MPI messages?
> >>>
> >>> Regards,
> >>>
> >>> Thomas
> >>>
> >>> Oleg Morajko wrote:
> >>>> Hi,
> >>>>
> >>>> I'm developing a causality chain tracking library and need a
> >>>> mechanism to attach extra data to every MPI message, a so-called
> >>>> piggyback mechanism.
> >>>>
> >>>> As far as I know, there are a few solutions to this problem, of which
> >>>> the two fundamental ones are the following:
> >>>>
> >>>>   * Dynamic datatype wrapping - if a user calls MPI_Send with, say,
> >>>>     1024 doubles, the wrapped send call dynamically creates a derived
> >>>>     datatype that is a structure composed of a pointer to the 1024
> >>>>     doubles and the extra fields to be piggybacked. The datatype is
> >>>>     constructed with absolute addresses to avoid copying the original
> >>>>     buffer. The receiver's side creates an equivalent datatype to
> >>>>     receive the original data and the extra data. The performance of
> >>>>     this solution depends on how good the derived-datatype handling
> >>>>     is, but it seems to be lightweight (see the sketch after this
> >>>>     list).
> >>>>
> >>>>   * Sending the extra data in a separate message -- this seems to
> >>>>     have a much more significant overhead.
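> >>>>
> >>>> A minimal sketch of the send side of this wrapping (buf, count, dest,
> >>>> tag, comm and a piggyback int named piggy are assumed to be in scope;
> >>>> error checking is omitted):
> >>>>
> >>>>     /* build a struct datatype from absolute addresses so that the
> >>>>        payload is never copied; the send then starts at MPI_BOTTOM */
> >>>>     int          lens[2]  = { count, 1 };
> >>>>     MPI_Aint     disps[2];
> >>>>     MPI_Datatype types[2] = { MPI_DOUBLE, MPI_INT };
> >>>>     MPI_Datatype wrapped;
> >>>>
> >>>>     MPI_Get_address((void *) buf, &disps[0]);   /* user payload  */
> >>>>     MPI_Get_address(&piggy, &disps[1]);         /* piggyback int */
> >>>>
> >>>>     MPI_Type_create_struct(2, lens, disps, types, &wrapped);
> >>>>     MPI_Type_commit(&wrapped);
> >>>>     PMPI_Send(MPI_BOTTOM, 1, wrapped, dest, tag, comm);
> >>>>     MPI_Type_free(&wrapped);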
> >>>>
> >>>> Do you know any other portable solution?
> >>>>
> >>>> I have implemented the first solution for P2P operations and it
> >>>> works pretty well. However, there are problems with collective
> >>>> operations. There are two classes of collective calls that are
> >>>> problematic:
> >>>>
> >>>> 1. Single-receiver calls, like MPI_Gather. The sender tasks in a
> >>>>    gather can be handled in the same way as a normal send: the data
> >>>>    item is wrapped and the extra data is piggybacked with the
> >>>>    message. The problem is on the receiver side, when the root
> >>>>    gathers N data items that must be received into an array big
> >>>>    enough to hold all the items, strided by the datatype extent.
> >>>>
> >>>>    In particular, it seems impossible to construct a datatype that
> >>>>    contains the data item and the extra data (i.e. a struct type
> >>>>    with absolute addresses) AND make an array of these datatypes
> >>>>    separated by a fixed extent. For example: the data item to
> >>>>    receive from every process is a vector of 1024 doubles, and the
> >>>>    extra data is a single integer. The user provides a receive
> >>>>    buffer with room for N * 1024 doubles, and the library allocates
> >>>>    an array of N integers to receive the piggybacked data. How can
> >>>>    one construct a datatype that can be used to receive the data in
> >>>>    MPI_Gather? (A possible fallback is sketched after this list.)
> >>>>
> >>>> 2. MPI_Reduce calls. There is no problem with datatypes here, as the
> >>>>    receiver gets a single data item and not an array as in the
> >>>>    previous case. The problem is the reduction operator itself
> >>>>    (MPI_Op), because these operators do not work with wrapped
> >>>>    datatypes. So I can create a new operator that recognizes the
> >>>>    wrapped datatype, extracts the original data (skipping the extra
> >>>>    data) and performs the original reduction. The point is how to
> >>>>    invoke the original reduction on an existing datatype. I have
> >>>>    found that Open MPI internally calls ompi_op_reduce(op, inbuf,
> >>>>    rbuf, count, dtype), which solves the problem. However, this
> >>>>    makes the code MPI-implementation dependent. Any ideas on more
> >>>>    portable options? (A sketch of such an operator follows this
> >>>>    list.)
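> >>>>
> >>>> For case 1, one possible fallback, in line with the extra-message
> >>>> technique, is simply to collect the piggyback values with a second,
> >>>> separate gather (pb_local and pb_all are illustrative names):
> >>>>
> >>>>     /* gather the payload exactly as the user requested ... */
> >>>>     PMPI_Gather(sendbuf, 1024, MPI_DOUBLE,
> >>>>                 recvbuf, 1024, MPI_DOUBLE, root, comm);
> >>>>     /* ... and the per-rank piggyback ints in a separate call */
> >>>>     PMPI_Gather(&pb_local, 1, MPI_INT,
> >>>>                 pb_all,   1, MPI_INT, root, comm);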
> >>>>
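> >>>> For case 2, a possible shape of such a wrapper operator, assuming a
> >>>> contiguous wrapped layout; the original reduction is hard-coded here
> >>>> as a sum of doubles purely for the sketch, and making it generic over
> >>>> an arbitrary MPI_Op portably is exactly the open question:
> >>>>
> >>>>     /* wrapper op: apply the original reduction to the payload part
> >>>>        and merge the piggyback part separately (max of two ints) */
> >>>>     typedef struct { double data[1024]; int pb; } pb_elem_t;
> >>>>
> >>>>     void pb_reduce(void *in, void *inout, int *len, MPI_Datatype *dt)
> >>>>     {
> >>>>         pb_elem_t *a = (pb_elem_t *) in;
> >>>>         pb_elem_t *b = (pb_elem_t *) inout;
> >>>>         for (int e = 0; e < *len; e++) {
> >>>>             for (int i = 0; i < 1024; i++)
> >>>>                 b[e].data[i] += a[e].data[i];   /* original MPI_SUM */
> >>>>             if (a[e].pb > b[e].pb)
> >>>>                 b[e].pb = a[e].pb;
> >>>>         }
> >>>>     }
> >>>>
> >>>>     MPI_Op pb_op;
> >>>>     MPI_Op_create(pb_reduce, 1, &pb_op);    /* 1 = commutative */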
> >>>>
> >>>> Thank you in advance for any comment.
> >>>>
> >>>> --Oleg
> >>>>
> >>>>
> >>>>
> >>
> >>
> >>
> >> --
> >> Dr. Aurélien Bouteiller
> >> Sr. Research Associate - Innovative Computing Laboratory
> >> Suite 350, 1122 Volunteer Boulevard
> >> Knoxville, TN 37996
> >> 865 974 6321
> >>
> >>
> >>
> >>
> >>