Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Progress of the asynchronous messages
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-11-06 14:00:07

George is right -- you *can* do this, but it is *not advised* (you'll
likely run out of memory or other resources pretty quickly -- if you
can run at all!). :-)

Try mpi_leave_pinned, and check out those FAQ sections that I sent,
particularly the OpenFabrics section, for how to specifically tune
various behaviors of the openib BTL.

On Nov 6, 2008, at 1:52 PM, George Bosilca wrote:

> In order to get good performance out of your test application, the
> whole message has to be sent in just one fragment. The reason is
> that as long as there is no progress thread for the MPI library
> (internal to the library), there is no way to make progress while
> your code is outside of MPI calls.
> Now, I can explain how to do this, but trust me, this is an ugly
> hack that makes your application MPI implementation specific, i.e.
> not portable in terms of performance. But, I guess this decision is
> up to you. The really bad thing that might happen is that, if the
> receiver is slower than the sender, you will buffer all
> these eager messages in the receiver's memory (what a
> waste), you will use a lot more memory copies, and you give up the
> possibility of using the RMA features available on your network. So
> yes, your specific code may eventually run faster, but the
> price to pay is way too expensive [from my perspective].
> Here is how you can do this: Based on the network you use (openib
> in this case), the parameter selecting the first fragment size is
> called *_eager_limit. Run "ompi_info --param btl openib", grep for
> eager_limit to figure out the name of the parameter, and set it using
> "--mca <name> <value>" to the value that you want. As an example, I
> think this will work for openib: "--mca btl_openib_eager_limit
> 8388648" (8388608 + 40 for internal headers).
> george.
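
The steps George describes can be sketched as a short shell snippet. This is only an illustration under stated assumptions: an 8 MiB message, the 40-byte header allowance quoted above, and a placeholder program name (./my_app) and process count (-np 2) that are not from the thread. It prints the launch command instead of running it, and only queries ompi_info if it is installed.

```shell
# Sketch only: assumes Open MPI with the openib BTL.
# ./my_app and -np 2 are hypothetical placeholders, not from the thread.
payload_bytes=$((8 * 1024 * 1024))   # 8 MiB message payload = 8388608 bytes
header_bytes=40                      # header allowance quoted in the thread
eager_limit=$((payload_bytes + header_bytes))

# List the eager_limit parameters if ompi_info is available; otherwise skip.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info --param btl openib | grep eager_limit || true
fi

# Print the launch command rather than running it.
echo "mpirun --mca btl_openib_eager_limit ${eager_limit} -np 2 ./my_app"
```

Deriving the value from the payload size keeps the header allowance visible, so the limit can be recomputed if the message size changes. The caveats above about receiver-side buffering and extra memory copies still apply.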
> On Nov 6, 2008, at 12:52 PM, Eugene Loh wrote:
>> vladimir marjanovic wrote:
>>> In order to overlap communication and computation I don't want to
>>> use MPI_Wait.
>> Right. One thing to keep in mind is that there are two ways of
>> overlapping communication and computation. One is you start a send
>> (MPI_Isend), you do a bunch of computation while the message is
>> being sent, and then after the message has been sent you call
>> MPI_Wait just to clean up. This assumes that the MPI
>> implementation can send a message while control of the program has
>> been returned to you. The experts can give you the fine print, but
>> my simple assertion is, "This doesn't usually happen."
>> Rather, the MPI implementation typically will send data only when
>> your code is in some MPI call. That's why you have to call
>> MPI_Test periodically... or some other MPI function.
>>> For sure the message is being decomposed into chunks and the size
>>> of a chunk is probably defined by an environment variable.
>>> Do you know how I can control the chunk size?
>> I don't. Try running "ompi_info -a" and looking through the
>> parameters. For the shared-memory BTL, it's
>> mca_btl_sm_max_frag_size. I also see something like
>> coll_sm_fragment_size. Maybe look at the parameters that have
>> "btl_openib_max" in their names.
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]

Jeff Squyres
Cisco Systems