
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Progress of the asynchronous messages
From: George Bosilca (bosilca_at_[hidden])
Date: 2008-11-06 13:52:12


In order to get good performance out of your test application, the
whole message has to be sent in just one fragment. The reason is that,
as long as the MPI library has no internal progress thread, there is
no way for it to make progress on a message while your code is busy
outside of the library.
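
(For context, here is a minimal sketch of the overlap pattern under
discussion; the buffer, sizes, and do_computation() are placeholders,
not taken from the original test program:)

    #include <mpi.h>

    extern void do_computation(void);  /* placeholder for the real work */

    /* Post a nonblocking send, compute, then wait. Without a progress
     * thread, the transfer only proceeds "in the background" if the
     * whole message fits in a single eager fragment. */
    void overlap_send(void *buf, int count, int dest, MPI_Comm comm)
    {
        MPI_Request req;
        MPI_Isend(buf, count, MPI_BYTE, dest, 0, comm, &req);
        do_computation();   /* no MPI calls here, so no progress happens */
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* a large message may only
                                               really move here */
    }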

Now, I can explain how to do this, but trust me, this is an ugly hack
that makes your application MPI-implementation specific, i.e. not
portable in terms of performance. But I guess this decision is up to
you. The really bad thing that might happen is that, if the receiver
is slower than the sender, you will buffer all of these eager messages
in the receiver's memory (what a waste), you will use a lot more
memory copies, and you will give up the possibility of using the RMA
features available on your network. So yes, your specific code may
eventually run faster, but the price to pay is way too expensive
[from my perspective].

Here is how you can do this: based on the network you use (openib in
this case), the parameter selecting the first fragment size is called
*_eager_limit. Run "ompi_info --param btl openib" and grep for
eager_limit to figure out the exact name of the parameter, then set it
using "--mca <name> <value>" to the value that you want. As an
example, I think this will work for openib: "--mca
btl_openib_eager_limit 8388648" (8388608 + 40 bytes for the internal
headers).

   george.

On Nov 6, 2008, at 12:52 PM, Eugene Loh wrote:

> vladimir marjanovic wrote:
>>
>> In order to overlap communication and computation I don't want to
>> use MPI_Wait.
> Right. One thing to keep in mind is that there are two ways of
> overlapping communication and computation. One is you start a send
> (MPI_Isend), you do a bunch of computation while the message is
> being sent, and then after the message has been sent you call
> MPI_Wait just to clean up. This assumes that the MPI implementation
> can send a message while control of the program has been returned to
> you. The experts can give you the fine print, but my simple
> assertion is, "This doesn't usually happen."
>
> Rather, the MPI implementation typically will send data only when
> your code is in some MPI call. That's why you have to call MPI_Test
> periodically... or some other MPI function. (A sketch of this
> polling pattern follows at the end of this message.)
>> For sure the message is being decomposed into chunks, and the size
>> of a chunk is probably defined by an environment variable.
>> Do you know how I can control the chunk size?
> I don't. Try running "ompi_info -a" and looking through the
> parameters. For the shared-memory BTL, it's
> mca_btl_sm_max_frag_size. I also see something like
> coll_sm_fragment_size. Maybe look at the parameters that have
> "btl_openib_max" in their names.
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
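
(For illustration, a minimal sketch of the periodic-MPI_Test pattern
Eugene describes above; do_a_slice_of_work() is a placeholder for one
chunk of the application's computation:)

    #include <mpi.h>

    extern void do_a_slice_of_work(void);  /* placeholder work unit */

    /* Interleave computation with MPI_Test so the library gets a
     * chance to progress the pending send on every iteration. */
    void send_with_polling(void *buf, int count, int dest, MPI_Comm comm)
    {
        MPI_Request req;
        int done = 0;

        MPI_Isend(buf, count, MPI_BYTE, dest, 0, comm, &req);
        while (!done) {
            do_a_slice_of_work();                      /* compute */
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);  /* progress */
        }
    }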


