Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] calling sendi earlier in the PML
From: George Bosilca (bosilca_at_[hidden])
Date: 2009-03-04 17:29:01

On Mar 4, 2009, at 14:44 , Eugene Loh wrote:

> Let me try another thought here. Why do we have BTL sendi functions
> at all? I'll make an assertion and would appreciate feedback: a
> BTL sendi function contributes nothing to optimizing send latency.
> To optimize send latency in the "immediate" case, we need *ONLY* PML
> work.

Because otherwise you will have to make two BTL calls instead of one,
plus one extra memcpy (or not, depending on your network). First you
will have to call btl_alloc to get back a descriptor with some BTL
memory attached to it. Then you will put your data (including the
header) in this memory and, once ready, call btl_send. With sendi there
is only one call from the PML into the BTL, but this time it is the
BTL's responsibility to prepare the data that will be sent.
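
To make the difference in call count concrete, here is a minimal toy
sketch of the two paths. Every name in it (toy_btl_t, toy_btl_alloc,
toy_btl_send, toy_btl_sendi, the toy_pml_* helpers, the 8-byte header)
is hypothetical and chosen for illustration; none of it is the real
Open MPI BTL interface, whose functions take descriptors, endpoints and
convertors. It only models how many times the PML crosses into the BTL,
not the memcpy that some networks can save.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model only: hypothetical names, NOT the Open MPI BTL API. */
enum { TOY_HDR_LEN = 8, TOY_BUF_LEN = 256 };

typedef struct {
    unsigned char wire[TOY_BUF_LEN]; /* stands in for BTL-owned memory */
    size_t wire_len;                 /* bytes handed to the "network" */
    int calls;                       /* number of PML -> BTL calls */
} toy_btl_t;

/* Two-call path, call 1: give the PML a buffer owned by the BTL. */
static unsigned char *toy_btl_alloc(toy_btl_t *btl, size_t len)
{
    btl->calls++;
    assert(len <= TOY_BUF_LEN);
    return btl->wire;
}

/* Two-call path, call 2: send the buffer the PML has filled in. */
static int toy_btl_send(toy_btl_t *btl, size_t len)
{
    btl->calls++;
    btl->wire_len = len;
    return 0;
}

/* One-call path: the BTL itself packs header and payload, then sends. */
static int toy_btl_sendi(toy_btl_t *btl, const void *hdr,
                         const void *payload, size_t payload_len)
{
    btl->calls++;
    assert(TOY_HDR_LEN + payload_len <= TOY_BUF_LEN);
    memcpy(btl->wire, hdr, TOY_HDR_LEN);
    memcpy(btl->wire + TOY_HDR_LEN, payload, payload_len);
    btl->wire_len = TOY_HDR_LEN + payload_len;
    return 0;
}

/* PML without sendi: alloc, pack in the PML, then send. */
static int toy_pml_send_two_calls(toy_btl_t *btl, const void *hdr,
                                  const void *payload, size_t payload_len)
{
    unsigned char *buf = toy_btl_alloc(btl, TOY_HDR_LEN + payload_len);
    memcpy(buf, hdr, TOY_HDR_LEN);
    memcpy(buf + TOY_HDR_LEN, payload, payload_len);
    return toy_btl_send(btl, TOY_HDR_LEN + payload_len);
}

/* PML with sendi: a single call, packing is the BTL's job. */
static int toy_pml_send_one_call(toy_btl_t *btl, const void *hdr,
                                 const void *payload, size_t payload_len)
{
    return toy_btl_sendi(btl, hdr, payload, payload_len);
}
```

Sending the same header and payload through both helpers on two
zero-initialized toy_btl_t instances leaves identical bytes in wire,
with calls equal to 2 on the first path and 1 on the second.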

> I'm churning a lot and not making much progress, but I'll try
> chewing on that idea (unless someone points out it's utterly
> ridiculous). I'll look into having PML ignore sendi functions
> altogether and just make the "send-immediate" path work fast with
> normal send functions. If that works, then we can get rid of sendi
> functions and hopefully have a solution that makes sense for everyone.

This is utterly ridiculous (I hope you really expect someone to say
it). As I said before, SM is only one of the networks supported by
Open MPI. Regardless of how much I would like to have better shared
memory performance, I will not agree to any PML modifications that
are SM oriented. We did that in the past with other BTLs and it turned
out to be a bad idea, so I'm clearly not in favor of making the same
mistake twice.

Regarding sendi, there are at least 3 networks that can take
advantage of it: Portals, MX and Sicortex. Some of them do this right
now; others will in the near future. Moreover, for these particular
networks there is no way to avoid the extra overhead without this feature
(for very obscure reasons, such as noncontiguous pieces of memory known
only to the BTL that can decrease the number of network operations).