Open MPI Development Mailing List Archives

From: Gleb Natapov (glebn_at_[hidden])
Date: 2007-06-27 01:49:54

On Tue, Jun 26, 2007 at 05:42:05PM -0400, George Bosilca wrote:
> Gleb,
> Simplifying the code and getting better performance is always a good
> approach (at least from my perspective). However, your patch still
> dispatches the messages over the BTLs in a round-robin fashion, which
> doesn't look to me like the best approach. How about merging your patch
> and mine ? We will get a better data distribution and a better
> scheduling (on-demand based on the network load).
Yes, my patch still does round robin. Incorporating your idea into OB1 is
on my todo list. We just need to honor the OB1 multithreading rules, i.e.,
if scheduling for the request is already running when an RDMA completion
arrives, do nothing; otherwise restart scheduling from the BTL that
received the completion. The problem is that multiple completions may run
in different threads simultaneously, so we have to be careful, and I don't
want to introduce new locks if possible.
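
The rule above (only one thread schedules a request at a time, without a new lock) could be sketched with an atomic counter. This is a minimal illustration under assumptions, not the actual OB1 code: `request_t`, `sched_pending`, `last_start`, `schedule_once` and `on_rdma_completion` are hypothetical names, and here "do nothing" means the losing thread only records that another pass is wanted and returns.

```c
#include <stdatomic.h>

/* Hypothetical request descriptor; NOT the real OB1 structure. */
typedef struct {
    atomic_int sched_pending; /* scheduling requests not yet served */
    int        passes;        /* scheduling passes that actually ran */
    int        last_start;    /* BTL index the last pass started from */
} request_t;

/* Hypothetical stand-in for one scheduling pass starting at start_btl.
 * A real pass would post the next fragments; this one only keeps count. */
static void schedule_once(request_t *req, int start_btl)
{
    req->passes++;
    req->last_start = start_btl;
}

/* Called from a completion callback, possibly from several threads. */
static void on_rdma_completion(request_t *req, int btl_index)
{
    /* A non-zero previous value means another thread is already
     * scheduling this request. Our increment tells it to run one more
     * pass, so we can return without taking any lock. */
    if (atomic_fetch_add(&req->sched_pending, 1) != 0)
        return;

    /* We own scheduling. Keep running passes until every increment that
     * arrived in the meantime has been consumed. */
    do {
        schedule_once(req, btl_index);
    } while (atomic_fetch_sub(&req->sched_pending, 1) != 1);
}
```

Only the thread that moved the counter from 0 to 1 touches the non-atomic fields, so they need no extra protection in this sketch. In this simplified version the extra passes restart from the owner's BTL rather than from the BTL of the later completion.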

> Btw, did you compare my patch with yours on your multi-NIC system ?
> With my patch on our system with 3 networks (2 x 1 Gb/s and one
> 100 Mb/s) I'm close to 99% of the total bandwidth. I'll try to see what
> I get with yours.
I tested only with multiple HCAs, not Ethernet NICs. The TCP BTL is
special because its rdma_pipline_frag is configured to be INT_MAX, so
there is no fairness issue in OB1 scheduling: the request is sent by
looping only once in the recv_schedule_exclusive function. I think that if
you configure rdma_pipline_frag to be 128K, your overall bandwidth will
drop to less than 50% (I don't have such a setup, so I can't check), and
that is the problem I tried to address with the patch.
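
To illustrate the fairness problem, here is a minimal sketch that compares an equal split with a split proportional to link bandwidth. It is an idealized model (no latency, no per-fragment overhead) and not the code from the patch; `split_by_weight` and `transfer_time` are hypothetical helpers, and bandwidths are assumed to be positive. In this model, for links of 1000, 1000 and 100 Mb/s, an equal split finishes only when the 100 Mb/s link has sent its third of the data, which gives about 300 Mb/s out of an aggregate 2100 Mb/s.

```c
#include <stddef.h>

/* Split `total` bytes across `n` links in proportion to their bandwidth.
 * Assumes n >= 1 and strictly positive bandwidths. The rounding remainder
 * goes to the last link so the parts always sum to `total`. */
static void split_by_weight(size_t total, const double *bandwidth,
                            size_t n, size_t *out)
{
    double sum = 0.0;
    size_t assigned = 0;

    for (size_t i = 0; i < n; i++)
        sum += bandwidth[i];

    for (size_t i = 0; i < n; i++) {
        if (i + 1 == n)
            out[i] = total - assigned;
        else
            out[i] = (size_t)((double)total * bandwidth[i] / sum);
        assigned += out[i];
    }
}

/* In this model the links send in parallel, so the message completes
 * when the slowest link finishes its share. */
static double transfer_time(const size_t *bytes, const double *bandwidth,
                            size_t n)
{
    double worst = 0.0;

    for (size_t i = 0; i < n; i++) {
        double t = (double)bytes[i] / bandwidth[i];
        if (t > worst)
            worst = t;
    }
    return worst;
}
```

Calling `split_by_weight` with the three bandwidths above and comparing `transfer_time` for the weighted and the equal split shows how much the slowest link dominates in the equal case.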

> Now that we're looking at improving the performance of the multi-BTL
> stuff I think I have another idea. How about merging the ack with the
> next pipeline fragment for RDMA (except for the last fragment) ?
Can you elaborate? If you are talking about the ACK from the receiver on
match, then we already merge it with the first PUT message if possible.

> Thanks,
> george.
> On Jun 25, 2007, at 8:28 AM, Gleb Natapov wrote:
> >Hello,
> >
> > The attached patch improves the OB1 scheduling algorithm between
> >multiple links. The current algorithm performs very poorly if
> >interconnects with very different bandwidth values are used. For big
> >message sizes it always divides traffic equally between all available
> >interconnects. The attached patch changes this. It calculates for each
> >message how much data should be sent via each link according to the
> >relative weight of the link. This is done for the RDMAed part of the
> >message as well as for the part that is sent by send/recv in the case
> >of the pipeline protocol. As a side effect, the
> >send_schedule/recv_schedule functions are greatly simplified.
> >
> > Surprisingly (at least for me), this patch also greatly improves some
> >benchmark results when multiple links with the same bandwidth are in
> >use. The attached postscript shows some benchmark results with and
> >without the patch. I used two computers connected with 4 DDR HCAs for
> >this benchmark. Each HCA is capable of ~1600 MB/s on its own.
> >
> >--
> >
> >Gleb.
> ><ob1_multi_nic_scheduling.diff>
> >_______________________________________________
> >devel mailing list
> >devel_at_[hidden]
> >
