OK. I wanted to post my patch later this week, but you beat me to it, so
here it is attached. But my approach is completely different and may
coexist with yours.
On Tue, Jun 05, 2007 at 12:03:55PM -0400, George Bosilca wrote:
> The multi-NIC support was broken for a while. This patch correct it
It was always completely broken as far as I can tell.
> and take it back to the original performances (sum of bandwidths).
Do you get the sum of bandwidths between TCP and IB without leave_pinned
just by using this patch? I doubt it. The problem with the current code is
that if you have a mix of two networks and one of them doesn't need memory
registration (like TCP), it hijacks all the traffic unless leave_pinned is
in use. The reason is that memory always appears to be registered on TCP,
so OB1 never tries to use anything else for RDMA.
> The idea behind is to decide in the beginning how to split the
> message in fragments and their sizes and then only reschedule on the
> BTLs that complete a fragment. So Instead of using a round-robin over
> the BTL when we select a new BTL, we keep trace of the last BTL and
> schedule the new fragment over it.
Are you sure you attached the correct patch? What the patch does doesn't
match your description. It schedules a new RDMA fragment upon completion
of the previous one instead of blindly doing round-robin, and this is a
very good idea, but unfortunately the implementation breaks threaded
support (and, as was decided today, that is not acceptable). The current
assumption is that OB1 runs the schedule loop for a given request on only
one CPU at a time. When you call the new
mca_pml_ob1_recv_request_schedule_btl_exclusive() function, the schedule
loop may already be running on another CPU.
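
To illustrate the kind of guard that assumption implies, here is a rough
sketch with made-up names (sketch_request, sketch_schedule and so on are
hypothetical, and this is not the OB1 source). It uses C11 atomics: the
first caller becomes the owner and runs the loop, and any caller that
arrives while the loop is running only bumps a counter so the owner makes
one more pass on its behalf.

```c
#include <stdatomic.h>

/* Hypothetical request: "lock" counts pending schedule calls. */
struct sketch_request {
    atomic_int lock;
    int passes;          /* how many times the schedule body ran */
};

/* Stand-in for the real work of scheduling fragments. */
static void sketch_schedule_pass(struct sketch_request *req)
{
    req->passes++;
}

/* Returns 1 if this caller ran the loop, 0 if it handed the work
 * over to whoever is already scheduling this request. */
static int sketch_schedule(struct sketch_request *req)
{
    /* Previous value != 0 means another caller owns the loop. */
    if (atomic_fetch_add(&req->lock, 1) != 0)
        return 0;

    do {
        sketch_schedule_pass(req);
        /* Previous value > 1 means someone asked for another pass. */
    } while (atomic_fetch_sub(&req->lock, 1) > 1);

    return 1;
}
```

Any new entry point into the schedule loop would have to go through the
same kind of guard, otherwise two CPUs can end up in the loop for the same
request.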
> This way, we get good performance even when the relative difference
> between the characteristics of the BTLs are huge. This patch was on
> my modified versions for a while and it was used on one of our multi-
> NIC clusters by several users for few months.
I suppose all the NICs are Ethernet?
My approach is to pre-calculate how much data should be sent on each BTL
according to its relative weight, before we start scheduling. The
schedule function then does no more calculation; it just chops the data
into rdma_frag_length pieces and sends them. The current code doesn't
balance according to btl_weight at all if rdma_frag_length is much smaller
than the message length (it is INT_MAX for TCP, so TCP is special in this
regard too). The reason is that each time the schedule loop decides how
much data to send, it calculates a fragment size according to btl_weight,
then chops it down to rdma_frag_length and loses whatever information it
got from the previous calculation. Just look at the code and do a
simulation. You don't see the problem when all BTLs have the same
bandwidth, because no matter what relative bandwidths the BTLs have, OB1
will always schedule more or less the same number of bytes on each one.
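
For anyone who wants to do the simulation, here is a simplified model of
the two behaviours described above. It is only a sketch: struct sketch_btl
and both functions are made up for illustration and are not the OB1 code
or the attached patch. sketch_current() models weighting each fragment and
then clamping it to rdma_frag_length; sketch_precalc() models splitting
the message by weight once, up front, and then chopping each share into
fragments.

```c
#include <stddef.h>

/* Hypothetical stand-in for a BTL, not the real Open MPI structure. */
struct sketch_btl {
    double weight;            /* relative weight, weights sum to 1.0 */
    size_t rdma_frag_length;  /* largest fragment, must be > 0 */
    size_t bytes;             /* result: bytes scheduled on this BTL */
    size_t frags;             /* result: number of fragments */
};

/* Model of the current behaviour: weight first, then clamp. */
static void sketch_current(struct sketch_btl *btls, size_t n, size_t len)
{
    size_t left = len;
    for (size_t i = 0; i < n; i++)
        btls[i].bytes = btls[i].frags = 0;

    while (left > 0) {
        size_t round_total = left;  /* bytes left when this pass starts */
        for (size_t i = 0; i < n && left > 0; i++) {
            size_t size = (size_t)(round_total * btls[i].weight);
            if (size == 0 || size > left)
                size = left;
            /* The clamp throws away what the weight just computed. */
            if (size > btls[i].rdma_frag_length)
                size = btls[i].rdma_frag_length;
            btls[i].bytes += size;
            btls[i].frags++;
            left -= size;
        }
    }
}

/* Model of the pre-calculated split: weight once, then just chop. */
static void sketch_precalc(struct sketch_btl *btls, size_t n, size_t len)
{
    size_t left = len;
    for (size_t i = 0; i < n; i++) {
        /* The last BTL takes the remainder so no byte is lost. */
        size_t share = (i == n - 1) ? left
                                    : (size_t)(len * btls[i].weight);
        if (share > left)
            share = left;
        left -= share;

        btls[i].bytes = share;
        btls[i].frags = 0;
        while (share > 0) {
            size_t frag = share < btls[i].rdma_frag_length
                              ? share : btls[i].rdma_frag_length;
            share -= frag;
            btls[i].frags++;
        }
    }
}
```

With two BTLs weighted 0.75 and 0.25, rdma_frag_length of 1000 on both and
a 1000000 byte message, the first model ends up with close to an even
split, while the second gives 750000 and 250000 bytes.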