Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Multi-Rail and Open IB BTL
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2007-11-12 06:03:07


On Nov 9, 2007, at 1:24 PM, Don Kerr wrote:

> both, I was thinking of listing what I think are multi-rail
> requirements
> but wanted to understand what the current state of things are

I believe the OF portion of the FAQ describes what we do in the v1.2
series (right Gleb?); I honestly don't remember what we do today on
the trunk (I'm pretty sure that Gleb has tweaked it recently).

As for what we *should* do, it's a very complicated question. :-\

This is where all these discussions regarding affinity, NUMA, and NUNA
(non-uniform network architecture) come into play. A "very simple"
scenario may be something like this:

- host A is UMA (perhaps even a uniprocessor) with 2 ports that are
equidistant from the 1 MPI process on that host
- host B is the same, except it only has 1 active port on the same IB
subnet as host A's 2 ports
- the ports on both hosts are all the same speed (e.g., DDR)
- the ports all share a single, common, non-blocking switch

But even with this "simple" case, the answer as to what you should do
is still unclear. If host A is able to drive both of its DDR links at
full speed, you could cause congestion at the link to host B if the
MPI process on host A opens two connections. But if host A is only
able to drive the same effective bandwidth out of its two ports as it
is through a single port, then the end effect is probably fairly
negligible -- it might not make much of a difference at all as to
whether the MPI process on host A opens 1 or 2 connections to host B.
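
To put rough numbers on the simple case above, here is a minimal Python
sketch. It is a toy model only, not anything taken from the Open MPI code
base: the function names are hypothetical, and the 16 Gb/s per-port data
rate assumes 4x DDR links (the scenario itself only says "DDR").

```python
# Toy model of the "simple" two-host scenario. All names are hypothetical.
# Assumption: each port carries 16 Gb/s of data (4x DDR after encoding).

LINK_GBPS = 16.0


def offered_load(host_drive_gbps, link_gbps, connections):
    """Traffic (Gb/s) host A can push toward host B.

    Limited by what the host can drive in total and by the combined
    rate of the ports it has opened connections on.
    """
    return min(host_drive_gbps, connections * link_gbps)


def oversubscription(host_drive_gbps, link_gbps, connections, dest_ports=1):
    """Ratio of offered load to the capacity of host B's active port(s).

    A value above 1.0 means host B's link is a congestion point.
    """
    capacity = dest_ports * link_gbps
    return offered_load(host_drive_gbps, link_gbps, connections) / capacity


if __name__ == "__main__":
    # Case 1: host A can only drive one link's worth of bandwidth in total.
    # Case 2: host A can drive both of its links at full speed.
    for drive in (LINK_GBPS, 2 * LINK_GBPS):
        for conns in (1, 2):
            ratio = oversubscription(drive, LINK_GBPS, conns)
            print(f"host A drives {drive:.0f} Gb/s, {conns} connection(s): "
                  f"load/capacity at host B = {ratio:.1f}")
```

In this model the ratio only exceeds 1.0 when host A can drive both links
at full speed and opens two connections, which is the congestion case
described above. It deliberately ignores NUMA, switch behavior, and other
jobs on the network.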

But then throw in other effects that I mentioned above (NUMA, NUNA,
etc.), and the equation becomes much more complex. In some cases, it
may be good to open 1 connection (e.g., bandwidth load balancing); in
other cases it may be good to open 2 (e.g., congestion avoidance /
spreading traffic around the network, particularly in the presence of
other MPI jobs on the network). :-\

Such NUNA architectures may sound unusual to some, but both IBM and HP
sell [many] blade-based HPC solutions with NUNA internal IB networks.
In other words, this is a fairly common scenario.

So this is a difficult question without a great answer. The hope is
that the new carto framework that Sharon sent requirements around for
will be able to at least make topology information available from both
the host and the network so that BTLs can possibly make some
intelligent decisions about what to do in these kinds of scenarios.

-- 
Jeff Squyres
Cisco Systems