Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Heterogeneous OpenFabrics hardware
From: Nifty Tom Mitchell (niftyompi_at_[hidden])
Date: 2009-01-26 16:33:29

On Mon, Jan 26, 2009 at 11:31:43AM -0800, Paul H. Hargrove wrote:
> Jeff Squyres wrote:
>> The Interop Working Group (IWG) of the OpenFabrics Alliance asked me
>> to bring a question to the Open MPI user and developer communities: is
>> anyone interested in having a single MPI job span HCAs or RNICs from
>> multiple vendors? (pardon the cross-posting, but I did want to ask
>> each group separately -- because the answers may be different)
>> The interop testing lab at the University of New Hampshire
>> ( discovered that most
>> (all?) MPI implementations fail when having a single MPI job span HCAs
>> from multiple vendors and/or span RNICs from multiple vendors. I
>> don't remember the exact details (and they may not be public, anyway),
>> but I'm pretty sure that OMPI failed when used with QLogic and
>> Mellanox HCAs in a single MPI job. This is fairly unsurprising, given
>> how we tune Open MPI's use of OpenFabrics-capable hardware based on an
>> internal Open MPI .ini file.
>> So my question is: does anyone want/need to support jobs that span
>> HCAs from multiple vendors and/or RNICs from multiple vendors?
> I think the interop problems could easily arise with multiple models of
> HCA from the same vendor due to the OMPI practice Jeff mentions of
> tuning the HCAs differently according to model.
> There are institutions who built clusters with many node configurations
> over time, the Sandia CPlant comes to mind as perhaps the best known
> example, but I have seen others.
> In such a system, new machines are added periodically and old ones
> removed. However, at any given time a mixture of ages and
> configurations will exist. In such a system there is NOT always the
> expectation/requirement that an MPI job run across all (or even
> multiple) node types, but such a capability is valuable if it can be
> maintained. So, while I personally don't have the want/need to support
> a single MPI job spanning multiple HCA models/vendors, I can make a case
> for the theoretical value.

I suspect that mixing hardware to any serious degree will slow MPI to the
point that there is little or no value in doing this. Since purchasing
decisions can tip on as little as a 5% advantage, I cannot see vendors
loving this except as a marketing check-off item.

I suspect the most likely common transport would be TCP/IP, and that would
introduce gateway and routing issues between one fast fabric and another
that would be intolerable for most HPC applications (but not all).

It may be that IPoIB would be a sufficient communication layer for InfiniBand
fabrics, but it would not address Myrinet or GigE+ links. Gateways and
bridges would have to come to the party.

My memory of InfiniBand interoperability results is that all IB vendors
can interoperate correctly with IPoIB. Perhaps this is sufficient, but
OpenFabrics is not limited to InfiniBand; i.e., from the OFA web site:
"The mission of the OpenFabrics Alliance (OFA) is to develop, distribute
and promote a unified, transport-independent, open-source software stack
for RDMA-capable fabrics and networks, including InfiniBand and Ethernet."

If Open MPI took the position that TCP/IP was the foundation for MPI
interoperability, it might be possible (as a baseline). This alone might be
sufficient for some site expansion plans.

Picking a transport from the list of OFA choices as a common ground does not seem trivial to me...
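
To make the "common ground" idea concrete, here is a rough sketch of picking
the fastest transport that every node in a job supports, with TCP as the
baseline. It is an illustration only and not how Open MPI selects transports:
the function name, the node names, the transport labels, and the preference
order are all assumptions made up for the example.

```python
# Hypothetical sketch: the labels and the preference order below are
# illustrative assumptions, not Open MPI component names or behavior.
PREFERENCE = ["psm", "openib", "tcp"]  # assumed order, fastest first


def common_transport(nodes, preference=PREFERENCE):
    """Return the most preferred transport supported by every node, or None."""
    if not nodes:
        return None
    # Transports that every node in the job has in common.
    shared = set.intersection(*(set(t) for t in nodes.values()))
    for name in preference:
        if name in shared:
            return name
    return None


# Made-up cluster with mixed hardware generations.
nodes = {
    "node-a": ["psm", "openib", "tcp"],
    "node-b": ["openib", "tcp"],
    "node-c": ["tcp"],
}
print(common_transport(nodes))  # prints "tcp"
```

With node-c dropped from the job, the same call would return "openib", which
is the trade-off above: one slow or odd node drags the whole job down to the
baseline transport.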

On this point...
>> but I'm pretty sure that OMPI failed when used with QLogic and
>> Mellanox HCAs in a single MPI job. This is fairly unsurprising, given
If OMPI was compiled to use the QLogic PSM layer, then it would
interoperate best with PSM-capable hardware. Since QLogic sells multiple
HCAs, including Mellanox-design HCAs, it is incorrect to make a blanket
statement that QLogic HCAs do not interoperate with Mellanox. QLogic
PSM-capable hardware will interoperate with non-PSM hardware with most, if
not all, of the OFA protocols. However, my observation-based assertion
that 5% is a decision-making advantage makes the PSM library the default.
There are some other issues with the latest Mellanox products...

	T o m   M i t c h e l l