Subject: Re: [OMPI users] Heterogeneous OpenFabrics hardware
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-01-27 11:01:14

On Jan 27, 2009, at 10:19 AM, Peter Kjellstrom wrote:

>> It is worth clarifying a point in this discussion that I neglected to
>> mention in my initial post: although Open MPI may not work *by
>> default* with heterogeneous HCAs/RNICs, it is quite possible/likely
>> that if you manually configure Open MPI to use the same verbs/hardware
>> settings across all your HCAs/RNICs (assuming that you use a set of
>> values that is compatible with all your hardware) that MPI jobs
>> spanning multiple different kinds of HCAs or RNICs will work fine.
>> See this post on the devel list for a few more details:
> So is it correct that each rank will check its HCA-model and then pick up
> suitable settings for that HCA?

Correct. We have an INI-style file that is installed in
$pkgdir/mca-btl-openib-device-params.ini (typically expands to
$prefix/share/openmpi/mca-btl-openib-device-params.ini). This file
contains a bunch of device-specific parameters, but it also has a
"general" section that can be applied to any device if no specific
match is found.
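
To illustrate the lookup described above, here is a minimal sketch in Python of a device-specific match with a fallback to a general section. It is not the openib BTL's actual parser, and the INI contents are made up for illustration: the section names, keys (vendor_id, vendor_part_id, mtu, use_eager_rdma), and values are all assumptions, as is the helper name lookup_params.

```python
import configparser

# Hypothetical INI contents: section names, keys, and values are
# illustrative only and are not copied from the real Open MPI file.
SAMPLE_INI = """
[general]
mtu = 1024
use_eager_rdma = 0

[Vendor A HCA]
vendor_id = 0x1111
vendor_part_id = 100,101
mtu = 2048
use_eager_rdma = 1
"""

ID_KEYS = ("vendor_id", "vendor_part_id")


def _id_set(raw):
    """Parse a comma-separated list of decimal or 0x-prefixed integers."""
    return {int(tok.strip(), 0) for tok in raw.split(",") if tok.strip()}


def lookup_params(ini_text, vendor_id, part_id):
    """Return (section_name, params) for a device, else the general section."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    for name in parser.sections():
        if name == "general":
            continue
        section = parser[name]
        if (vendor_id in _id_set(section.get("vendor_id", ""))
                and part_id in _id_set(section.get("vendor_part_id", ""))):
            # Device-specific match: return everything except the ID keys.
            params = {k: v for k, v in section.items() if k not in ID_KEYS}
            return name, params
    # No specific match: fall back to the general section.
    return "general", dict(parser["general"])
```

Calling lookup_params(SAMPLE_INI, 0x1111, 101) selects the "Vendor A HCA" section, while an unknown vendor/part pair gets the "general" values.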

> If so, maybe OpenMPI could fall back to very conservative settings if more
> than one HCA model was detected among the ranks. Or would this require
> communication in a stage where that would be complicated and/or ugly?

Today we don't do this kind of check; we just assume that every other
MPI process is using the same hardware and/or that the settings pulled
from the INI file will be compatible. AFAIK, most (all?) other MPIs do
the same thing.

We *could* do that kind of check, but:

a) there hasn't been enough customer demand for it / no one has
submitted a patch to do so
b) it might be a bit complicated because the startup sequence in the
openib BTL is a little complex
c) we are definitely moving to a scenario (at scale) where there is
little/no communication at startup about coordinating information from
all of the MPI peer processes; this strategy might be problematic in
those scenarios (i.e., the coordination / determination of
"conservative" settings would have to be done by a human and likely
pre-posted to a file on each node -- still hand-waving a bit because
that design isn't finalized/implemented yet)
d) programmatically finding what "conservative" settings are workable
across a wide variety of devices may be problematic because individual
device capabilities can vary wildly (does it have SRQ? can it support
more than one BSRQ? what's a good MTU? ...?)
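
As a rough illustration of point d), here is a minimal Python sketch of deriving "conservative" settings as the lowest common denominator of several devices' capabilities. The DeviceCaps fields and the conservative_settings helper are hypothetical, not Open MPI data structures, and taking the minimum of each capability is only one possible policy (the smallest supported MTU, for example, is safe but not necessarily a *good* MTU).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceCaps:
    """Hypothetical capability record for one HCA/RNIC model."""
    name: str
    has_srq: bool   # does the device support shared receive queues?
    max_bsrq: int   # how many SRQs it can use at once (0 if none)
    max_mtu: int    # largest MTU the device supports, in bytes


def conservative_settings(devices):
    """Pick settings that every listed device can support."""
    if not devices:
        raise ValueError("need at least one device")
    return {
        # A feature is usable only if *all* devices have it.
        "use_srq": all(d.has_srq for d in devices),
        # Numeric limits collapse to the smallest value seen.
        "num_bsrq": min(d.max_bsrq for d in devices),
        "mtu": min(d.max_mtu for d in devices),
    }
```

Even in this toy form, a single device without SRQ support turns SRQ off for everyone, which is the sort of performance cliff that makes an automatic fallback unattractive.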

I think d) is a big sticking point; we *could* define extremely
conservative settings that should probably work everywhere, but I can
see at least one potentially problematic scenario:

- cluster has N nodes
- a year later, an HCA in 1 node dies
- get a new HCA, perhaps even from a different vendor
- capabilities of the new HCA and old HCAs are different
- so OMPI falls back to "extreme conservative" settings
- jobs that run on that one node suffer in performance
- jobs that do not run on that node see "normal" performance
- users are confused

I suppose that we could print a Big Hairy Warning(tm) if we fall back
to extremely conservative settings, but it still seems to create the
potential to violate the Law of Least Astonishment.
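
For concreteness, here is a minimal Python sketch of the fallback-plus-warning behavior discussed above. Open MPI does not do this today; the choose_settings function, its arguments, and the warning text are all hypothetical.

```python
import warnings


def choose_settings(model_by_node, tuned, conservative):
    """Use tuned settings if all nodes share one HCA model, else fall back.

    model_by_node: dict mapping node name -> HCA model name
    tuned:         dict mapping HCA model name -> settings dict
    conservative:  settings dict assumed to work on every device
    """
    models = set(model_by_node.values())
    if len(models) == 1:
        # Homogeneous job: "normal" performance.
        return tuned[next(iter(models))]
    # Heterogeneous job: warn loudly, then use the slow-but-safe settings.
    warnings.warn(
        "Heterogeneous HCAs detected (" + ", ".join(sorted(models)) + "); "
        "falling back to conservative settings, performance may be reduced",
        RuntimeWarning,
    )
    return conservative
```

Note that the result depends on which nodes a job happens to land on: a job that includes the one node with the replacement HCA gets the conservative settings, while the same job on other nodes does not. That is exactly the behavior users would find confusing.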

Jeff Squyres
Cisco Systems