Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: Linuxes shipping libibverbs
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-05-21 12:03:12

On May 21, 2008, at 11:14 AM, Brian W. Barrett wrote:

> I think having a parameter to turn off the warning is a great idea. So
> great, in fact, that it already exists in the trunk and v1.2 :)! Setting
> the default value for the btl_base_warn_component_unused flag from 1 to 0
> will have the desired effect.

Ah, ok. I either didn't know about this flag or forgot about it. :-)

I just tested this myself and see that there are actually *two* error
messages (on a machine where I installed libibverbs, but with no
OpenFabrics hardware, with OMPI 1.2.6):

% mpirun -np 1 hello
libibverbs: Fatal: couldn't read uverbs ABI version.
[0,1,0]: OpenIB on host was unable to find any HCAs.
Another transport will be used instead, although this may result in
lower performance.

So the MCA param takes care of the OMPI message; I'll contact the
libibverbs authors about their message.
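
For reference, a minimal sketch of turning the OMPI warning off for one shell session. It assumes the usual OMPI_MCA_<param> environment-variable convention applies to this parameter; the libibverbs message is not affected by it:

```shell
# Assumption: Open MPI picks up MCA parameters from OMPI_MCA_<name> variables.
export OMPI_MCA_btl_base_warn_component_unused=0

# Equivalent per-run form (shown only, not executed; "hello" is the test
# program from the run above):
#   mpirun --mca btl_base_warn_component_unused 0 -np 1 hello
echo "btl_base_warn_component_unused=$OMPI_MCA_btl_base_warn_component_unused"
```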

> I'm not sure I agree with setting the default to 0, however. The warning
> has proven extremely useful for diagnosing that IB (or less often GM or
> MX) isn't properly configured on a compute node due to some random error.
> It's trivially easy for any packaging group to have the line
> btl_base_warn_component_unused = 0
> added to $prefix/etc/openmpi-mca-params.conf during the install phase of
> the package build (indeed, our simple build scripts at LANL used to do
> this on a regular basis due to our need to tweak the OOB to keep IPoIB
> happier at scale).
> I think keeping the Debian guys happy is a good thing. Giving them an
> easy way to turn off silly warnings is a good thing. Removing a known
> useful warning to help them doesn't seem like a good thing.
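
The packaging step described above could look roughly like the following. This is only a sketch: the temporary staging directory is a hypothetical stand-in for the real install prefix, and an actual package build would do this in its own install hook:

```shell
# Hypothetical staging prefix; a real package build would use its own $prefix.
prefix="${prefix:-$(mktemp -d)}"
conf="$prefix/etc/openmpi-mca-params.conf"
mkdir -p "$prefix/etc"

# Append the setting only if the file does not already define it,
# so re-running the install step does not duplicate the line.
if ! grep -q '^btl_base_warn_component_unused' "$conf" 2>/dev/null; then
    echo 'btl_base_warn_component_unused = 0' >> "$conf"
fi
```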

I guess that this is what I am torn about. Yes, it's a useful message
-- in some cases. But now that libibverbs is shipping in Debian and
other Linuxes, the number of machines out there with verbs-capable
hardware is far, far smaller than the number of machines without
verbs-capable hardware. Specifically:

1. The number of cases where seeing the message by default is *not*
useful is now potentially [much] larger than the number of cases where
the default message is useful.

2. An out-of-the-box "mpirun a.out" will print warning messages in
perfectly valid/good configurations (no verbs-capable hardware, but
just happen to have libibverbs installed). This is a Big Deal.

3. Problems with HCA hardware and/or verbs stack are uncommon
(nowadays). I'd be ok asking someone to enable a debug flag to get
more information on configuration problems or hardware faults.

Shouldn't we be optimizing for the common case?

In short: I think it's no longer safe to assume that machines with
libibverbs installed must also have verbs-capable hardware.

Jeff Squyres
Cisco Systems