Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: Linuxes shipping libibverbs
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-05-21 12:03:12


On May 21, 2008, at 11:14 AM, Brian W. Barrett wrote:

> I think having a parameter to turn off the warning is a great idea. So
> great, in fact, that it already exists in the trunk and v1.2 :)!
> Setting the default value for the btl_base_warn_component_unused flag
> from 1 to 0 will have the desired effect.

Ah, ok. I either didn't know about this flag or forgot about it. :-)

I just tested this myself and see that there are actually *two* error
messages (on a machine where I installed libibverbs, but with no
OpenFabrics hardware, with OMPI 1.2.6):

% mpirun -np 1 hello
libibverbs: Fatal: couldn't read uverbs ABI version.
--------------------------------------------------------------------------
[0,1,0]: OpenIB on host eddie.osl.iu.edu was unable to find any HCAs.
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------

So the MCA param takes care of the OMPI message; I'll contact the
libibverbs authors about their message.
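
For a single run, the same MCA parameter can also be passed on the
mpirun command line instead of changing any default. A minimal sketch,
assuming a wrapper script launches the job; the helper name
quiet_mpirun_command is hypothetical, while "--mca <name> <value>" is
the standard mpirun syntax for setting an MCA parameter:

```python
def quiet_mpirun_command(argv):
    """Build an mpirun command line with the unused-BTL warning disabled.

    Hypothetical helper for illustration; `argv` is the rest of the
    mpirun command line, e.g. ["-np", "1", "hello"].
    """
    # "--mca <name> <value>" sets one MCA parameter for this run only.
    # The same setting can be supplied through the environment variable
    # OMPI_MCA_btl_base_warn_component_unused=0.
    return ["mpirun", "--mca", "btl_base_warn_component_unused", "0", *argv]
```

The returned list can be handed to subprocess.run(). Note that this only
silences the OMPI message, not the one printed by libibverbs itself.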

> I'm not sure I agree with setting the default to 0, however. The
> warning has proven extremely useful for diagnosing that IB (or, less
> often, GM or MX) isn't properly configured on a compute node due to
> some random error. It's trivially easy for any packaging group to have
> the line
>
> btl_base_warn_component_unused = 0
>
> added to $prefix/etc/openmpi-mca-params.conf during the install phase
> of the package build (indeed, our simple build scripts at LANL used to
> do this on a regular basis due to our need to tweak the OOB to keep
> IPoIB happier at scale).
>
> I think keeping the Debian guys happy is a good thing. Giving them an
> easy way to turn off silly warnings is a good thing. Removing a known
> useful warning to help them doesn't seem like a good thing.
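
For reference, the packaging step Brian describes could look roughly
like the sketch below. It assumes the package build can run a small
Python helper during its install phase; the function name
ensure_mca_param is hypothetical, and only the file name and the
"name = value" line come from the message above:

```python
from pathlib import Path


def ensure_mca_param(conf_path, name, value):
    """Set `name = value` in an MCA params file, replacing older settings.

    Hypothetical install-phase helper; safe to run more than once.
    """
    path = Path(conf_path)
    lines = path.read_text().splitlines() if path.exists() else []
    setting = f"{name} = {value}"
    kept = []
    for line in lines:
        stripped = line.strip()
        # Drop earlier, uncommented assignments of the same parameter;
        # comments, blank lines, and other parameters are kept as-is.
        if not stripped.startswith("#") and stripped.split("=")[0].strip() == name:
            continue
        kept.append(line)
    kept.append(setting)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(kept) + "\n")
    return setting
```

A package build would call it once with its own install prefix, for
example ensure_mca_param(prefix + "/etc/openmpi-mca-params.conf",
"btl_base_warn_component_unused", 0), where prefix is whatever the
package was configured with.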

I guess that this is what I am torn about. Yes, it's a useful message
-- in some cases. But now that libibverbs is shipping in Debian and
other Linuxes, the number of machines out there with verbs-capable
hardware is far, far smaller than the number of machines without verbs-
capable hardware. Specifically:

1. The number of cases where seeing the message by default is *not*
useful is now potentially [much] larger than the number of cases where
the default message is useful.

2. An out-of-the-box "mpirun a.out" will print warning messages in
perfectly valid/good configurations (no verbs-capable hardware, but
just happen to have libibverbs installed). This is a Big Deal.

3. Problems with HCA hardware and/or the verbs stack are uncommon
(nowadays). I'd be ok asking someone to enable a debug flag to get
more information on configuration problems or hardware faults.

Shouldn't we be optimizing for the common case?

In short: I think it's no longer safe to assume that machines with
libibverbs installed must also have verbs-capable hardware.

-- 
Jeff Squyres
Cisco Systems