Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: Linuxes shipping libibverbs
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-05-21 16:56:49


On May 21, 2008, at 4:29 PM, Brian W. Barrett wrote:

> Previously, there has not been such a distinction, so I really have no
> idea which caused the openib BTL to throw its error (and never really
> cared, as it was always somebody else's problem at that point).

In the scenarios that I'm talking about, the ibv_devinfo(1) and
ibv_devices(1) commands should report that there are no devices (you
have OFED or equivalent installed but have no verbs-capable hardware):

-----
[15:21] queeg:~/mpi % ibv_devinfo
No IB devices found
[16:41] queeg:~/mpi % ibv_devices
     device node GUID
     ------ ----------------
[16:41] queeg:~/mpi %
-----

Since there's no need for an immediate change to the code base,
perhaps you could watch over the next few weeks and, when you see
problems of the kind that you're worried about, run ibv_devices and
ibv_devinfo. If you see OMPI-reported openfabrics problems with no
warnings from libibverbs itself (like I mentioned in my first mail)
and ibv_dev* report no devices, then we need to worry about cases
where the verbs stack itself doesn't even see the devices (which is a
Really Big Error; the OS/driver stack doesn't even see the device).

If ibv_dev* report that there *are* devices when you see the errors
that you're worried about, then OMPI would have gotten past this first
case and reported something a bit more specific. That is therefore a
different warning than the one I'm proposing to remove [by default].

> I'm only concerned about the case where there's an IB card, the user
> expects the IB card to be used, and the IB card isn't used.

Can you put in a site-wide

btl = ^tcp

to avoid the problem? If the IB card fails, then you'll get
unreachable MPI errors.

> If the changes don't silence a warning in that situation, I'm fine with
> whatever you do. But does ibv_get_device_list return an HCA when the
> port is down (because the SM failed and the machine rebooted since that
> time)?

Yes.

> If not, we still have a (fairly common, unfortunately) error case that
> we need to report (in my opinion).

Agreed. This scenario is already covered by the checking that the
openib BTL performs, and I agree that we should not remove this warning.

That being said, note that the current error-checking code in the
openib BTL only warns if *no* active ports are found on the host.
If there are multiple ports in a host where some are active and some
are [erroneously] not active, OMPI does not report this (because some
real-world users have dual-port HCAs but are only using 1 port).
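
To make the three cases in this mail concrete, here is a minimal sketch
in C of the decision as described above. It is not the openib BTL's
actual code: the enum, the function name check_ports, and the plain
int array standing in for per-port state are all hypothetical, chosen
only for illustration.

```c
#include <stddef.h>

/* Hypothetical outcome names for illustration; not OMPI identifiers. */
enum port_check_result {
    CHECK_NO_DEVICES,      /* verbs stack sees no devices at all */
    CHECK_NO_ACTIVE_PORTS, /* devices exist, but every port is down */
    CHECK_OK               /* at least one port is active */
};

/*
 * num_devices: how many devices the verbs stack reported.
 * active[i]:   nonzero if port i is active (assumed to be filled in
 *              by the caller from whatever port query it uses).
 */
static enum port_check_result check_ports(int num_devices,
                                          const int *active,
                                          size_t num_ports)
{
    if (num_devices == 0) {
        return CHECK_NO_DEVICES;
    }
    for (size_t i = 0; i < num_ports; i++) {
        if (active[i]) {
            /* One active port is enough; inactive siblings are
               not reported in this model. */
            return CHECK_OK;
        }
    }
    return CHECK_NO_ACTIVE_PORTS;
}
```

In this model a dual-port HCA with one port down still yields CHECK_OK,
which is exactly the gap described above.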

Two options jump to mind:

1. Add yet another MCA param to say "all my ports should be active;
warn/error if you find any non-active ports."
2. Add yet another MCA param where ports that *should* be active are
itemized. If OMPI finds that any of them are not active, warn/error.

#1 could really be a special case of #2 (e.g., a keyword "all").
Neither of these options would be too difficult to do, but we are
technically feature frozen...
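
For what it's worth, option #2 with the "all" keyword might look
roughly like the sketch below. Everything here is an assumption made
for illustration: no such MCA param exists, the comma-separated
"device:port" syntax is invented, and struct port_status is just a
stand-in for whatever the BTL already knows about each port.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for per-port state; not an OMPI type. */
struct port_status {
    const char *name;   /* e.g. "mthca0:1" (invented naming scheme) */
    bool active;
};

/* Return true if name appears as an exact token in a comma-separated list. */
static bool list_contains(const char *list, const char *name)
{
    size_t name_len = strlen(name);
    const char *p = list;

    while (*p != '\0') {
        size_t tok_len = strcspn(p, ",");
        if (tok_len == name_len && strncmp(p, name, name_len) == 0) {
            return true;
        }
        p += tok_len;
        if (*p == ',') {
            p++;
        }
    }
    return false;
}

/*
 * Count ports that are expected to be active but are not.
 * expected is either the keyword "all" (option #1) or an itemized
 * list of port names (option #2).
 */
static size_t count_inactive_expected(const char *expected,
                                      const struct port_status *ports,
                                      size_t nports)
{
    bool want_all = (strcmp(expected, "all") == 0);
    size_t missing = 0;

    for (size_t i = 0; i < nports; i++) {
        if (ports[i].active) {
            continue;
        }
        if (want_all || list_contains(expected, ports[i].name)) {
            missing++;
        }
    }
    return missing;
}
```

A nonzero return would be the trigger for the warn/error. Note that
this sketch only looks at ports that were actually found; a port named
in the list but absent from the host would need a separate check.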

-- 
Jeff Squyres
Cisco Systems