On May 21, 2008, at 4:29 PM, Brian W. Barrett wrote:
> Previously, there has not been such a distinction, so I really have no
> idea which caused the openib BTL to throw its error (and never really
> as it was always somebody else's problem at that point).
In the scenarios that I'm talking about, the ibv_devinfo(1) and
ibv_devices(1) commands should report that there are no devices (i.e.,
you have OFED or equivalent installed but no verbs-capable hardware):
[15:21] queeg:~/mpi % ibv_devinfo
No IB devices found
[16:41] queeg:~/mpi % ibv_devices
device node GUID
[16:41] queeg:~/mpi %
Since there's no need for an immediate change to the code base --
perhaps you could watch over the next few weeks and when you see
problems of the kind that you're worried about, run ibv_devices and
ibv_devinfo. If you see OMPI-reported openfabrics problems with no
warnings from libibverbs itself (like I mentioned in my first mail)
and ibv_dev* are reporting no devices, then we need to worry about
cases where the OS/driver stack, and therefore the verbs stack itself,
doesn't even see the devices (which is a Really Big Error).
If ibv_dev* report that there *are* devices when you see the errors
that you're worried about, then OMPI would have gotten past this first
case and reported something a bit more specific. That is therefore a
different warning than the one I'm proposing to remove [by default].
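
To make the triage above concrete, here is a rough sketch in Python.
The function name, its arguments, and the returned labels are all made
up for illustration; nothing like this exists in OMPI. It only encodes
the reasoning in the two paragraphs above.

```python
def triage(ompi_reported_problem: bool,
           libibverbs_warned: bool,
           num_devices: int) -> str:
    """Classify one observation (hypothetical helper, not OMPI code).

    num_devices is the number of devices that ibv_devices / ibv_devinfo
    listed on the node at the time of the problem.
    """
    if not ompi_reported_problem:
        return "nothing to investigate"
    if num_devices > 0:
        # Devices are visible, so OMPI got past the "no devices" check
        # and whatever it printed is a more specific, different warning.
        return "different warning"
    if libibverbs_warned:
        # libibverbs complained on its own; not the case discussed here.
        return "libibverbs warned"
    # OMPI complained, libibverbs was silent, and no devices are listed.
    return "verbs stack sees no devices"


# The case worth worrying about:
result = triage(ompi_reported_problem=True,
                libibverbs_warned=False,
                num_devices=0)
```

Only the last outcome would argue for keeping the "no devices" warning
on by default.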
> I'm only concerned about the case where there's an IB card, the user
> expects the IB card to be used, and the IB card isn't used.
Can you put in a site-wide
btl = ^tcp
to avoid the problem? With TCP excluded, if the IB card fails, then
you'll get unreachable MPI errors.
> If the
> changes don't silence a warning in that situation, I'm fine with
> you do. But does ibv_get_device_list return an HCA when the port is
> (because the SM failed and the machine rebooted since that time)?
> If not,
> we still have a (fairly common, unfortunately) error case that we
> need to
> report (in my opinion).
Agreed. This scenario is already covered by the checking that the
openib BTL performs, and I agree that we should not remove this warning.
That being said, note that the current error-checking code in the
openib BTL only reports if *no* active ports are found on the host.
If there are multiple ports in a host where some are active and some
are [erroneously] not active, OMPI does not report this (because some
real-world users have dual-port HCAs but are only using 1 port).
Two options jump to mind:
1. Add yet another MCA param to say "all my ports should be active;
warn/error if you find any non-active ports."
2. Add yet another MCA param where ports that *should* be active are
itemized. If OMPI finds that any of them are not active, warn/error.
#1 could really be a special case of #2 (e.g., a keyword "all").
Neither of these options would be too difficult to do, but we
technically are feature frozen...
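
A rough sketch of how #2 could behave, with #1 folded in as the "all"
keyword. This is Python for brevity, not a proposed patch: the
parameter format (a comma-separated list of "device:port" names), the
function name, and the dictionary of port states are all assumptions
for illustration.

```python
from typing import Dict, List


def inactive_expected_ports(expected: str,
                            port_active: Dict[str, bool]) -> List[str]:
    """Return the ports that should be active but are not.

    expected    -- value of the hypothetical MCA param: "" (check
                   nothing), "all", or e.g. "mthca0:1,mthca0:2"
    port_active -- maps "device:port" to True if that port is active
    """
    value = expected.strip()
    if not value:
        # Param unset: keep today's behavior, nothing itemized.
        return []
    if value.lower() == "all":
        # Option #1 as a special case of option #2.
        wanted = list(port_active)
    else:
        wanted = [p.strip() for p in value.split(",") if p.strip()]
    # A port that was itemized but not found at all counts as inactive.
    return sorted(p for p in wanted if not port_active.get(p, False))


# Dual-port HCA with only port 1 cabled (made-up device name):
ports = {"mthca0:1": True, "mthca0:2": False}
to_warn_about = inactive_expected_ports("all", ports)
```

With the param left empty, a dual-port HCA using a single port stays
quiet, which matches the current behavior; anything returned by the
function would turn into a warning or an error.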