Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: Linuxes shipping libibverbs
From: Brian W. Barrett (brbarret_at_[hidden])
Date: 2008-05-21 16:29:59

On Wed, 21 May 2008, Jeff Squyres wrote:

> On May 21, 2008, at 3:38 PM, Jeff Squyres wrote:
>>> It would be great if libibverbs could return two different error
>>> messages - one for "there's no IB card in this machine" and one for
>>> "there's an IB card here, but we can't initialize it". I think that
>>> would make this argument go away. Open MPI could probably mimic that
>>> behavior by parsing the PCI tables, but that sounds ... painful.
> Thinking about this a bit more -- I think it depends on what kind of
> errors you are worried about seeing. IBV does separate the discovery
> of devices (ibv_get_device_list) from trying to open a device
> (ibv_open_device). So hypothetically, we *can* distinguish between
> these kinds of errors already.
> Do you see devices that are so broken that they don't show up in the
> list returned from ibv_get_device_list?
> FWIW: the *only* case I'm talking about changing the default for is
> when ibv_get_device_list returns an empty list (meaning that according
> to the verbs stack, there are no devices in the host). I think that
> we should *always* warn for any kinds of errors that occur after that
> (e.g., we find a device but can't open it, we find one or more devices
> but no active ports, etc.).

Previously, there has not been such a distinction, so I really have no
idea which one caused the openib BTL to throw its error (and never really
cared, as it was always somebody else's problem at that point).

I'm only concerned about the case where there's an IB card, the user
expects the IB card to be used, and the IB card isn't used. If the
changes don't silence a warning in that situation, I'm fine with whatever
you do. But does ibv_get_device_list return an HCA when the port is down
(because the SM failed and the machine rebooted since that time)? If not,
we still have a (fairly common, unfortunately) error case that we need to
report (in my opinion).
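
To make the cases in this thread concrete, here is a rough sketch of the
warning policy being discussed. It is not Open MPI code: the enum, the
function names, and the three counts are hypothetical, and the sketch
assumes the counts would be gathered from ibv_get_device_list (devices
listed), ibv_open_device (devices opened), and ibv_query_port (ports
reporting IBV_PORT_ACTIVE). Only the empty-list case stays silent; every
later failure, including "device present but port down", still warns.

```c
/* Hypothetical outcome categories; not part of Open MPI or libibverbs. */
enum probe_result {
    PROBE_NO_DEVICES,      /* the device list came back empty */
    PROBE_OPEN_FAILED,     /* devices were listed, but none could be opened */
    PROBE_NO_ACTIVE_PORTS, /* devices opened, but no port is active */
    PROBE_OK               /* at least one usable port was found */
};

/* Classify a probe from three counts (assumed to be collected by the caller). */
enum probe_result classify_probe(int num_listed, int num_opened,
                                 int num_active_ports)
{
    if (num_listed <= 0) {
        return PROBE_NO_DEVICES;
    }
    if (num_opened <= 0) {
        return PROBE_OPEN_FAILED;
    }
    if (num_active_ports <= 0) {
        return PROBE_NO_ACTIVE_PORTS;
    }
    return PROBE_OK;
}

/* Warn for every failure except "no devices in the host at all". */
int should_warn(enum probe_result result)
{
    return result == PROBE_OPEN_FAILED || result == PROBE_NO_ACTIVE_PORTS;
}
```

With this split, should_warn(classify_probe(1, 1, 0)) is nonzero, so the
"IB card present, port down" case would still be reported, provided the
device does show up in the list in that state.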