Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: Linuxes shipping libibverbs
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-05-22 15:35:03


Dirk / Debian guys --

When you install binary OMPI (which pulls in libibverbs and all the
rest), do you set the OpenFabrics kernel drivers to start upon boot?
Or does the user have to do that manually?

I ask because of the check Pasha proposes: if the user has started the
OpenFabrics kernel drivers, it's ok for OMPI to print warning messages
(this is better than the current: if libibverbs exists, it's ok for
OMPI to print warning messages).

On May 22, 2008, at 3:25 PM, Pavel Shamis (Pasha) wrote:

>
>>
>>> 1. Driver doesn't support the HCA - If I remember correctly, RH40
>>> by default doesn't support the ConnectX HCA. The device_list will
>>> be empty. It is a very exotic case.
>>> 2. Driver version doesn't correspond with FW version
>>> 3. FW was broken
>>> 4. Driver was broken and failed to start - it is not a very exotic
>>> case either. Sometimes the user makes some modification (upgrade,
>>> install, etc.) and it breaks the driver.
>>>
>>>> In such cases, the ibv_devinfo(1) and ibv_devices(1) commands
>>>> would show the same error.
>>> Yep these utilities will show the same error.
>>>
>>> Cases 1-2-3 we can cover pretty simply. The OPENIB driver creates
>>> "/dev/infiniband" during its startup. So if /dev/infiniband exists
>>> and _get_device_list() is empty we may print a warning.
>>
>> Ok, that seems reasonable.
>>
>>> I don't know how we can cover case 4 :-(
>>
>> If the user makes modifications to the driver and breaks it, I
>> don't think we can be held responsible for that -- prudence
>> declares that you should verify that your [self-modified] driver is
>> not broken first before blaming Open MPI. I'm not that concerned
>> about #4; most of my customers do not modify the drivers.
> Agree about #4.
>
> The check for /dev/infiniband should be simple and I think we can
> add it to 1.3.
>
> Pasha.
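
For reference, the check discussed above can be sketched roughly as
follows. This is only an illustration of the logic, not the code that
would go into OMPI: the function name and the device_count parameter are
hypothetical, and the caller is assumed to have already obtained the
number of devices from the verbs device list call.

```python
import os


def should_warn_no_devices(device_count, dev_dir="/dev/infiniband"):
    """Decide whether an empty device list deserves a warning.

    Hypothetical helper: device_count is assumed to be the number of
    entries the verbs device list call returned. The directory check
    stands in for "the OpenFabrics kernel drivers were started".
    """
    # The driver creates /dev/infiniband during its startup, so a
    # missing directory means the drivers were never started.
    drivers_started = os.path.isdir(dev_dir)
    # Warn only when the drivers are up but no device was found
    # (cases 1-3 in the list above).
    return drivers_started and device_count == 0
```

With this logic, a machine that merely has libibverbs installed, but
never started the drivers, stays silent.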

-- 
Jeff Squyres
Cisco Systems