Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OPENIB unknown transport errors
From: Tim Miller (btamiller_at_[hidden])
Date: 2014-06-06 17:53:22


Hi Josh,

I asked one of our more advanced users to add the "-mca btl_openib_if_include
mlx4_0:1" argument to his job script. Unfortunately, the same error
occurred as before.
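
For reference, the launch line in his job script now looks roughly like the
following (the process count and application name are placeholders; the
hostfile comes from the TORQUE allocation):

  # restrict the openib BTL to device mlx4_0, port 1, as suggested
  mpirun -np 64 --hostfile $PBS_NODEFILE \
      -mca btl_openib_if_include mlx4_0:1 \
      ./our_mpi_app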

We'll keep digging on our end; if you have any other suggestions, please
let us know.

Tim

On Thu, Jun 5, 2014 at 7:32 PM, Tim Miller <btamiller_at_[hidden]> wrote:

> Hi Josh,
>
> Thanks for attempting to sort this out. In answer to your questions:
>
> 1. Node allocation is done by TORQUE; however, we don't use the TM API to
> launch jobs (long story). Instead, we just pass a hostfile to mpirun, and
> mpirun uses the ssh launcher to communicate with the remote nodes and launch
> the processes there (see the sketch after this list).
> 2. We have only one port per HCA (the HCA silicon is integrated with the
> motherboard on most of our nodes, including all that have this issue). They
> are all configured to use InfiniBand (no IPoIB or other protocols).
> 3. No, we don't explicitly ask for a device port pair. We will try your
> suggestion and report back.
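>
> For illustration, the launch amounts to roughly this (process count and
> binary name are placeholders; the hostfile is generated from $PBS_NODEFILE):
>
>   # force the rsh/ssh launcher, which is what ends up being used in our
>   # setup anyway, and point mpirun at the generated hostfile
>   mpirun -np 64 --hostfile hosts.txt -mca plm rsh ./our_mpi_app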
>
> Thanks again!
>
> Tim
>
>
> On Thu, Jun 5, 2014 at 2:22 PM, Joshua Ladd <jladd.mlnx_at_[hidden]> wrote:
>
>> Strange indeed. This info (the remote adapter info) is passed around in the
>> modex, and the struct is populated locally during add_procs.
>>
>> 1. How do you launch jobs? With mpirun, srun, or something else?
>> 2. How many active ports do you have on each HCA? Are they all configured
>> to use IB?
>> 3. Do you explicitly ask for a device:port pair with the "if_include" MCA
>> param? If not, can you please add "-mca btl_openib_if_include mlx4_0:1"
>> (assuming you have a ConnectX-3 HCA and port 1 is configured to run over
>> IB)? A quick way to confirm the device and port names is sketched below.
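>>
>> A quick way to confirm the device name and an active IB port for that
>> parameter (standard OFED/libibverbs tools, assuming they are installed):
>>
>>   ibv_devices                                    # lists HCA names, e.g. mlx4_0
>>   ibstat mlx4_0 1 | grep -E 'State|Link layer'   # port 1 state and link layer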
>>
>> Josh
>>
>>
>> On Wed, Jun 4, 2014 at 12:47 PM, Tim Miller <btamiller_at_[hidden]> wrote:
>>
>>> Hi,
>>>
>>> I'd like to revive this thread, since I am still periodically getting
>>> errors of this type. I have built 1.8.1 with --enable-debug and run with
>>> -mca btl_openib_verbose 10. Unfortunately, this doesn't seem to provide any
>>> additional information that I find useful. I've gone ahead and attached
>>> a dump of the output under 1.8.1. The key lines are:
>>>
>>>
>>> --------------------------------------------------------------------------
>>> Open MPI detected two different OpenFabrics transport types in the same
>>> Infiniband network.
>>> Such mixed network trasport configuration is not supported by Open MPI.
>>>
>>> Local host: w1
>>> Local adapter: mlx4_0 (vendor 0x2c9, part ID 26428)
>>> Local transport type: MCA_BTL_OPENIB_TRANSPORT_IB
>>>
>>> Remote host: w16
>>> Remote Adapter: (vendor 0x2c9, part ID 26428)
>>> Remote transport type: MCA_BTL_OPENIB_TRANSPORT_UNKNOWN
>>> -------------------------------------------------------------------------
>>>
>>> Note that the vendor and part IDs are the same. If I immediately run on
>>> the same two nodes using MVAPICH2, everything is fine.
>>>
>>> I'm really befuddled by this. Open MPI sees that the two cards are the
>>> same and made by the same vendor, yet it thinks the transport types are
>>> different (and one of them is unknown). I'm hoping someone with experience
>>> with how the OpenIB BTL works can shed some light on this problem...
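>>>
>>> If it would help, I can run something like the following on both hosts and
>>> report back what the verbs layer itself says about the device (host and
>>> device names taken from the error message above):
>>>
>>>   ssh w1  'ibv_devinfo -d mlx4_0 | grep -E "transport|link_layer"'
>>>   ssh w16 'ibv_devinfo -d mlx4_0 | grep -E "transport|link_layer"'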
>>>
>>> Tim
>>>
>>>
>>> On Fri, May 9, 2014 at 7:39 PM, Joshua Ladd <jladd.mlnx_at_[hidden]>
>>> wrote:
>>>
>>>>
>>>> Just wondering if you've tried with the latest stable OMPI, 1.8.1? I suspect
>>>> this may be an issue with the OOB. If you have a debug build, you can run
>>>> with -mca btl_openib_verbose 10.
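>>>>
>>>> For example, roughly (install prefix, hostfile, and test binary are
>>>> placeholders):
>>>>
>>>>   ./configure --prefix=$HOME/ompi-1.8.1-dbg --enable-debug
>>>>   make -j8 install
>>>>   mpirun -np 2 --hostfile hosts.txt -mca btl_openib_verbose 10 ./a.out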
>>>>
>>>> Josh
>>>>
>>>>
>>>> On Fri, May 9, 2014 at 6:26 PM, Joshua Ladd <jladd.mlnx_at_[hidden]>
>>>> wrote:
>>>>
>>>>> Hi, Tim
>>>>>
>>>>> Run "ibstat" on each host:
>>>>>
>>>>> 1. Make sure the adapters are alive and active.
>>>>>
>>>>> 2. Look at the Link Layer setting for host w34. Does it match host
>>>>> w4's? (A quick check for both hosts is sketched below.)
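>>>>>
>>>>> Something along these lines (run from a login node, assuming passwordless
>>>>> ssh to the compute nodes) shows both at a glance:
>>>>>
>>>>>   for h in w4 w34; do
>>>>>       echo "== $h =="
>>>>>       ssh $h "ibstat | grep -E 'State|Physical state|Link layer'"
>>>>>   done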
>>>>>
>>>>>
>>>>> Josh
>>>>>
>>>>>
>>>>> On Fri, May 9, 2014 at 1:18 PM, Tim Miller <btamiller_at_[hidden]>
>>>>> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> We're using Open MPI 1.7.3 with Mellanox ConnectX InfiniBand adapters,
>>>>>> and periodically our jobs abort at start-up with the following error:
>>>>>>
>>>>>> ===
>>>>>> Open MPI detected two different OpenFabrics transport types in the
>>>>>> same Infiniband network.
>>>>>> Such mixed network trasport configuration is not supported by Open
>>>>>> MPI.
>>>>>>
>>>>>> Local host: w4
>>>>>> Local adapter: mlx4_0 (vendor 0x2c9, part ID 26428)
>>>>>> Local transport type: MCA_BTL_OPENIB_TRANSPORT_IB
>>>>>>
>>>>>> Remote host: w34
>>>>>> Remote Adapter: (vendor 0x2c9, part ID 26428)
>>>>>> Remote transport type: MCA_BTL_OPENIB_TRANSPORT_UNKNOWN
>>>>>> ===
>>>>>>
>>>>>> I've done a bit of googling and not found very much. We do not see
>>>>>> this issue when we run with MVAPICH2 on the same sets of nodes.
>>>>>>
>>>>>> Any advice or thoughts would be very welcome, as I am stumped by what
>>>>>> causes this. The nodes are all running Scientific Linux 6 with Mellanox
>>>>>> drivers installed via the SL-provided RPMs.
>>>>>>
>>>>>> Tim