Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OPENIB unknown transport errors
From: Tim Miller (btamiller_at_[hidden])
Date: 2014-06-05 19:32:16


Hi Josh,

Thanks for attempting to sort this out. In answer to your questions:

1. Node allocation is done by TORQUE; however, we don't use the TM API to
launch jobs (long story). Instead, we just pass a hostfile to mpirun, and
mpirun uses the ssh launcher to communicate with the remote nodes and launch
the processes on them.
2. We have only one port per HCA (the HCA silicon is integrated with the
motherboard on most of our nodes, including all that have this issue). They
are all configured to use InfiniBand (no IPoIB or other protocols).
3. No, we don't explicitly ask for a device:port pair. We will try your
suggestion and report back.
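
For reference, here is a rough sketch of the kind of launch line discussed
in points 1 and 3: a hostfile passed to mpirun, plus the
btl_openib_if_include setting Josh suggested. The hostfile name, process
count, and program path are hypothetical placeholders, and mlx4_0:1 is only
the value assumed in Josh's message (ConnectX-3, port 1 running IB); it is
not a confirmed fix.

```python
import shlex


def build_mpirun_command(hostfile, nprocs, program, device="mlx4_0", port=1):
    """Return an mpirun argument list that restricts the openib BTL
    to a single device:port pair.

    The device/port defaults mirror the value suggested in the thread;
    adjust them to match what ibstat reports on the nodes.
    """
    return [
        "mpirun",
        "--hostfile", hostfile,          # nodes allocated by the scheduler
        "-np", str(nprocs),              # number of MPI processes
        "-mca", "btl_openib_if_include", f"{device}:{port}",
        program,
    ]


# Hypothetical values, for illustration only.
cmd = build_mpirun_command("hosts.txt", 32, "./my_app")
print(shlex.join(cmd))
```

Building the command as a list and printing it with shlex.join keeps each
argument intact, so the same list can be handed to subprocess.run unchanged.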

Thanks again!

Tim

On Thu, Jun 5, 2014 at 2:22 PM, Joshua Ladd <jladd.mlnx_at_[hidden]> wrote:

> Strange indeed. This info (remote adapter info) is passed around in the
> modex and the struct is locally populated during add procs.
>
> 1. How do you launch jobs? Mpirun, srun, or something else?
> 2. How many active ports do you have on each HCA? Are they all configured
> to use IB?
> 3. Do you explicitly ask for a device:port pair with the "if include" mca
> param? If not, can you please add "-mca btl_openib_if_include mlx4_0:1"
> (assuming you have a ConnectX-3 HCA and port 1 is configured to run over
> IB.)
>
> Josh
>
>
> On Wed, Jun 4, 2014 at 12:47 PM, Tim Miller <btamiller_at_[hidden]> wrote:
>
>> Hi,
>>
>> I'd like to revive this thread, since I am still periodically getting
>> errors of this type. I have built 1.8.1 with --enable-debug and run with
>> -mca btl_openib_verbose 10. Unfortunately, this doesn't seem to provide any
>> additional information that I can find useful. I've gone ahead and attached
>> a dump of the output under 1.8.1. The key lines are:
>>
>> --------------------------------------------------------------------------
>> Open MPI detected two different OpenFabrics transport types in the same
>> Infiniband network.
>> Such mixed network trasport configuration is not supported by Open MPI.
>>
>> Local host: w1
>> Local adapter: mlx4_0 (vendor 0x2c9, part ID 26428)
>> Local transport type: MCA_BTL_OPENIB_TRANSPORT_IB
>>
>> Remote host: w16
>> Remote Adapter: (vendor 0x2c9, part ID 26428)
>> Remote transport type: MCA_BTL_OPENIB_TRANSPORT_UNKNOWN
>> -------------------------------------------------------------------------
>>
>> Note that the vendor and part IDs are the same. If I immediately run on
>> the same two nodes using MVAPICH2, everything is fine.
>>
>> I'm really very befuddled by this. OpenMPI sees that the two cards are
>> the same and made by the same vendor, yet it thinks the transport types are
>> different (and one is unknown). I'm hoping someone with some experience
>> with how the OpenIB BTL works can shed some light on this problem...
>>
>> Tim
>>
>>
>> On Fri, May 9, 2014 at 7:39 PM, Joshua Ladd <jladd.mlnx_at_[hidden]> wrote:
>>
>>>
>>> Just wondering if you've tried with the latest stable OMPI, 1.8.1? I'm
>>> wondering if this is an issue with the OOB. If you have a debug build, you
>>> can run with -mca btl_openib_verbose 10
>>>
>>> Josh
>>>
>>>
>>> On Fri, May 9, 2014 at 6:26 PM, Joshua Ladd <jladd.mlnx_at_[hidden]>
>>> wrote:
>>>
>>>> Hi, Tim
>>>>
>>>> Run "ibstat" on each host:
>>>>
>>>> 1. Make sure the adapters are alive and active.
>>>>
>>>> 2. Look at the Link Layer settings for host w34. Does it match host
>>>> w4's?
>>>>
>>>>
>>>> Josh
>>>>
>>>>
>>>> On Fri, May 9, 2014 at 1:18 PM, Tim Miller <btamiller_at_[hidden]> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> We're using OpenMPI 1.7.3 with Mellanox ConnectX InfiniBand adapters,
>>>>> and periodically our jobs abort at start-up with the following error:
>>>>>
>>>>> ===
>>>>> Open MPI detected two different OpenFabrics transport types in the
>>>>> same Infiniband network.
>>>>> Such mixed network trasport configuration is not supported by Open MPI.
>>>>>
>>>>> Local host: w4
>>>>> Local adapter: mlx4_0 (vendor 0x2c9, part ID 26428)
>>>>> Local transport type: MCA_BTL_OPENIB_TRANSPORT_IB
>>>>>
>>>>> Remote host: w34
>>>>> Remote Adapter: (vendor 0x2c9, part ID 26428)
>>>>> Remote transport type: MCA_BTL_OPENIB_TRANSPORT_UNKNOWN
>>>>> ===
>>>>>
>>>>> I've done a bit of googling and not found very much. We do not see
>>>>> this issue when we run with MVAPICH2 on the same sets of nodes.
>>>>>
>>>>> Any advice or thoughts would be very welcome, as I am stumped by what
>>>>> causes this. The nodes are all running Scientific Linux 6 with Mellanox
>>>>> drivers installed via the SL-provided RPMs.
>>>>>
>>>>> Tim
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> users_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>
>
>