Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OPENIB unknown transport errors
From: Joshua Ladd (jladd.mlnx_at_[hidden])
Date: 2014-06-05 14:22:42


Strange indeed. This info (the remote adapter info) is passed around in the
modex, and the struct is populated locally during add_procs.

1. How do you launch jobs? mpirun, srun, or something else?
2. How many active ports do you have on each HCA? Are they all configured
to use IB?
3. Do you explicitly ask for a device:port pair with the "if_include" MCA
param? If not, can you please add "-mca btl_openib_if_include mlx4_0:1"
(assuming you have a ConnectX-3 HCA and port 1 is configured to run over
IB)?
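
For questions 2 and 3, here is a minimal Python sketch (not part of Open MPI)
that reads "ibstat" output, lists the ports that are both Active and running
the InfiniBand link layer, and builds the matching btl_openib_if_include
argument. The sample text and the helper names (parse_ibstat,
active_ib_ports, if_include_args) are illustrative assumptions; in practice
you would feed it the real "ibstat" output captured on each node.

```python
import re

# Illustrative sample only: real input is the output of `ibstat` on a node.
SAMPLE_IBSTAT = """\
CA 'mlx4_0'
\tCA type: MT26428
\tNumber of ports: 2
\tPort 1:
\t\tState: Active
\t\tPhysical state: LinkUp
\t\tRate: 40
\t\tLink layer: InfiniBand
\tPort 2:
\t\tState: Down
\t\tPhysical state: Disabled
\t\tRate: 10
\t\tLink layer: Ethernet
"""


def parse_ibstat(text):
    """Return one dict per port: device, port, state, link_layer."""
    ports = []
    device = None
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"CA '(.+)'$", line)
        if m:
            # Start of a new adapter section.
            device = m.group(1)
            current = None
            continue
        m = re.match(r"Port (\d+):$", line)
        if m:
            # Start of a new port section under the current adapter.
            current = {"device": device, "port": int(m.group(1)),
                       "state": None, "link_layer": None}
            ports.append(current)
            continue
        if current is None:
            continue
        if line.startswith("State:"):
            current["state"] = line.split(":", 1)[1].strip()
        elif line.startswith("Link layer:"):
            current["link_layer"] = line.split(":", 1)[1].strip()
    return ports


def active_ib_ports(ports):
    """Return 'device:port' strings for ports that are Active and InfiniBand."""
    return [f"{p['device']}:{p['port']}" for p in ports
            if p["state"] == "Active" and p["link_layer"] == "InfiniBand"]


def if_include_args(ports):
    """Build the mpirun arguments restricting openib to the usable ports."""
    usable = active_ib_ports(ports)
    if not usable:
        return []
    return ["-mca", "btl_openib_if_include", ",".join(usable)]


if __name__ == "__main__":
    parsed = parse_ibstat(SAMPLE_IBSTAT)
    print(" ".join(if_include_args(parsed)))
```

Running the same check on both the local and the remote host and comparing
the results makes it easy to spot a node whose port is down or set to a
different link layer.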

Josh

On Wed, Jun 4, 2014 at 12:47 PM, Tim Miller <btamiller_at_[hidden]> wrote:

> Hi,
>
> I'd like to revive this thread, since I am still periodically getting
> errors of this type. I have built 1.8.1 with --enable-debug and run with
> -mca btl_openib_verbose 10. Unfortunately, this doesn't seem to provide any
> additional information that I can find useful. I've gone ahead and attached
> a dump of the output under 1.8.1. The key lines are:
>
> --------------------------------------------------------------------------
> Open MPI detected two different OpenFabrics transport types in the same
> Infiniband network.
> Such mixed network trasport configuration is not supported by Open MPI.
>
> Local host: w1
> Local adapter: mlx4_0 (vendor 0x2c9, part ID 26428)
> Local transport type: MCA_BTL_OPENIB_TRANSPORT_IB
>
> Remote host: w16
> Remote Adapter: (vendor 0x2c9, part ID 26428)
> Remote transport type: MCA_BTL_OPENIB_TRANSPORT_UNKNOWN
> -------------------------------------------------------------------------
>
> Note that the vendor and part IDs are the same. If I immediately run on
> the same two nodes using MVAPICH2, everything is fine.
>
> I'm really very befuddled by this. OpenMPI sees that the two cards are the
> same and made by the same vendor, yet it thinks the transport types are
> different (and one is unknown). I'm hoping someone with some experience
> with how the OpenIB BTL works can shed some light on this problem...
>
> Tim
>
>
> On Fri, May 9, 2014 at 7:39 PM, Joshua Ladd <jladd.mlnx_at_[hidden]> wrote:
>
>>
>> Just wondering if you've tried with the latest stable OMPI, 1.8.1? I'm
>> wondering if this is an issue with the OOB. If you have a debug build, you
>> can run with -mca btl_openib_verbose 10.
>>
>> Josh
>>
>>
>> On Fri, May 9, 2014 at 6:26 PM, Joshua Ladd <jladd.mlnx_at_[hidden]> wrote:
>>
>>> Hi, Tim
>>>
>>> Run "ibstat" on each host:
>>>
>>> 1. Make sure the adapters are alive and active.
>>>
>>> 2. Look at the Link Layer settings for host w34. Does it match host
>>> w4's?
>>>
>>>
>>> Josh
>>>
>>>
>>> On Fri, May 9, 2014 at 1:18 PM, Tim Miller <btamiller_at_[hidden]> wrote:
>>>
>>>> Hi All,
>>>>
>>>> We're using OpenMPI 1.7.3 with Mellanox ConnectX InfiniBand adapters,
>>>> and periodically our jobs abort at start-up with the following error:
>>>>
>>>> ===
>>>> Open MPI detected two different OpenFabrics transport types in the same
>>>> Infiniband network.
>>>> Such mixed network trasport configuration is not supported by Open MPI.
>>>>
>>>> Local host: w4
>>>> Local adapter: mlx4_0 (vendor 0x2c9, part ID 26428)
>>>> Local transport type: MCA_BTL_OPENIB_TRANSPORT_IB
>>>>
>>>> Remote host: w34
>>>> Remote Adapter: (vendor 0x2c9, part ID 26428)
>>>> Remote transport type: MCA_BTL_OPENIB_TRANSPORT_UNKNOWN
>>>> ===
>>>>
>>>> I've done a bit of googling and not found very much. We do not see this
>>>> issue when we run with MVAPICH2 on the same sets of nodes.
>>>>
>>>> Any advice or thoughts would be very welcome, as I am stumped by what
>>>> causes this. The nodes are all running Scientific Linux 6 with Mellanox
>>>> drivers installed via the SL-provided RPMs.
>>>>
>>>> Tim
>>>>
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>
>>>
>>>
>>
>>
>
>
>