
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] OPENIB unknown transport errors
From: Mike Dubman (miked_at_[hidden])
Date: 2014-06-07 02:21:10


Could you please attach the output of "ibv_devinfo -v" and "ofed_info -s"?
Thx
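
A minimal sketch for collecting that output across a set of nodes (it assumes passwordless ssh and a plain node list in ./hosts, both of which are placeholders):

    # Gather HCA details and the installed OFED version from every node
    for node in $(cat ./hosts); do
        echo "=== $node ==="
        ssh "$node" 'ibv_devinfo -v; ofed_info -s'
    done > ib_report.txt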

On Sat, Jun 7, 2014 at 12:53 AM, Tim Miller <btamiller_at_[hidden]> wrote:

> Hi Josh,
>
> I asked one of our more advanced users to add the "-mca btl_openib_if_include
> mlx4_0:1" argument to his job script. Unfortunately, the same error
> occurred as before.
>
> We'll keep digging on our end; if you have any other suggestions, please
> let us know.
>
> Tim
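
A sketch of the kind of launch line that argument goes on (the hostfile, process count, and application name are placeholders, and the explicit btl list is only illustrative):

    # Restrict the openib BTL to port 1 of the mlx4_0 HCA
    mpirun -np 64 --hostfile ./hosts \
           -mca btl openib,self,sm \
           -mca btl_openib_if_include mlx4_0:1 \
           ./my_mpi_app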
>
>
> On Thu, Jun 5, 2014 at 7:32 PM, Tim Miller <btamiller_at_[hidden]> wrote:
>
>> Hi Josh,
>>
>> Thanks for attempting to sort this out. In answer to your questions:
>>
>> 1. Node allocation is done by TORQUE; however, we don't use the TM API to
>> launch jobs (long story). Instead, we just pass a hostfile to mpirun, and
>> mpirun uses the ssh launcher to communicate with the remote nodes and
>> launch the processes there (see the sketch after this list).
>> 2. We have only one port per HCA (the HCA silicon is integrated with the
>> motherboard on most of our nodes, including all that have this issue). They
>> are all configured to use InfiniBand (no IPoIB or other protocols).
>> 3. No, we don't explicitly ask for a device port pair. We will try your
>> suggestion and report back.
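
A sketch of the launch style described in point 1, assuming the node list comes from TORQUE's standard $PBS_NODEFILE and the application name is a placeholder:

    # Inside the TORQUE job script: hand the allocated nodes to mpirun
    # and force the ssh-based launcher (rsh PLM) instead of the TM API
    mpirun --hostfile $PBS_NODEFILE -mca plm rsh ./my_mpi_app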
>>
>> Thanks again!
>>
>> Tim
>>
>>
>> On Thu, Jun 5, 2014 at 2:22 PM, Joshua Ladd <jladd.mlnx_at_[hidden]> wrote:
>>
>>> Strange indeed. This info (remote adapter info) is passed around in the
>>> modex and the struct is locally populated during add procs.
>>>
>>> 1. How do you launch jobs? Mpirun, srun, or something else?
>>> 2. How many active ports do you have on each HCA? Are they all
>>> configured to use IB?
>>> 3. Do you explicitly ask for a device:port pair with the "if include"
>>> mca param? If not, can you please add "-mca btl_openib_if_include mlx4_0:1"
>>> (assuming you have a ConnectX-3 HCA and port 1 is configured to run over
>>> IB.)
>>>
>>> Josh
>>>
>>>
>>> On Wed, Jun 4, 2014 at 12:47 PM, Tim Miller <btamiller_at_[hidden]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'd like to revive this thread, since I am still periodically getting
>>>> errors of this type. I have built 1.8.1 with --enable-debug and run with
>>>> -mca btl_openib_verbose 10. Unfortunately, this doesn't seem to provide any
>>>> additional information that I can find useful. I've gone ahead and attached
>>>> a dump of the output under 1.8.1. The key lines are:
>>>>
>>>>
>>>> --------------------------------------------------------------------------
>>>> Open MPI detected two different OpenFabrics transport types in the same
>>>> Infiniband network.
>>>> Such mixed network trasport configuration is not supported by Open MPI.
>>>>
>>>> Local host: w1
>>>> Local adapter: mlx4_0 (vendor 0x2c9, part ID 26428)
>>>> Local transport type: MCA_BTL_OPENIB_TRANSPORT_IB
>>>>
>>>> Remote host: w16
>>>> Remote Adapter: (vendor 0x2c9, part ID 26428)
>>>> Remote transport type: MCA_BTL_OPENIB_TRANSPORT_UNKNOWN
>>>>
>>>> -------------------------------------------------------------------------
>>>>
>>>> Note that the vendor and part IDs are the same. If I immediately run on
>>>> the same two nodes using MVAPICH2, everything is fine.
>>>>
>>>> I'm really very befuddled by this. OpenMPI sees that the two cards are
>>>> the same and made by the same vendor, yet it thinks the transport types are
>>>> different (and one is unknown). I'm hoping someone with some experience
>>>> with how the OpenIB BTL works can shed some light on this problem...
>>>>
>>>> Tim
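
For reference, the debug build and verbose run described at the top of this message would look roughly like the following (install prefix, process count, hostfile, and application name are placeholders):

    # Build Open MPI 1.8.1 with debugging enabled
    ./configure --prefix=$HOME/ompi-1.8.1-debug --enable-debug
    make -j8 && make install

    # Run with verbose output from the openib BTL
    mpirun -np 16 --hostfile ./hosts \
           -mca btl_openib_verbose 10 ./my_mpi_app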
>>>>
>>>>
>>>> On Fri, May 9, 2014 at 7:39 PM, Joshua Ladd <jladd.mlnx_at_[hidden]>
>>>> wrote:
>>>>
>>>>>
>>>>> Just wondering if you've tried with the latest stable OMPI, 1.8.1? I'm
>>>>> wondering if this is an issue with the OOB. If you have a debug build, you
>>>>> can run -mca btl_openib_verbose 10
>>>>>
>>>>> Josh
>>>>>
>>>>>
>>>>> On Fri, May 9, 2014 at 6:26 PM, Joshua Ladd <jladd.mlnx_at_[hidden]>
>>>>> wrote:
>>>>>
>>>>>> Hi, Tim
>>>>>>
>>>>>> Run "ibstat" on each host:
>>>>>>
>>>>>> 1. Make sure the adapters are alive and active.
>>>>>>
>>>>>> 2. Look at the Link Layer settings for host w34. Does it match host
>>>>>> w4's?
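
A minimal way to run that check on both hosts (the node names come from the error report; the fields shown are standard ibstat output):

    # Compare port state and link layer on the two hosts
    for node in w4 w34; do
        echo "=== $node ==="
        ssh "$node" ibstat | grep -E 'State:|Physical state:|Link layer:'
    done
    # A healthy InfiniBand port should report:
    #   State: Active
    #   Physical state: LinkUp
    #   Link layer: InfiniBand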
>>>>>>
>>>>>>
>>>>>> Josh
>>>>>>
>>>>>>
>>>>>> On Fri, May 9, 2014 at 1:18 PM, Tim Miller <btamiller_at_[hidden]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> We're using OpenMPI 1.7.3 with Mellanox ConnectX InfiniBand
>>>>>>> adapters, and periodically our jobs abort at start-up with the following
>>>>>>> error:
>>>>>>>
>>>>>>> ===
>>>>>>> Open MPI detected two different OpenFabrics transport types in the
>>>>>>> same Infiniband network.
>>>>>>> Such mixed network trasport configuration is not supported by Open
>>>>>>> MPI.
>>>>>>>
>>>>>>> Local host: w4
>>>>>>> Local adapter: mlx4_0 (vendor 0x2c9, part ID 26428)
>>>>>>> Local transport type: MCA_BTL_OPENIB_TRANSPORT_IB
>>>>>>>
>>>>>>> Remote host: w34
>>>>>>> Remote Adapter: (vendor 0x2c9, part ID 26428)
>>>>>>> Remote transport type: MCA_BTL_OPENIB_TRANSPORT_UNKNOWN
>>>>>>> ===
>>>>>>>
>>>>>>> I've done a bit of googling and not found very much. We do not see
>>>>>>> this issue when we run with MVAPICH2 on the same sets of nodes.
>>>>>>>
>>>>>>> Any advice or thoughts would be very welcome, as I am stumped by
>>>>>>> what causes this. The nodes are all running Scientific Linux 6 with
>>>>>>> Mellanox drivers installed via the SL-provided RPMs.
>>>>>>>
>>>>>>> Tim