Subject: Re: [OMPI users] OPENIB unknown transport errors
From: Tim Miller (btamiller_at_[hidden])
Date: 2014-06-12 17:53:47


Aha ... looking at "ibv_devinfo -v" got me my first concrete hint of what's
going on. On a node that's working fine (w2), under port 1 there is a line:

LinkLayer: InfiniBand

On a node that is having trouble (w3), that line is not present. The
question is why this inconsistency occurs.
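
For anyone else chasing this, a quick way to compare the two nodes
(password-less ssh to both is assumed here) is something like:

  for h in w2 w3; do
    echo "== $h =="
    ssh $h 'ibv_devinfo -v | grep -iE "link|state"'
  done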

I don't seem to have ofed_info installed on my system -- not sure what
magical package Red Hat decided to hide that in. The InfiniBand stack I am
running is stock with our version of Scientific Linux (6.2). I am beginning
to wonder if this is a bug in the Red Hat/SL-provided InfiniBand stack.
I'll do some more poking, but at least now I've got something semi-solid to
poke at. Thanks for all of your help; I've attached the results of
"ibv_devinfo -v" for both systems, so if you see anything else that jumps
out at you, please let me know.

Tim

On Sat, Jun 7, 2014 at 2:21 AM, Mike Dubman <miked_at_[hidden]>
wrote:

> could you please attach output of "ibv_devinfo -v" and "ofed_info -s"
> Thx
>
>
> On Sat, Jun 7, 2014 at 12:53 AM, Tim Miller <btamiller_at_[hidden]> wrote:
>
>> Hi Josh,
>>
>> I asked one of our more advanced users to add the "-mca btl_openib_if_include
>> mlx4_0:1" argument to his job script. Unfortunately, the same error
>> occurred as before.
>>
>> We'll keep digging on our end; if you have any other suggestions, please
>> let us know.
>>
>> Tim
>>
>>
>> On Thu, Jun 5, 2014 at 7:32 PM, Tim Miller <btamiller_at_[hidden]> wrote:
>>
>>> Hi Josh,
>>>
>>> Thanks for attempting to sort this out. In answer to your questions:
>>>
>>> 1. Node allocation is done by TORQUE; however, we don't use the TM API to
>>> launch jobs (long story). Instead, we just pass a hostfile to mpirun, and
>>> mpirun uses the ssh launcher to communicate with the remote nodes and
>>> launch the processes there (a minimal example follows this list).
>>> 2. We have only one port per HCA (the HCA silicon is integrated with the
>>> motherboard on most of our nodes, including all that have this issue). They
>>> are all configured to use InfiniBand (no IPoIB or other protocols).
>>> 3. No, we don't explicitly ask for a device port pair. We will try your
>>> suggestion and report back.
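>>>
>>> For reference on #1, a typical launch on our side looks roughly like the
>>> following (node names, slot counts, and the application binary are just
>>> placeholders):
>>>
>>>   # hosts.txt -- one line per node
>>>   #   w1 slots=8
>>>   #   w16 slots=8
>>>   mpirun -np 16 --hostfile hosts.txt ./my_app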
>>>
>>> Thanks again!
>>>
>>> Tim
>>>
>>>
>>> On Thu, Jun 5, 2014 at 2:22 PM, Joshua Ladd <jladd.mlnx_at_[hidden]>
>>> wrote:
>>>
>>>> Strange indeed. This info (remote adapter info) is passed around in the
>>>> modex and the struct is locally populated during add procs.
>>>>
>>>> 1. How do you launch jobs? Mpirun, srun, or something else?
>>>> 2. How many active ports do you have on each HCA? Are they all
>>>> configured to use IB?
>>>> 3. Do you explicitly ask for a device:port pair with the "if include"
>>>> mca param? If not, can you please add "-mca btl_openib_if_include mlx4_0:1"
>>>> (assuming you have a ConnectX-3 HCA and port 1 is configured to run over
>>>> IB.)
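>>>>
>>>> For instance (process count, hostfile, and binary below are just
>>>> placeholders), something like:
>>>>
>>>>   mpirun -np 16 --hostfile hosts.txt \
>>>>       -mca btl_openib_if_include mlx4_0:1 ./my_app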
>>>>
>>>> Josh
>>>>
>>>>
>>>> On Wed, Jun 4, 2014 at 12:47 PM, Tim Miller <btamiller_at_[hidden]>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'd like to revive this thread, since I am still periodically getting
>>>>> errors of this type. I have built 1.8.1 with --enable-debug and run with
>>>>> -mca btl_openib_verbose 10. Unfortunately, this doesn't seem to provide any
>>>>> additional information that I can find useful. I've gone ahead and attached
>>>>> a dump of the output under 1.8.1. The key lines are:
>>>>>
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> Open MPI detected two different OpenFabrics transport types in the
>>>>> same Infiniband network.
>>>>> Such mixed network trasport configuration is not supported by Open MPI.
>>>>>
>>>>> Local host: w1
>>>>> Local adapter: mlx4_0 (vendor 0x2c9, part ID 26428)
>>>>> Local transport type: MCA_BTL_OPENIB_TRANSPORT_IB
>>>>>
>>>>> Remote host: w16
>>>>> Remote Adapter: (vendor 0x2c9, part ID 26428)
>>>>> Remote transport type: MCA_BTL_OPENIB_TRANSPORT_UNKNOWN
>>>>>
>>>>> -------------------------------------------------------------------------
>>>>>
>>>>> Note that the vendor and part IDs are the same. If I immediately run
>>>>> on the same two nodes using MVAPICH2, everything is fine.
>>>>>
>>>>> I'm really very befuddled by this. OpenMPI sees that the two cards are
>>>>> the same and made by the same vendor, yet it thinks the transport types are
>>>>> different (and one is unknown). I'm hoping someone with some experience
>>>>> with how the OpenIB BTL works can shed some light on this problem...
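>>>>>
>>>>> For reference, the debug build and run were along these lines (the
>>>>> install prefix, process count, and hostfile are placeholders):
>>>>>
>>>>>   ./configure --prefix=/opt/openmpi-1.8.1-dbg --enable-debug
>>>>>   make && make install
>>>>>   mpirun -np 16 --hostfile hosts.txt \
>>>>>       -mca btl_openib_verbose 10 ./my_app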
>>>>>
>>>>> Tim
>>>>>
>>>>>
>>>>> On Fri, May 9, 2014 at 7:39 PM, Joshua Ladd <jladd.mlnx_at_[hidden]>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> Just wondering if you've tried with the latest stable OMPI, 1.8.1?
>>>>>> I'm wondering if this is an issue with the OOB. If you have a debug build,
>>>>>> you can run -mca btl_openib_verbose 10
>>>>>>
>>>>>> Josh
>>>>>>
>>>>>>
>>>>>> On Fri, May 9, 2014 at 6:26 PM, Joshua Ladd <jladd.mlnx_at_[hidden]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi, Tim
>>>>>>>
>>>>>>> Run "ibstat" on each host:
>>>>>>>
>>>>>>> 1. Make sure the adapters are alive and active.
>>>>>>>
>>>>>>> 2. Look at the Link Layer setting for host w34. Does it match host
>>>>>>> w4's?
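>>>>>>>
>>>>>>> For example, something along these lines on each node (w4 and w34 are
>>>>>>> the hosts from your report; ssh access to both is assumed):
>>>>>>>
>>>>>>>   for h in w4 w34; do
>>>>>>>     echo "== $h =="
>>>>>>>     ssh $h 'ibstat | grep -E "State|Physical state|Link layer"'
>>>>>>>   done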
>>>>>>>
>>>>>>>
>>>>>>> Josh
>>>>>>>
>>>>>>>
>>>>>>> On Fri, May 9, 2014 at 1:18 PM, Tim Miller <btamiller_at_[hidden]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> We're using OpenMPI 1.7.3 with Mellanox ConnectX InfiniBand
>>>>>>>> adapters, and periodically our jobs abort at start-up with the following
>>>>>>>> error:
>>>>>>>>
>>>>>>>> ===
>>>>>>>> Open MPI detected two different OpenFabrics transport types in the
>>>>>>>> same Infiniband network.
>>>>>>>> Such mixed network trasport configuration is not supported by Open
>>>>>>>> MPI.
>>>>>>>>
>>>>>>>> Local host: w4
>>>>>>>> Local adapter: mlx4_0 (vendor 0x2c9, part ID 26428)
>>>>>>>> Local transport type: MCA_BTL_OPENIB_TRANSPORT_IB
>>>>>>>>
>>>>>>>> Remote host: w34
>>>>>>>> Remote Adapter: (vendor 0x2c9, part ID 26428)
>>>>>>>> Remote transport type: MCA_BTL_OPENIB_TRANSPORT_UNKNOWN
>>>>>>>> ===
>>>>>>>>
>>>>>>>> I've done a bit of googling and not found very much. We do not see
>>>>>>>> this issue when we run with MVAPICH2 on the same sets of nodes.
>>>>>>>>
>>>>>>>> Any advice or thoughts would be very welcome, as I am stumped by
>>>>>>>> what causes this. The nodes are all running Scientific Linux 6 with
>>>>>>>> Mellanox drivers installed via the SL-provided RPMs.
>>>>>>>>
>>>>>>>> Tim