On 08-Jul-11 7:59 PM, Bill Johnstone wrote:
> Hello, and thanks for the reply.
> ----- Original Message -----
>> From: Jeff Squyres <jsquyres_at_[hidden]>
>> Sent: Thursday, July 7, 2011 5:14 PM
>> Subject: Re: [OMPI users] InfiniBand, different OpenFabrics transport types
>> On Jun 28, 2011, at 1:46 PM, Bill Johnstone wrote:
>>> I have a heterogeneous network of InfiniBand-equipped hosts which are all
>> connected to the same backbone switch, an older SDR 10 Gb/s unit.
>>> One set of nodes uses the Mellanox "ib_mthca" driver, while the
>> other uses the "mlx4" driver.
>>> This is on Linux 2.6.32, with Open MPI 1.5.3 .
>>> When I run Open MPI across these node types, I get an error message of the form:
>>> Open MPI detected two different OpenFabrics transport types in the same
>> Infiniband network.
>>> Such mixed network transport configuration is not supported by Open MPI.
>>> Local host: compute-chassis-1-node-01
>>> Local adapter: mthca0 (vendor 0x5ad, part ID 25208)
>>> Local transport type: MCA_BTL_OPENIB_TRANSPORT_UNKNOWN
>> Wow, that's cool ("UNKNOWN"). Are you using an old version of
>> OFED or something?
> No, it's a clean local build of OFED 1.5.3 packages, but I don't have the full complement of OFED packages installed, since our setup is not using IPoIB, SDP, etc.
> ibdiagnet and all the usual suspects work as expected, and I'm able to do large-scale Open MPI runs just fine, as long as I don't mix the two Mellanox HCA types in one job.
>> Mellanox -- how can this happen?
>>> Remote host: compute-chassis-3-node-01
>>> Remote Adapter: (vendor 0x2c9, part ID 26428)
>>> Remote transport type: MCA_BTL_OPENIB_TRANSPORT_IB
>>> Two questions:
>>> 1. Why is this occurring if both adapters have all the OpenIB software set
>> up? Is it because Open MPI is trying to use functionality such as ConnectX with
>> the newer hardware, which is incompatible with older hardware, or is it
>> something more mundane?
>> It's basically a mismatch of IB capabilities -- Open MPI is trying to use
>> more advanced features in some nodes and not in others.
> I also tried looking in the adapter-specific settings in the .ini file under /etc, but the only difference I found was in MTU, and I think that's configured on the switch.
>>> 2. How can I use IB amongst these heterogeneous nodes?
>> Mellanox will need to answer this question... It might be possible, but
>> I don't know how offhand. The first step is to figure out why you're
>> getting TRANSPORT_UNKNOWN on the one node.
> OK, please let me know what other things to try or what other info I can provide.
I'll check the MCA_BTL_OPENIB_TRANSPORT_UNKNOWN thing and get back to you.
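In the meantime, it would help to see what the verbs layer itself reports on each node. A sketch (device names taken from your error output; adjust if ibv_devinfo lists them differently):

```shell
# On the older node: ask libibverbs what it thinks the device and
# transport are. A healthy IB HCA should report
# "transport: InfiniBand (0)".
ibv_devinfo -d mthca0 | grep -E 'hca_id|transport|fw_ver|vendor_id'

# Same check on the newer ConnectX node, for comparison:
ibv_devinfo -d mlx4_0 | grep -E 'hca_id|transport|fw_ver|vendor_id'
```

If the mthca side doesn't print "transport: InfiniBand (0)", that would point at the driver/firmware layer rather than Open MPI.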
One question though, just to make sure we're on the same page: so the jobs do run OK on
the older HCAs, as long as they run *only* on the older HCAs, right?
Please make sure the jobs are using only IB, by running with the "--mca btl openib,self" parameters.
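For example (hostnames and the application name are placeholders, not from your cluster):

```shell
# Restrict Open MPI to the openib BTL (plus "self" for loopback) so a
# failing IB path can't silently fall back to TCP. Placeholder hosts
# and binary; substitute your own.
mpirun --mca btl openib,self -np 2 -host old-hca-node,old-hca-node2 ./your_mpi_app
```

With TCP excluded this way, a run that only works on the older HCAs confirms the job was really going over IB there.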