Ahhh, that's the piece I was missing. I've been trying to debug everything I could think of related to 'btl', and was completely unaware that 'mtl' was also a transport.
If I run a job using --mca mtl ^psm, it does indeed run properly across all of my nodes. (Whether or not that's the 'right' thing to do is yet to be determined.)
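For the record, the full invocation I'm using looks roughly like the following (the hostfile name and application are placeholders, not my actual job):

```shell
# Disable the PSM MTL so the QLogic nodes negotiate the openib BTL like
# the Mellanox nodes do, instead of trying (and failing) to use PSM
# against peers that don't speak it. Hostfile/app names are examples only.
mpirun --mca mtl ^psm -hostfile mixed_ib_hosts -np 16 ./my_mpi_app
```
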
Thanks for your help!
From: users [mailto:users-bounces_at_[hidden]] On Behalf Of Dave Love
Sent: Tuesday, October 15, 2013 10:16 AM
To: Open MPI Users
Subject: Re: [OMPI users] Need help running jobs across different IB vendors
"Kevin M. Hildebrand" <kevin_at_[hidden]> writes:
> Hi, I'm trying to run an OpenMPI 1.6.5 job across a set of nodes, some
> with Mellanox cards and some with Qlogic cards.
Maybe you shouldn't... (I'm blessed in one cluster with three somewhat
incompatible types of QLogic card and a set of Mellanox ones, but
they're in separate islands, apart from the two different SDR ones.)
> I'm getting errors indicating "At least one pair of MPI processes are unable to reach each other for MPI communications". As far as I can tell all of the nodes are properly configured and able to reach each other, via IP and non-IP connections.
> I've also discovered that even if I turn off the IB transport via "--mca btl tcp,self" I'm still getting the same issue.
> The test works fine if I run it confined to hosts with identical IB cards.
> I'd appreciate some assistance in figuring out what I'm doing wrong.
I assume the QLogic cards are using PSM. You'd need to force them to
use openib with something like --mca mtl ^psm and make sure they have
the ipathverbs library available. You probably won't like the resulting
performance -- users here noticed when one set fell back to openib from PSM.