This page is part of a frozen web archive of this mailing list.
You can still navigate around this archive, but know that no new mails
have been added to it since July of 2016.
Ahhh, that's the piece I was missing. I've been trying to debug everything I could think of related to 'btl', and was completely unaware that 'mtl' was also a transport.
If I run a job using --mca mtl ^psm, it does indeed run properly across all of my nodes. (Whether or not that's the 'right' thing to do is yet to be determined.)
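For anyone finding this thread later, here is a minimal sketch of the kind of command line that worked. The process count, hostfile name, and application path are hypothetical placeholders, not values from my cluster; only the `--mca mtl ^psm` part comes from this thread. The script just assembles and prints the command so it can be reviewed before running it by hand.

```shell
#!/bin/sh
# Sketch only: NP, HOSTFILE and APP are hypothetical placeholders.
NP=16
HOSTFILE="hosts.txt"
APP="./my_mpi_app"

# Exclude the PSM MTL so the QLogic nodes do not select it.
# The caret is quoted because some interactive shells treat it specially.
MCA_ARGS='--mca mtl ^psm'

# Assemble the full mpirun command line.
CMD="mpirun -np $NP --hostfile $HOSTFILE $MCA_ARGS $APP"

# Print the command for review instead of launching it.
echo "$CMD"
```

My understanding, which I have not verified beyond this one test, is that when the PSM MTL is selected the BTL settings are bypassed, which would explain why "--mca btl tcp,self" on its own made no difference.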
Thanks for your help!
From: users [mailto:users-bounces_at_[hidden]] On Behalf Of Dave Love
Sent: Tuesday, October 15, 2013 10:16 AM
To: Open MPI Users
Subject: Re: [OMPI users] Need help running jobs across different IB vendors
"Kevin M. Hildebrand" <kevin_at_[hidden]> writes:
> Hi, I'm trying to run an OpenMPI 1.6.5 job across a set of nodes, some
> with Mellanox cards and some with Qlogic cards.
Maybe you shouldn't... (I'm blessed in one cluster with three somewhat
incompatible types of QLogic card and a set of Mellanox ones, but
they're in separate islands, apart from the two different SDR ones.)
> I'm getting errors indicating "At least one pair of MPI processes are unable to reach each other for MPI communications". As far as I can tell all of the nodes are properly configured and able to reach each other, via IP and non-IP connections.
> I've also discovered that even if I turn off the IB transport via "--mca btl tcp,self" I'm still getting the same issue.
> The test works fine if I run it confined to hosts with identical IB cards.
> I'd appreciate some assistance in figuring out what I'm doing wrong.
I assume the QLogic cards are using PSM. You'd need to force them to
use openib with something like --mca mtl ^psm and make sure they have
the ipathverbs library available. You probably won't like the resulting
performance -- users here noticed when one set fell back to openib from