Hmm, that is the way I expected it to work as well -
we see the warnings too, but they are closely followed by
the errors (I've tried both 1.2.5 and a recent 1.3
snapshot, with the same behavior). You don't have the
mx driver loaded on the nodes that do not have a myrinet
card, do you? Our mx is a touch behind yours (1.2.3),
but I agree that it appears to be something in the process
startup that is at fault, so it doesn't seem likely that
the mx version is to blame (perhaps just the fact that it
is not installed on those nodes?).
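To check that question directly, something like the following could be run against each node; the host names here are placeholders, and this assumes the MX driver shows up as a kernel module named "mx" (a sketch, not a definitive diagnostic):

```shell
# Check each node for a loaded mx kernel module.
# node01/node02 are hypothetical host names - substitute your own.
for host in node01 node02; do
    echo "== $host =="
    ssh "$host" 'lsmod | grep -i "^mx" || echo "mx module not loaded"'
done
```

A node that prints "mx module not loaded" but still has libmx installed (or vice versa) would be a useful data point for narrowing down the startup failure.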
On Wed, 16 Jan 2008, 8mj6tc902_at_[hidden] wrote:
> We also have a mixed myrinet/ip cluster, and maybe I'm missing some
> nuance of your configuration, but openmpi seems to work fine for me "as
> is" with no --mca options across mixed nodes (there's a bunch of
> warnings at the beginning where the non-mx nodes realize they don't have
> myrinet cards and the mx nodes realize they can't talk mx to the non-mx
> nodes, but everything completes fine, so I assumed OpenMPI was working
> out the transport details on its own (and was quite pleased
> about that)).
> I just did a quick test to confirm that it is in fact still using mx in
> that situation, and it is. I'm running OpenMPI 1.2.4 and MX 1.2.3.
> It sounds to me, based on those "PML add procs failed" messages, that
> OpenMPI is dying on startup on the non-mx nodes unless you explicitly
> disable mx at runtime (perhaps because they're expecting the mx library
> to be there, but it's not?).
> users-request-at-open-mpi.org wrote:
>> Date: Tue, 15 Jan 2008 10:25:00 -0500 (EST)
>> From: M D Jones <jonesm_at_[hidden]>
>> Subject: Re: [OMPI users] mixed myrinet/non-myrinet nodes
>> To: Open MPI Users <users_at_[hidden]>
>> Message-ID: <Pine.LNX.4.64.0801151018430.18528_at_[hidden]>
>> Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
>> Hmm, that combination seems to hang on me - but
>> '--mca pml ob1 --mca btl ^mx' does indeed do the trick.
>> Many thanks!
>> On Tue, 15 Jan 2008, George Bosilca wrote:
>>> This case actually works. We ran into it a few days ago, when we discovered
>>> that one of the compute nodes in a cluster didn't get its Myrinet card
>>> installed properly ... The performance was horrible, but the application ran
>>> to completion.
>>> You will have to use the following flags: --mca pml ob1 --mca btl mx,tcp,self
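Putting those flags into a full command line, an invocation might look like the following; the process count, host file name, and program name are placeholders, not part of the original advice:

```shell
# Force the ob1 PML and allow only the mx, tcp, and self (loopback) BTLs,
# so ob1 can fall back to tcp for pairs of processes without a working
# MX endpoint. hosts.txt and ./my_mpi_app are hypothetical.
mpirun --mca pml ob1 --mca btl mx,tcp,self \
       -np 8 -hostfile hosts.txt ./my_mpi_app
```

The alternative that worked earlier in the thread, `--mca pml ob1 --mca btl ^mx`, instead excludes the mx BTL entirely, so all traffic goes over tcp even between Myrinet-equipped nodes.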