Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] mixed myrinet/non-myrinet nodes
From: 8mj6tc902_at_[hidden]
Date: 2008-01-16 02:20:50


> Subject: Re: [OMPI users] mixed myrinet/non-myrinet nodes
> From: M D Jones (jonesm_at_[hidden])
> Date: 2008-01-15 14:07:19
> Hmm, that is the way that I expected it to work as well -
> we see the warnings also, but closely followed by the
> errors (I've been trying both 1.2.5 and a recent 1.3
> snapshot with the same behavior). You don't have the
> mx driver loaded on the nodes that do not have a myrinet
> card, do you?

Well, the driver isn't "loaded" (i.e., the kernel module isn't loaded),
but the library (libmyriexpress.so) is available. If that library isn't
available, Open MPI will probably fail when it tries to call the mx
functions (even if only to find that there's no Myrinet card available).
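
For anyone who wants to compare their own nodes, here is a rough sketch of how one might check the two things separately: whether the MX user-space library is present, and whether a kernel module is loaded. The function names are made up for illustration, and the search directories and the module name (`mx_driver`) in the usage note are assumptions about a typical install, not something confirmed in this thread.

```shell
#!/bin/sh
# Sketch only: helper names, directories, and module name are assumptions.

# Succeeds if libmyriexpress.so (with any version suffix) exists in one of
# the directories passed as arguments.
mx_library_available() {
    for dir in "$@"; do
        for f in "$dir"/libmyriexpress.so*; do
            # An unmatched glob stays literal, so test that the file exists.
            [ -e "$f" ] && return 0
        done
    done
    return 1
}

# Succeeds if the named kernel module appears in a /proc/modules-style file.
# $1: module name (assumed)   $2: file to read (defaults to /proc/modules)
mx_module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' \
        "${2:-/proc/modules}" 2>/dev/null
}
```

On a node you could then run something like `mx_library_available /usr/lib /usr/lib64 /opt/mx/lib && echo "library present"` and `mx_module_loaded mx_driver || echo "module not loaded"`, substituting the paths and module name that match your installation.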

> Our mx is a touch behind yours (1.2.3),
> but I agree that it appears to be something in the process
> startup that is at fault, so it doesn't seem likely that
> the mx version is to blame (perhaps just the fact that it
> is not installed on those nodes?).
>
> Matt
>
> On Wed, 16 Jan 2008, 8mj6tc902_at_[hidden] wrote:
>
>> We also have a mixed myrinet/ip cluster, and maybe I'm missing some
>> nuance of your configuration, but openmpi seems to work fine for me "as
>> is" with no --mca options across mixed nodes (there's a bunch of
>> warnings at the beginning where the non-mx nodes realize they don't have
>> myrinet cards and the mx nodes realize they can't talk mx to the non-mx
>> nodes, but everything completes fine, so I assumed OpenMPI was working
>> out the transport details on its own (and was quite pleased
>> about that)).
>>
>> I just did a quick test to confirm that it is in fact still using mx in
>> that situation, and it is. I'm running OpenMPI 1.2.4 and MX 1.2.3.
>>
>> It sounds to me based on those "PML add procs failed" messages that
>> OpenMPI is dying on start up on the non-mx nodes unless you explicitly
>> disable mx at runtime (perhaps because they're expecting the mx library
>> to be there, but it's not?)
>>
>> users-request-at-open-mpi.org |openmpi-users/Allow| wrote:
>>> Date: Tue, 15 Jan 2008 10:25:00 -0500 (EST)
>>> From: M D Jones <jonesm_at_[hidden]>
>>> Subject: Re: [OMPI users] mixed myrinet/non-myrinet nodes
>>> To: Open MPI Users <users_at_[hidden]>
>>> Message-ID: <Pine.LNX.4.64.0801151018430.18528_at_[hidden]>
>>> Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
>>>
>>>
>>> Hmm, that combination seems to hang on me - but
>>> '--mca pml ob1 --mca btl ^mx' does indeed do the trick.
>>> Many thanks!
>>>
>>> Matt
>>>
>>> On Tue, 15 Jan 2008, George Bosilca wrote:
>>>
>>>> This case actually works. We ran into it a few days ago, when we discovered
>>>> that one of the compute nodes in a cluster didn't get its Myrinet card
>>>> installed properly ... The performance was horrible but the application ran
>>>> to completion.
>>>>
>>>> You will have to use the following flags: --mca pml ob1 --mca btl mx,tcp,self
>>>>
>>
>>
>>

-- 
--Kris
叶ってしまう夢は本当の夢と言えん。
[A dream that comes true can't really be called a dream.]