Open MPI User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2006-12-04 08:35:17


One thing that I note is that you are using a fairly ancient
development version -- the development snapshots tend to change
pretty quickly (usually nightly). The version you cited is
1.2a1r10111 (which I think is about 5-6 months old), but the current
development head is r12737.

Indeed, we've had some fairly important run-time changes over the
past 2-3 weeks. Can you update to a more recent copy and try again?
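
For what it's worth, the gap between the two revisions can be read straight off the version strings. A minimal Python sketch, assuming a snapshot version always ends in `r<revision>` as in the two values above; the helper names `snapshot_revision` and `revisions_behind` are made up for illustration and are not part of Open MPI:

```python
import re


def snapshot_revision(version):
    """Extract the trailing revision number from a snapshot version string.

    Assumes the string ends in 'r<digits>', e.g. '1.2a1r10111'.
    """
    match = re.search(r"r(\d+)$", version)
    if match is None:
        raise ValueError(f"no trailing revision in {version!r}")
    return int(match.group(1))


def revisions_behind(installed, head_revision):
    """Number of repository revisions between the installed snapshot and the head."""
    return head_revision - snapshot_revision(installed)


print(revisions_behind("1.2a1r10111", 12737))  # prints 2626
```

The installed version string can be taken from the `ompi_info` output.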

On Dec 4, 2006, at 3:44 AM, Jens Klostermann wrote:

>> What the system is saying is that (a) you don't have transparent ssh
>> authority on one or more of your nodes, and/or (b) the system was
>> unable to locate the Open MPI code libraries on the remote node. For
>> the first problem, please see the FAQ at:
>>
>> http://www.open-mpi.org/faq/?category=rsh#ssh-keys
>>
>> Once you have that fixed, then you should check the remote nodes to
>> ensure that the Open MPI code libraries are available - you may need
>> to provide a prefix directory to mpirun to tell us where they are.
>> Please see the FAQ at:
>>
>> http://www.open-mpi.org/faq/?category=running
>>
>> for some advice in that area.
>>
>>
>> Hope that helps
>> Ralph
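
Since the failure is intermittent, it may be worth re-running check (a) against every node right before a launch. A rough Python sketch, assuming OpenSSH's `ssh` client is on the PATH; the function names and the host names in the usage line are hypothetical:

```python
import subprocess


def ssh_check_command(host, remote_cmd="true"):
    """Build an ssh command line that fails instead of prompting for a password."""
    # BatchMode=yes disables password prompts; ConnectTimeout bounds the wait.
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", host, remote_cmd]


def check_nodes(hosts, remote_cmd="true", run=subprocess.run):
    """Return the hosts on which the non-interactive ssh command failed."""
    failed = []
    for host in hosts:
        result = run(ssh_check_command(host, remote_cmd), capture_output=True)
        if result.returncode != 0:
            failed.append(host)
    return failed


if __name__ == "__main__":
    # Hypothetical host names; replace with the entries from your hostfile.
    print(check_nodes(["node01", "node02"]))
```

Passing something like `remote_cmd="which orted"` (assuming `orted` is the name of the Open MPI daemon in your installation) would also cover point (b), because it fails when a non-interactive login on that node cannot find the Open MPI binaries.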
>
> I think these suggestions, (a) nontransparent ssh authority and (b)
> being unable to locate the Open MPI code libraries on the remote node,
> are not the problem.
> (a) Passwordless ssh is set up and all nodes see the same home!
> (b) The Open MPI code libraries are located in my home, which every
> node sees.
>
> mpirun sometimes works with all cpus/nodes of the cluster, but
> sometimes it won't, and the error described below will occur.
>
>> On 12/1/06 8:17 AM, "Jens Klostermann"
>> <jens.klostermann_at_[hidden]> wrote:
>>
>>
>>> I've got the same problem as described in:
>>> http://www.open-mpi.org/community/lists/users/2006/07/1537.php
>>>
>>> From: Chengwen Chen (chenchengwen_at_[hidden])
>>> Date: 2006-07-04 03:53:26
>>>
>>>
>>>
>>> The problem seems to occur randomly! It occurs more often if I use a
>>> larger number of cpus, but almost never if I use a small number of
>>> cpus. So far my cure for the problem is to kill and restart my
>>> application (mpirun ...) as often as needed until the error no
>>> longer occurs and mpirun runs.
>>>
>>> So is the problem resolved? Can anybody give me a hint?
>>>
>>> I am using an amd64 linux (suse10) cluster with infiniband connection
>>> and openmpi-1.2a1r10111.
>>>
>>> I attach the ompi_info --param all all output, hope it helps.
>>>
>>> Regards Jens
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems