Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] ORTE_ERROR_LOG: Timeout in file
From: jody (jody.xha_at_[hidden])
Date: 2009-04-28 10:55:25


Hi Hugh
You're right, there is no initialization command (like lamboot) you
have to call.

I don't really know why your setup doesn't work, so I'm taking some
more shots in the dark.

Can you do passwordless ssh between any two of your nodes?

Does
 mpirun -np 1 --host nodenameX uptime
work for every X when called from any of your nodes?

Have you tried
   mpirun -np 2 --host nodename1,nodename2 uptime
(i.e. without using the host file)?
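
The checks above could be scripted roughly as follows. This is only a sketch: the node names in NODES are placeholders (assumed, not taken from the actual cluster), and the script merely prints the commands so they can be reviewed before running. BatchMode=yes is a standard OpenSSH option that makes ssh fail instead of prompting for a password, so a node pair that is not truly passwordless shows up as an error rather than a hang.

```shell
#!/bin/sh
# Placeholder node names (assumption): use the exact names from your hostfile.
NODES="nodename1 nodename2 nodename3"

# Print an ssh check for every ordered pair of distinct nodes.
# Each command logs in to $src and, from there, tries to reach $dst.
pairwise_ssh_checks() {
    for src in $NODES; do
        for dst in $NODES; do
            [ "$src" = "$dst" ] && continue
            echo "ssh -o BatchMode=yes $src ssh -o BatchMode=yes $dst uptime"
        done
    done
}

# Print one single-host mpirun check per node.
mpirun_checks() {
    for node in $NODES; do
        echo "mpirun -np 1 --host $node uptime"
    done
}

pairwise_ssh_checks
mpirun_checks
```

Saved under a name of your choosing (say check_nodes.sh, a hypothetical name), the output can be piped to sh to actually run the checks. The mpirun lines only test launching from the node where the script runs, so it would need to be repeated on each node to cover the "from any of your nodes" case.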

Jody

On Tue, Apr 28, 2009 at 4:37 PM, Hugh Dickinson
<h.j.dickinson_at_[hidden]> wrote:
> Hi Jody,
>
> The node names are exactly the same. I wanted to avoid updating the version
> because I'm not the system administrator, and it could take some time before
> it gets done. If it's likely to fix the problem though I'll try it. I'm
> assuming that I don't have to do something analogous to the old "lamboot"
> command to initialise Open MPI on all the nodes. I've seen no documentation
> anywhere that says I should.
>
> Cheers,
>
> Hugh
>
> On 28 Apr 2009, at 15:28, jody wrote:
>
>> Hi Hugh
>>
>> Again, just to make sure, are the hostnames in your host file well-known?
>> I.e. when you say you can do
>>  ssh nodename uptime
>> do you use exactly the same nodename in your host file?
>> (I'm trying to eliminate all non-Open-MPI error sources,
>> because with your setup it should basically work.)
>>
>> One more point to consider is updating to Open-MPI 1.3.
>> I don't think your Open-MPI version is the cause of your trouble,
>> but there have been quite a few changes since v1.2.5.
>>
>> Jody
>>
>> On Tue, Apr 28, 2009 at 3:22 PM, Hugh Dickinson
>> <h.j.dickinson_at_[hidden]> wrote:
>>>
>>> Hi Jody,
>>>
>>> Indeed, all the nodes are running the same version of Open MPI. Perhaps I
>>> was incorrect to describe the cluster as heterogeneous. In fact, all the
>>> nodes run the same operating system (Scientific Linux 5.2); it's only the
>>> hardware that's different, and even then they're all i386 or i686. I'm also
>>> attaching the output of ompi_info --all, as suggested in the mailing list
>>> instructions.
>>>
>>> Cheers,
>>>
>>> Hugh
>>>
>>> Hi Hugh
>>>
>>> Just to make sure:
>>> You have installed Open-MPI on all your nodes?
>>> Same version everywhere?
>>>
>>> Jody
>>>
>>> On Tue, Apr 28, 2009 at 12:57 PM, Hugh Dickinson
>>> <h.j.dickinson_at_[hidden]> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> First of all, let me make it perfectly clear that I'm a complete beginner
>>>> as far as MPI is concerned, so this may well be a trivial problem!
>>>>
>>>> I've tried to set up Open MPI to use SSH to communicate between nodes on a
>>>> heterogeneous cluster. I've set up passwordless SSH and it seems to be
>>>> working fine. For example, by hand I can do:
>>>>
>>>> ssh nodename uptime
>>>>
>>>> and it returns the appropriate information for each node.
>>>> I then tried running a non-MPI program on all the nodes at the same time:
>>>>
>>>> mpirun -np 10 --hostfile hostfile uptime
>>>>
>>>> Where hostfile is a list of the 10 cluster node names with slots=1 after
>>>> each one, i.e.
>>>>
>>>> nodename1 slots=1
>>>> nodename2 slots=1
>>>> etc.
>>>>
>>>> Nothing happens! The process just seems to hang. If I interrupt the process
>>>> with Ctrl-C I get:
>>>>
>>>> "
>>>>
>>>> mpirun: killing job...
>>>>
>>>> [gamma2.phyastcl.dur.ac.uk:18124] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
>>>> [gamma2.phyastcl.dur.ac.uk:18124] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1166
>>>>
>>>> --------------------------------------------------------------------------
>>>> WARNING: mpirun has exited before it received notification that all
>>>> started processes had terminated.  You should double check and ensure
>>>> that there are no runaway processes still executing.
>>>>
>>>> --------------------------------------------------------------------------
>>>>
>>>> "
>>>>
>>>> If, instead of using the hostfile, I specify on the command line the host
>>>> from which I'm running mpirun, e.g.:
>>>>
>>>> mpirun -np 1 --host nodename uptime
>>>>
>>>> then it works (i.e. if it doesn't need to communicate with other nodes).
>>>> Do I need to tell Open MPI it should be using SSH to communicate? If so,
>>>> how do I do this? To be honest, I think it's trying to do so, because
>>>> before I set up passwordless SSH it challenged me for lots of passwords.
>>>>
>>>> I'm running Open MPI 1.2.5 installed with Scientific Linux 5.2. Let me
>>>> reiterate, it's very likely that I've done something stupid, so all
>>>> suggestions are welcome.
>>>>
>>>> Cheers,
>>>>
>>>> Hugh
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>