Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] ORTE_ERROR_LOG: Timeout in file
From: jody (jody.xha_at_[hidden])
Date: 2009-04-28 10:28:16


Hi Hugh

Again, just to make sure, are the hostnames in your host file well-known?
I.e. when you say you can do
  ssh nodename uptime
do you use exactly the same nodename in your host file?
(I'm trying to eliminate all non-Open-MPI error sources,
because with your setup it should basically work.)
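
A minimal sketch of that check, in case it helps: it reads the names out of the
host file and runs the same "ssh nodename uptime" test against each one, so the
names tested are exactly the ones mpirun will see. The function names and the
file name "hostfile" are only illustrative, and it assumes a Python 3
interpreter and passwordless SSH on the node you launch from.

```python
import subprocess


def parse_hostfile(text):
    """Return the host names listed in an Open MPI style hostfile."""
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if line:
            hosts.append(line.split()[0])  # first token is the host name
    return hosts


def check_host(host, timeout=10):
    """Run 'ssh <host> uptime' without prompting; True if it succeeds."""
    # BatchMode=yes makes ssh fail instead of asking for a password.
    cmd = ["ssh", "-o", "BatchMode=yes", host, "uptime"]
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def check_hostfile(path):
    """Print one line per host in the file at 'path' (hypothetical helper)."""
    with open(path) as f:
        for host in parse_hostfile(f.read()):
            print(host, "ok" if check_host(host) else "FAILED")
```

Calling check_hostfile("hostfile") from the directory holding your host file
should print one line per node; any node marked FAILED is worth looking at
before suspecting Open MPI itself.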

One more point to consider is to update to Open-MPI 1.3.
I don't think your Open-MPI version is the cause of your trouble,
but there have been quite a few changes since v1.2.5.

Jody

On Tue, Apr 28, 2009 at 3:22 PM, Hugh Dickinson
<h.j.dickinson_at_[hidden]> wrote:
> Hi Jody,
>
> Indeed, all the nodes are running the same version of Open MPI. Perhaps I
> was incorrect to describe the cluster as heterogeneous. In fact, all the
> nodes run the same operating system (Scientific Linux 5.2), it's only the
> hardware that's different and even then they're all i386 or i686. I'm also
> attaching the output of ompi_info --all, as suggested in the mailing list
> instructions.
>
> Cheers,
>
> Hugh
>
> Hi Hugh
>
> Just to make sure:
> You have installed Open-MPI on all your nodes?
> Same version everywhere?
>
> Jody
>
> On Tue, Apr 28, 2009 at 12:57 PM, Hugh Dickinson
> <h.j.dickinson_at_[hidden]> wrote:
>> Hi all,
>>
>> First of all let me make it perfectly clear that I'm a complete beginner
>> as far as MPI is concerned, so this may well be a trivial problem!
>>
>> I've tried to set up Open MPI to use SSH to communicate between nodes on a
>> heterogeneous cluster. I've set up passwordless SSH and it seems to be
>> working fine. For example by hand I can do:
>>
>> ssh nodename uptime
>>
>> and it returns the appropriate information for each node.
>> I then tried running a non-MPI program on all the nodes at the same time:
>>
>> mpirun -np 10 --hostfile hostfile uptime
>>
>> Where hostfile is a list of the 10 cluster node names with slots=1 after
>> each one, i.e.:
>>
>> nodename1 slots=1
>> nodename2 slots=1
>> etc...
>>
>> Nothing happens! The process just seems to hang. If I interrupt the
>> process with Ctrl-C I get:
>>
>> "
>>
>> mpirun: killing job...
>>
>> [gamma2.phyastcl.dur.ac.uk:18124] [0,0,0] ORTE_ERROR_LOG: Timeout in file
>> base/pls_base_orted_cmds.c at line 275
>> [gamma2.phyastcl.dur.ac.uk:18124] [0,0,0] ORTE_ERROR_LOG: Timeout in file
>> pls_rsh_module.c at line 1166
>> --------------------------------------------------------------------------
>> WARNING: mpirun has exited before it received notification that all
>> started processes had terminated.  You should double check and ensure
>> that there are no runaway processes still executing.
>> --------------------------------------------------------------------------
>>
>> "
>>
>> If, instead of using the hostfile, I specify on the command line the host
>> from which I'm running mpirun, e.g.:
>>
>> mpirun -np 1 --host nodename uptime
>>
>> then it works (i.e. if it doesn't need to communicate with other nodes).
>> Do I need to tell Open MPI it should be using SSH to communicate? If so,
>> how do I do this? To be honest I think it's trying to do so, because before
>> I set up passwordless SSH it challenged me for lots of passwords.
>>
>> I'm running Open MPI 1.2.5 installed with Scientific Linux 5.2. Let me
>> reiterate, it's very likely that I've done something stupid, so all
>> suggestions are welcome.
>>
>> Cheers,
>>
>> Hugh
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>