
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] ORTE_ERROR_LOG: Timeout in file
From: Hugh Dickinson (h.j.dickinson_at_[hidden])
Date: 2009-04-28 09:22:53


Hi Jody,

Indeed, all the nodes are running the same version of Open MPI.
Perhaps I was incorrect to describe the cluster as heterogeneous. In
fact, all the nodes run the same operating system (Scientific Linux
5.2); it's only the hardware that's different, and even then they're
all i386 or i686. I'm also attaching the output of ompi_info --all, as
suggested in the mailing list instructions.
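
For anyone wanting to confirm the "same version everywhere" point from a
single machine, something along the following lines could be used. It is
only a sketch: the function names hostfile_hosts and check_nodes are made
up for illustration, and it assumes the hostfile has the node name as the
first field on each line and that passwordless SSH already works.

```shell
# Print the host names from an Open MPI hostfile: the first field of every
# line that is neither blank nor a comment. (Hypothetical helper name.)
hostfile_hosts() {
    awk 'NF && $1 !~ /^#/ { print $1 }' "$1"
}

# For each host, report where orted resolves and which Open MPI version
# ompi_info prints when run through a non-interactive SSH session.
# (Hypothetical helper name; BatchMode makes ssh fail rather than prompt.)
check_nodes() {
    for host in $(hostfile_hosts "$1"); do
        echo "== $host"
        ssh -n -o BatchMode=yes "$host" 'which orted; ompi_info | grep "Open MPI:"'
    done
}
```

Running check_nodes hostfile should then show, node by node, whether the
non-interactive SSH environment finds the same Open MPI installation. A
node that prints nothing for orted would be worth a closer look.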

Cheers,

Hugh



> Hi Hugh
>
> Just to make sure:
> You have installed Open-MPI on all your nodes?
> Same version everywhere?
>
> Jody
>
>
>
> On Tue, Apr 28, 2009 at 12:57 PM, Hugh Dickinson
> <h.j.dickinson_at_[hidden]> wrote:
> > Hi all,
> >
> > First of all let me make it perfectly clear that I'm a complete beginner
> > as far as MPI is concerned, so this may well be a trivial problem!
> >
> > I've tried to set up Open MPI to use SSH to communicate between nodes on
> > a heterogeneous cluster. I've set up passwordless SSH and it seems to be
> > working fine. For example, by hand I can do:
> >
> > ssh nodename uptime
> >
> > and it returns the appropriate information for each node.
> > I then tried running a non-MPI program on all the nodes at the same time:
> >
> > mpirun -np 10 --hostfile hostfile uptime
> >
> > where hostfile is a list of the 10 cluster node names with slots=1 after
> > each one, i.e.
> >
> > nodename1 slots=1
> > nodename2 slots=1
> > etc...
> >
> > Nothing happens! The process just seems to hang. If I interrupt the
> > process with Ctrl-C I get:
> >
> > "
> >
> > mpirun: killing job...
> >
> > [gamma2.phyastcl.dur.ac.uk:18124] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
> > [gamma2.phyastcl.dur.ac.uk:18124] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1166
> > --------------------------------------------------------------------------
> > WARNING: mpirun has exited before it received notification that all
> > started processes had terminated. You should double check and ensure
> > that there are no runaway processes still executing.
> > --------------------------------------------------------------------------
> >
> > "
> >
> > If, instead of using the hostfile, I specify on the command line the host
> > from which I'm running mpirun, e.g.:
> >
> > mpirun -np 1 --host nodename uptime
> >
> > then it works (i.e. if it doesn't need to communicate with other nodes).
> > Do I need to tell Open MPI it should be using SSH to communicate? If so,
> > how do I do this? To be honest I think it's trying to do so, because
> > before I set up passwordless SSH it challenged me for lots of passwords.
> >
> > I'm running Open MPI 1.2.5 installed with Scientific Linux 5.2. Let me
> > reiterate, it's very likely that I've done something stupid, so all
> > suggestions are welcome.
> >
> > Cheers,
> >
> > Hugh