Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] ORTE_ERROR_LOG: Timeout in file
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-04-28 17:30:19


Okay, that's one small step forward. You can lock that in by setting
the appropriate MCA parameter in one of two ways:

1. Add the following to your default MCA parameter file: btl =
tcp,sm,self (I added the shared memory subsystem, as this will help
with performance). You can see how to do this here:

http://www.open-mpi.org/faq/?category=tuning#setting-mca-params

2. Add OMPI_MCA_btl=tcp,sm,self to the environment in your .cshrc (or
equivalent) file. (A sketch of both approaches is below.)
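
For example, a minimal sketch of both approaches (the file shown is
the per-user parameter file; a system-wide
$prefix/etc/openmpi-mca-params.conf works too):

--
# In ~/.openmpi/mca-params.conf:
btl = tcp,sm,self

# Or in your .cshrc (csh/tcsh syntax):
setenv OMPI_MCA_btl tcp,sm,self
--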

Next, have you looked at the following FAQ:

http://www.open-mpi.org/faq/?category=running#missing-prereqs
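
That FAQ covers, among other things, making sure Open MPI's bin and
lib directories can be found on every node. Assuming an install
prefix of /usr/local/openmpi (just an example), your .cshrc would
need something like:

--
setenv PATH /usr/local/openmpi/bin:$PATH
setenv LD_LIBRARY_PATH /usr/local/openmpi/lib:$LD_LIBRARY_PATH
--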

Are those things all okay? Have you tried providing a complete
absolute path when running mpirun (e.g., using /usr/local/openmpi/bin/
mpirun instead of just mpirun on the cmd line)?

Another thing to try: add --debug-devel to the cmd line and send us
the (probably verbose) output.
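
For example (the hostfile and executable names here are just
placeholders):

--
mpirun --debug-devel -np 2 --hostfile hostfile ./a.out
--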

On Apr 28, 2009, at 3:19 PM, Hugh Dickinson wrote:

> Hi,
>
> Yes, I'm using Ethernet connections. Doing as you suggest removes the
> errors generated by running the small test program, but it still
> doesn't allow programs (including the small test program) to execute
> on any node other than the one launching mpirun. If I try to do
> that, the command hangs until I interrupt it, whereupon it gives the
> same timeout errors. It seems that there must be some problem with
> the setup of my Open MPI installation. Do you agree, and do you have
> any idea what it is? Also, is there a global settings file I can use
> to instruct Open MPI to only ever try Ethernet?
>
> Cheers,
>
> Hugh
>
> On 28 Apr 2009, at 20:12, Ralph Castain wrote:
>
>> In this instance, OMPI is complaining that you are attempting to
>> use Infiniband, but no suitable devices are found.
>>
>> I assume you have Ethernet between your nodes? Can you run this
>> with the following added to your mpirun cmd line:
>>
>> -mca btl tcp,self
>>
>> That will cause OMPI to ignore the Infiniband subsystem and attempt
>> to run via TCP over any available Ethernet.
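>>
>> For example, the full command might look something like this (the
>> hostfile and executable names are placeholders):
>>
>> mpirun -mca btl tcp,self -np 2 --hostfile hostfile ./a.out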
>>
>>
>>
>> On Tue, Apr 28, 2009 at 12:16 PM, Hugh Dickinson <h.j.dickinson_at_[hidden]> wrote:
>> Many thanks for your help nonetheless.
>>
>> Hugh
>>
>>
>> On 28 Apr 2009, at 17:23, jody wrote:
>>
>> Hi Hugh
>>
>> I'm sorry, but I must admit that I have never encountered these
>> messages, and I don't know exactly what causes them.
>>
>> Perhaps one of the developers can give an explanation?
>>
>> Jody
>>
>> On Tue, Apr 28, 2009 at 5:52 PM, Hugh Dickinson
>> <h.j.dickinson_at_[hidden]> wrote:
>> Hi again,
>>
>> I tried a simple MPI C++ program:
>>
>> --
>> #include <iostream>
>> #include <mpi.h>
>>
>> using namespace MPI;
>> using namespace std;
>>
>> int main(int argc, char* argv[]) {
>>   int rank, size;
>>   Init(argc, argv);              // start up the MPI environment
>>   rank = COMM_WORLD.Get_rank();  // this process's rank
>>   size = COMM_WORLD.Get_size();  // total number of processes
>>   cout << "P:" << rank << " out of " << size << endl;
>>   Finalize();                    // shut down MPI
>>   return 0;
>> }
>> --
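>>
>> I compiled and ran it with something like the following (the file
>> name is just illustrative):
>>
>> mpic++ test.cpp -o test
>> mpirun -np 2 --hostfile hostfile ./test
>>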
>> It didn't work over all the nodes; again, the same problem - the
>> system seems to hang. However, by forcing mpirun to use only the
>> node on which I'm launching mpirun, I get some more error messages:
>>
>> --
>> libibverbs: Fatal: couldn't read uverbs ABI version.
>> libibverbs: Fatal: couldn't read uverbs ABI version.
>> --------------------------------------------------------------------------
>> [0,1,0]: OpenIB on host gamma2 was unable to find any HCAs.
>> Another transport will be used instead, although this may result in
>> lower performance.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> [0,1,1]: OpenIB on host gamma2 was unable to find any HCAs.
>> Another transport will be used instead, although this may result in
>> lower performance.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> [0,1,1]: uDAPL on host gamma2 was unable to find any NICs.
>> Another transport will be used instead, although this may result in
>> lower performance.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> [0,1,0]: uDAPL on host gamma2 was unable to find any NICs.
>> Another transport will be used instead, although this may result in
>> lower performance.
>> --------------------------------------------------------------------------
>> --
>>
>> However, as before, the program does work in this special case,
>> and I get:
>> --
>> P:0 out of 2
>> P:1 out of 2
>> --
>>
>> Do these errors indicate a problem with the Open MPI installation?
>>
>> Hugh
>>
>> On 28 Apr 2009, at 16:36, Hugh Dickinson wrote:
>>
>> Hi Jody,
>>
>> I can passwordlessly ssh between all nodes (to and from).
>> Almost none of these mpirun commands work. The only working case is
>> if nodenameX is the node from which you are running the command. I
>> don't know if this gives you extra diagnostic information, but if I
>> explicitly set the wrong prefix (using --prefix), then I get errors
>> from all the nodes telling me the daemon would not start. I don't
>> get these errors normally. It seems to me that the communication is
>> working okay, at least in the outwards direction (and from all
>> nodes). Could this be a problem with forwarding of standard output?
>> If I were to try a simple hello world program, is this more likely
>> to work, or am I just adding another layer of complexity?
>>
>> Cheers,
>>
>> Hugh
>>
>> On 28 Apr 2009, at 15:55, jody wrote:
>>
>> Hi Hugh
>> You're right, there is no initialization command (like lamboot) you
>> have to call.
>>
>> I don't really know why your setup doesn't work, so I'm making some
>> more "blind shots".
>>
>> Can you do passwordless ssh between any two of your nodes?
>>
>> Does
>> mpirun -np 1 --host nodenameX uptime
>> work for every X when called from any of your nodes?
>>
>> Have you tried
>> mpirun -np 2 --host nodename1,nodename2 uptime
>> (i.e. not using the host file)
>>
>> Jody
>>
>> On Tue, Apr 28, 2009 at 4:37 PM, Hugh Dickinson
>> <h.j.dickinson_at_[hidden]> wrote:
>>
>> Hi Jody,
>>
>> The node names are exactly the same. I wanted to avoid updating the
>> version because I'm not the system administrator, and it could take
>> some time before it gets done. If it's likely to fix the problem,
>> though, I'll try it. I'm assuming that I don't have to do something
>> analogous to the old "lamboot" command to initialise Open MPI on all
>> the nodes. I've seen no documentation anywhere that says I should.
>>
>> Cheers,
>>
>> Hugh
>>
>> On 28 Apr 2009, at 15:28, jody wrote:
>>
>> Hi Hugh
>>
>> Again, just to make sure: are the hostnames in your host file
>> well-known? I.e., when you say you can do
>> ssh nodename uptime
>> do you use exactly the same nodename in your host file?
>> (I'm trying to eliminate all non-Open-MPI error sources,
>> because with your setup it should basically work.)
>>
>> One more point to consider is to update to Open-MPI 1.3.
>> I don't think your Open-MPI version is the cause of your trouble,
>> but there have been quite a few changes since v1.2.5.
>>
>> Jody
>>
>> On Tue, Apr 28, 2009 at 3:22 PM, Hugh Dickinson
>> <h.j.dickinson_at_[hidden]> wrote:
>>
>> Hi Jody,
>>
>> Indeed, all the nodes are running the same version of Open MPI.
>> Perhaps I was incorrect to describe the cluster as heterogeneous. In
>> fact, all the nodes run the same operating system (Scientific Linux
>> 5.2); it's only the hardware that's different, and even then they're
>> all i386 or i686. I'm also attaching the output of ompi_info --all,
>> as suggested in the mailing list instructions.
>>
>> Cheers,
>>
>> Hugh
>>
>> Hi Hugh
>>
>> Just to make sure:
>> You have installed Open-MPI on all your nodes?
>> Same version everywhere?
>>
>> Jody
>>
>> On Tue, Apr 28, 2009 at 12:57 PM, Hugh Dickinson
>> <h.j.dickinson_at_[hidden]> wrote:
>>
>> Hi all,
>>
>> First of all, let me make it perfectly clear that I'm a complete
>> beginner as far as MPI is concerned, so this may well be a trivial
>> problem!
>>
>> I've tried to set up Open MPI to use SSH to communicate between
>> nodes on a heterogeneous cluster. I've set up passwordless SSH and
>> it seems to be working fine. For example, by hand I can do:
>>
>> ssh nodename uptime
>>
>> and it returns the appropriate information for each node.
>> I then tried running a non-MPI program on all the nodes at the
>> same time:
>>
>> mpirun -np 10 --hostfile hostfile uptime
>>
>> where hostfile is a list of the 10 cluster node names with slots=1
>> after each one, i.e.:
>>
>> nodename1 slots=1
>> nodename2 slots=1
>> etc...
>>
>> Nothing happens! The process just seems to hang. If I interrupt
>> the process with Ctrl-C, I get:
>>
>> "
>>
>> mpirun: killing job...
>>
>> [gamma2.phyastcl.dur.ac.uk:18124] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
>> [gamma2.phyastcl.dur.ac.uk:18124] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1166
>>
>>
>> --------------------------------------------------------------------------
>> WARNING: mpirun has exited before it received notification that all
>> started processes had terminated. You should double check and ensure
>> that there are no runaway processes still executing.
>>
>>
>> --------------------------------------------------------------------------
>>
>> "
>>
>> If, instead of using the hostfile, I specify on the command line
>> the host from which I'm running mpirun, e.g.:
>>
>> mpirun -np 1 --host nodename uptime
>>
>> then it works (i.e., if it doesn't need to communicate with other
>> nodes). Do I need to tell Open MPI it should be using SSH to
>> communicate? If so, how do I do this? To be honest, I think it's
>> trying to do so, because before I set up passwordless SSH it
>> challenged me for lots of passwords.
>>
>> I'm running Open MPI 1.2.5, installed with Scientific Linux 5.2.
>> Let me reiterate: it's very likely that I've done something stupid,
>> so all suggestions are welcome.
>>
>> Cheers,
>>
>> Hugh
>>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users