
Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] ORTE_ERROR_LOG: Timeout in file
From: Hugh Dickinson (h.j.dickinson_at_[hidden])
Date: 2009-04-28 18:57:32


As far as I can tell, both the PATH and LD_LIBRARY_PATH are set
correctly. I've tried with the full path to the mpirun executable and
using the --prefix command line option. Neither works. The debug
output seems to contain a lot of system-specific information (IPs,
usernames and such), which I'm a little reluctant to share on an open
mailing list, so I've censored that information. Hopefully the rest is
still of use. One thing I did notice is that Open MPI seems to want to
use sh instead of bash (which is the shell I use). Is that what's
meant by the following lines?

[gamma2.censored_domain:22554] pls:rsh: local csh: 0, local sh: 1
[gamma2.censored_domain:22554] pls:rsh: assuming same remote shell as
local shell
[gamma2.censored_domain:22554] pls:rsh: remote csh: 0, remote sh: 1

If so, is there a way to make it use bash?

Cheers,

Hugh

On 28 Apr 2009, at 22:30, Ralph Castain wrote:

> Okay, that's one small step forward. You can lock that in by setting
> the appropriate MCA parameter in one of two ways:
>
> 1. add the following to your default mca parameter file: btl = tcp,sm,self
> (I added the shared memory subsystem as this will help with
> performance). You can see how to do this here:
>
> http://www.open-mpi.org/faq/?category=tuning#setting-mca-params
>
> 2. add OMPI_MCA_btl=tcp,sm,self to the environment in your .cshrc
> (or equivalent) file
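>
> For example (just a sketch - the exact location of the parameter file
> depends on your installation; the per-user file is typically
> $HOME/.openmpi/mca-params.conf), option 1 would be a line like:
>
>    btl = tcp,sm,self
>
> and option 2, in a csh-style startup file, would be:
>
>    setenv OMPI_MCA_btl tcp,sm,self
>
> (or "export OMPI_MCA_btl=tcp,sm,self" for bash).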
>
> Next, have you looked at the following FAQ:
>
> http://www.open-mpi.org/faq/?category=running#missing-prereqs
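>
> (A quick generic check, not specific to Open MPI, is to run the remote
> shell non-interactively and see what it picks up, e.g.
>
>    ssh nodename 'which mpirun ; echo $LD_LIBRARY_PATH'
>
> - the single quotes make sure the variable is expanded on the remote
> node rather than locally.)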
>
> Are those things all okay? Have you tried providing a complete
> absolute path when running mpirun (e.g., using
> /usr/local/openmpi/bin/mpirun instead of just mpirun on the cmd line)?
>
> Another thing to try: add --debug-devel to the cmd line and send us
> the (probably verbose) output.
>
>
> On Apr 28, 2009, at 3:19 PM, Hugh Dickinson wrote:
>
>> Hi,
>>
>> Yes, I'm using Ethernet connections. Doing as you suggest removes
>> the errors generated by running the small test program, but still
>> doesn't allow programs (including the small test program) to execute
>> on any node other than the one launching mpirun. If I try to do that,
>> the command hangs until I interrupt it, whereupon it gives the same
>> timeout errors. It seems that there must be some problem with the
>> setup of my Open MPI installation. Do you agree, and do you have any
>> idea what it is? Also, is there a global settings file so I can
>> instruct Open MPI to always try only Ethernet?
>>
>> Cheers,
>>
>> Hugh
>>
>> On 28 Apr 2009, at 20:12, Ralph Castain wrote:
>>
>>> In this instance, OMPI is complaining that you are attempting to
>>> use Infiniband, but no suitable devices are found.
>>>
>>> I assume you have Ethernet between your nodes? Can you run this
>>> with the following added to your mpirun cmd line:
>>>
>>> -mca btl tcp,self
>>>
>>> That will cause OMPI to ignore the Infiniband subsystem and
>>> attempt to run via TCP over any available Ethernet.
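>>>
>>> For example, combined with the hostfile invocation quoted further
>>> down this thread, the full command would look something like this
>>> (the program name is just a placeholder for the compiled test code):
>>>
>>>    mpirun -mca btl tcp,self -np 10 --hostfile hostfile ./a.out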
>>>
>>>
>>>
>>> On Tue, Apr 28, 2009 at 12:16 PM, Hugh Dickinson
>>> <h.j.dickinson_at_[hidden]> wrote:
>>> Many thanks for your help nonetheless.
>>>
>>> Hugh
>>>
>>>
>>> On 28 Apr 2009, at 17:23, jody wrote:
>>>
>>> Hi Hugh
>>>
>>> I'm sorry, but I must admit that I have never encountered these
>>> messages, and I don't know exactly what causes them.
>>>
>>> Perhaps one of the developers can give an explanation?
>>>
>>> Jody
>>>
>>> On Tue, Apr 28, 2009 at 5:52 PM, Hugh Dickinson
>>> <h.j.dickinson_at_[hidden]> wrote:
>>> Hi again,
>>>
>>> I tried a simple MPI C++ program:
>>>
>>> --
>>> #include <iostream>
>>> #include <mpi.h>
>>>
>>> using namespace MPI;
>>> using namespace std;
>>>
>>> int main(int argc, char* argv[]) {
>>>   int rank, size;
>>>   Init(argc, argv);
>>>   rank = COMM_WORLD.Get_rank();
>>>   size = COMM_WORLD.Get_size();
>>>   cout << "P:" << rank << " out of " << size << endl;
>>>   Finalize();
>>> }
>>> --
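>>>
>>> For reference, a program like this is built with the Open MPI C++
>>> wrapper compiler and launched with mpirun; the file names here are
>>> just placeholders:
>>>
>>>    mpic++ test.cpp -o test
>>>    mpirun -np 2 --hostfile hostfile ./test
>>>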
>>> It didn't work over all the nodes - again the same problem: the
>>> system seems to hang. However, by forcing mpirun to use only the node
>>> on which I'm launching mpirun, I get some more error messages:
>>>
>>> --
>>> libibverbs: Fatal: couldn't read uverbs ABI version.
>>> libibverbs: Fatal: couldn't read uverbs ABI version.
>>> --------------------------------------------------------------------------
>>> [0,1,0]: OpenIB on host gamma2 was unable to find any HCAs.
>>> Another transport will be used instead, although this may result in
>>> lower performance.
>>> --------------------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> [0,1,1]: OpenIB on host gamma2 was unable to find any HCAs.
>>> Another transport will be used instead, although this may result in
>>> lower performance.
>>> --------------------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> [0,1,1]: uDAPL on host gamma2 was unable to find any NICs.
>>> Another transport will be used instead, although this may result in
>>> lower performance.
>>> --------------------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> [0,1,0]: uDAPL on host gamma2 was unable to find any NICs.
>>> Another transport will be used instead, although this may result in
>>> lower performance.
>>> --------------------------------------------------------------------------
>>> --
>>>
>>> However, as before, the program does work in this special case, and
>>> I get:
>>> --
>>> P:0 out of 2
>>> P:1 out of 2
>>> --
>>>
>>> Do these errors indicate a problem with the Open MPI installation?
>>>
>>> Hugh
>>>
>>> On 28 Apr 2009, at 16:36, Hugh Dickinson wrote:
>>>
>>> Hi Jody,
>>>
>>> I can passwordlessly ssh between all nodes (to and from).
>>> Almost none of these mpirun commands work. The only working case is
>>> if nodenameX is the node from which you are running the command. I
>>> don't know if this gives you extra diagnostic information, but if I
>>> explicitly set the wrong prefix (using --prefix), then I get errors
>>> from all the nodes telling me the daemon would not start. I don't get
>>> these errors normally. It seems to me that the communication is
>>> working okay, at least in the outward direction (and from all nodes).
>>> Could this be a problem with forwarding of standard output? If I were
>>> to try a simple hello world program, is this more likely to work, or
>>> am I just adding another layer of complexity?
>>>
>>> Cheers,
>>>
>>> Hugh
>>>
>>> On 28 Apr 2009, at 15:55, jody wrote:
>>>
>>> Hi Hugh
>>> You're right, there is no initialization command (like lamboot) you
>>> have to call.
>>>
>>> I don't really know why your setup doesn't work, so I'm making some
>>> more "blind shots":
>>>
>>> Can you do passwordless ssh between any two of your nodes?
>>>
>>> Does
>>>   mpirun -np 1 --host nodenameX uptime
>>> work for every X when called from any of your nodes?
>>>
>>> Have you tried
>>>   mpirun -np 2 --host nodename1,nodename2 uptime
>>> (i.e. not using the host file)?
>>>
>>> Jody
>>>
>>> On Tue, Apr 28, 2009 at 4:37 PM, Hugh Dickinson
>>> <h.j.dickinson_at_[hidden]> wrote:
>>>
>>> Hi Jody,
>>>
>>> The node names are exactly the same. I wanted to avoid updating the
>>> version because I'm not the system administrator, and it could take
>>> some time before it gets done. If it's likely to fix the problem,
>>> though, I'll try it. I'm assuming that I don't have to do something
>>> analogous to the old "lamboot" command to initialise Open MPI on all
>>> the nodes. I've seen no documentation anywhere that says I should.
>>>
>>> Cheers,
>>>
>>> Hugh
>>>
>>> On 28 Apr 2009, at 15:28, jody wrote:
>>>
>>> Hi Hugh
>>>
>>> Again, just to make sure, are the hostnames in your host file
>>> well-known?
>>> I.e. when you say you can do
>>> ssh nodename uptime
>>> do you use exactly the same nodename in your host file?
>>> (I'm trying to eliminate all non-Open-MPI error sources,
>>> because with your setup it should basically work.)
>>>
>>> One more point to consider is to update to Open-MPI 1.3.
>>> I don't think your Open-MPI version is the cause of your trouble,
>>> but there have been quite a few changes since v1.2.5.
>>>
>>> Jody
>>>
>>> On Tue, Apr 28, 2009 at 3:22 PM, Hugh Dickinson
>>> <h.j.dickinson_at_[hidden]> wrote:
>>>
>>> Hi Jody,
>>>
>>> Indeed, all the nodes are running the same version of Open MPI.
>>> Perhaps I was incorrect to describe the cluster as heterogeneous. In
>>> fact, all the nodes run the same operating system (Scientific Linux
>>> 5.2); it's only the hardware that's different, and even then they're
>>> all i386 or i686. I'm also attaching the output of ompi_info --all,
>>> as I've seen it's suggested in the mailing list instructions.
>>>
>>> Cheers,
>>>
>>> Hugh
>>>
>>> Hi Hugh
>>>
>>> Just to make sure:
>>> You have installed Open-MPI on all your nodes?
>>> Same version everywhere?
>>>
>>> Jody
>>>
>>> On Tue, Apr 28, 2009 at 12:57 PM, Hugh Dickinson
>>> <h.j.dickinson_at_[hidden]> wrote:
>>>
>>> Hi all,
>>>
>>> First of all, let me make it perfectly clear that I'm a complete
>>> beginner as far as MPI is concerned, so this may well be a trivial
>>> problem!
>>>
>>> I've tried to set up Open MPI to use SSH to communicate between nodes
>>> on a heterogeneous cluster. I've set up passwordless SSH and it seems
>>> to be working fine. For example, by hand I can do:
>>>
>>> ssh nodename uptime
>>>
>>> and it returns the appropriate information for each node.
>>> I then tried running a non-MPI program on all the nodes at the same
>>> time:
>>>
>>> mpirun -np 10 --hostfile hostfile uptime
>>>
>>> Where hostfile is a list of the 10 cluster node names, with slots=1
>>> after each one, i.e.:
>>>
>>> nodename1 slots=1
>>> nodename2 slots=1
>>> etc...
>>>
>>> Nothing happens! The process just seems to hang. If I interrupt the
>>> process with Ctrl-C I get:
>>>
>>> "
>>>
>>> mpirun: killing job...
>>>
>>> [gamma2.phyastcl.dur.ac.uk:18124] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
>>> [gamma2.phyastcl.dur.ac.uk:18124] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1166
>>>
>>>
>>> --------------------------------------------------------------------------
>>> WARNING: mpirun has exited before it received notification that all
>>> started processes had terminated. You should double check and ensure
>>> that there are no runaway processes still executing.
>>> --------------------------------------------------------------------------
>>>
>>> "
>>>
>>> If, instead of using the hostfile, I specify on the command line the
>>> host from which I'm running mpirun, e.g.:
>>>
>>> mpirun -np 1 --host nodename uptime
>>>
>>> then it works (i.e. if it doesn't need to communicate with other
>>> nodes). Do I need to tell Open MPI it should be using SSH to
>>> communicate? If so, how do I do this? To be honest I think it's
>>> trying to do so, because before I set up passwordless SSH it
>>> challenged me for lots of passwords.
>>>
>>> I'm running Open MPI 1.2.5 installed with Scientific Linux 5.2. Let
>>> me reiterate: it's very likely that I've done something stupid, so
>>> all suggestions are welcome.
>>>
>>> Cheers,
>>>
>>> Hugh
>>>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users