Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] ORTE_ERROR_LOG: Timeout in file
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-04-28 19:11:57


As best I can tell, the remote orted never got executed - it looks to
me like something is blocking the ssh from working. Can you get
into another window and ssh to the remote node? If so, can you do a ps
and verify that the orted is actually running there?
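For example, something like this in a second window (nodenameX being a
placeholder for one of your remote nodes):

   ssh nodenameX
   ps -ef | grep orted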

mpirun uses the same shell on the remote end as the one you are using
when you start it. One thing that strikes me as strange is that your
entire environment is being sent along - I'll have to look at the
1.2.x code, as I didn't think we were doing that (it's been a long
time since I looked, though).

On Apr 28, 2009, at 4:57 PM, Hugh Dickinson wrote:

> As far as I can tell, both the PATH and LD_LIBRARY_PATH are set
> correctly. I've tried with the full path to the mpirun executable
> and using the --prefix command line option. Neither works. The debug
> output seems to contain a lot of system-specific information (IPs,
> usernames and such), which I'm a little reluctant to share on an open
> mailing list, so I've censored it. Hopefully the
> rest is still of use. One thing I did notice is that Open MPI seems
> to want to use sh instead of bash (which is the shell I use). Is
> that what's meant by the following lines?
>
> [gamma2.censored_domain:22554] pls:rsh: local csh: 0, local sh: 1
> [gamma2.censored_domain:22554] pls:rsh: assuming same remote shell
> as local shell
> [gamma2.censored_domain:22554] pls:rsh: remote csh: 0, remote sh: 1
>
> If so is there a way to make it use bash?
>
> Cheers,
>
> Hugh
> <debug_output>
>
> On 28 Apr 2009, at 22:30, Ralph Castain wrote:
>
>> Okay, that's one small step forward. You can lock that in by
>> setting the appropriate MCA parameter in one of two ways:
>>
>> 1. add the following to your default mca parameter file: btl =
>> tcp,sm,self (I added the shared memory subsystem as this will help
>> with performance). You can see how to do this here:
>>
>> http://www.open-mpi.org/faq/?category=tuning#setting-mca-params
>>
>> 2. add OMPI_MCA_btl=tcp,sm,self to the environment in your .cshrc
>> (or equivalent) file
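>>
>> For example (a sketch - by default Open MPI reads
>> $HOME/.openmpi/mca-params.conf, and the second form assumes a
>> csh-style shell):
>>
>>   # in $HOME/.openmpi/mca-params.conf
>>   btl = tcp,sm,self
>>
>>   # or in ~/.cshrc
>>   setenv OMPI_MCA_btl tcp,sm,self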
>>
>> Next, have you looked at the following FAQ:
>>
>> http://www.open-mpi.org/faq/?category=running#missing-prereqs
>>
>> Are those things all okay? Have you tried providing a complete
>> absolute path when running mpirun (e.g., using
>> /usr/local/openmpi/bin/mpirun instead of just mpirun on the cmd line)?
>>
>> Another thing to try: add --debug-devel to the cmd line and send us
>> the (probably verbose) output.
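>>
>> For example (reusing your hostfile command from earlier in the
>> thread):
>>
>>   mpirun --debug-devel -np 10 --hostfile hostfile uptime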
>>
>>
>> On Apr 28, 2009, at 3:19 PM, Hugh Dickinson wrote:
>>
>>> Hi,
>>>
>>> Yes, I'm using Ethernet connections. Doing as you suggest removes
>>> the errors generated by running the small test program, but still
>>> doesn't allow programs (including the small test program) to
>>> execute on any node other than the one launching mpirun. If I try
>>> to do that, the command hangs until I interrupt it, whereupon it
>>> gives the same timeout errors. It seems that there must be some
>>> problem with the setup of my Open MPI installation. Do you agree,
>>> and do you have any idea what it is? Also, is there a global
>>> settings file I can use to instruct Open MPI to always try only
>>> Ethernet?
>>>
>>> Cheers,
>>>
>>> Hugh
>>>
>>> On 28 Apr 2009, at 20:12, Ralph Castain wrote:
>>>
>>>> In this instance, OMPI is complaining that you are attempting to
>>>> use InfiniBand, but no suitable devices are found.
>>>>
>>>> I assume you have Ethernet between your nodes? Can you run this
>>>> with the following added to your mpirun cmd line:
>>>>
>>>> -mca btl tcp,self
>>>>
>>>> That will cause OMPI to ignore the InfiniBand subsystem and
>>>> attempt to run via TCP over any available Ethernet.
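>>>>
>>>> For example, combined with your earlier command (hostfile as
>>>> before):
>>>>
>>>>   mpirun -mca btl tcp,self -np 10 --hostfile hostfile uptime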
>>>>
>>>>
>>>>
>>>> On Tue, Apr 28, 2009 at 12:16 PM, Hugh Dickinson
>>>> <h.j.dickinson_at_[hidden]> wrote:
>>>> Many thanks for your help nonetheless.
>>>>
>>>> Hugh
>>>>
>>>>
>>>> On 28 Apr 2009, at 17:23, jody wrote:
>>>>
>>>> Hi Hugh
>>>>
>>>> I'm sorry, but I must admit that I have never encountered these
>>>> messages, and I don't know exactly what causes them.
>>>>
>>>> Perhaps one of the developers can give an explanation?
>>>>
>>>> Jody
>>>>
>>>> On Tue, Apr 28, 2009 at 5:52 PM, Hugh Dickinson
>>>> <h.j.dickinson_at_[hidden]> wrote:
>>>> Hi again,
>>>>
>>>> I tried a simple MPI C++ program:
>>>>
>>>> --
>>>> #include <iostream>
>>>> #include <mpi.h>
>>>>
>>>> using namespace MPI;
>>>> using namespace std;
>>>>
>>>> int main(int argc, char* argv[]) {
>>>>   int rank, size;
>>>>   Init(argc, argv);
>>>>   rank = COMM_WORLD.Get_rank();
>>>>   size = COMM_WORLD.Get_size();
>>>>   cout << "P:" << rank << " out of " << size << endl;
>>>>   Finalize();
>>>>   return 0;
>>>> }
>>>> --
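>>>>
>>>> For reference, such a program would typically be compiled and
>>>> launched with something like the following (hello.cpp being a
>>>> placeholder name for the source above):
>>>>
>>>>   mpic++ hello.cpp -o hello
>>>>   mpirun -np 2 ./hello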
>>>> It didn't work across all the nodes - again the same problem: the
>>>> system seems to hang. However, by forcing mpirun to use only the
>>>> node on which I'm launching it, I get some more error messages:
>>>>
>>>> --
>>>> libibverbs: Fatal: couldn't read uverbs ABI version.
>>>> libibverbs: Fatal: couldn't read uverbs ABI version.
>>>> --------------------------------------------------------------------------
>>>> [0,1,0]: OpenIB on host gamma2 was unable to find any HCAs.
>>>> Another transport will be used instead, although this may result in
>>>> lower performance.
>>>> --------------------------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> [0,1,1]: OpenIB on host gamma2 was unable to find any HCAs.
>>>> Another transport will be used instead, although this may result in
>>>> lower performance.
>>>> --------------------------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> [0,1,1]: uDAPL on host gamma2 was unable to find any NICs.
>>>> Another transport will be used instead, although this may result in
>>>> lower performance.
>>>> --------------------------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> [0,1,0]: uDAPL on host gamma2 was unable to find any NICs.
>>>> Another transport will be used instead, although this may result in
>>>> lower performance.
>>>> --------------------------------------------------------------------------
>>>> --
>>>>
>>>> However, as before, the program does work in this special case,
>>>> and I get:
>>>> --
>>>> P:0 out of 2
>>>> P:1 out of 2
>>>> --
>>>>
>>>> Do these errors indicate a problem with the Open MPI installation?
>>>>
>>>> Hugh
>>>>
>>>> On 28 Apr 2009, at 16:36, Hugh Dickinson wrote:
>>>>
>>>> Hi Jody,
>>>>
>>>> I can passwordlessly ssh between all nodes (to and from).
>>>> Almost none of these mpirun commands work. The only working case
>>>> is if nodenameX is the node from which you are running the
>>>> command. I don't know if this gives you extra diagnostic
>>>> information, but if I explicitly set the wrong prefix (using
>>>> --prefix), then I get errors from all the nodes telling me the
>>>> daemon would not start. I don't get these errors normally. It
>>>> seems to me that the communication is working okay, at least in
>>>> the outward direction (and from all nodes). Could this be a
>>>> problem with forwarding of standard output? If I were to try a
>>>> simple hello world program, is it more likely to work, or am I
>>>> just adding another layer of complexity?
>>>>
>>>> Cheers,
>>>>
>>>> Hugh
>>>>
>>>> On 28 Apr 2009, at 15:55, jody wrote:
>>>>
>>>> Hi Hugh
>>>> You're right, there is no initialization command (like lamboot)
>>>> that you have to call.
>>>>
>>>> I don't really know why your setup doesn't work, so I'm making
>>>> some more "blind shots":
>>>>
>>>> Can you do passwordless ssh between any two of your nodes?
>>>>
>>>> Does
>>>> mpirun -np 1 --host nodenameX uptime
>>>> work for every X when called from any of your nodes?
>>>>
>>>> Have you tried
>>>> mpirun -np 2 --host nodename1,nodename2 uptime
>>>> (i.e., not using the host file)?
>>>>
>>>> Jody
>>>>
>>>> On Tue, Apr 28, 2009 at 4:37 PM, Hugh Dickinson
>>>> <h.j.dickinson_at_[hidden]> wrote:
>>>>
>>>> Hi Jody,
>>>>
>>>> The node names are exactly the same. I wanted to avoid updating
>>>> the version because I'm not the system administrator, and it could
>>>> take some time before it gets done. If it's likely to fix the
>>>> problem, though, I'll try it. I'm assuming that I don't have to do
>>>> something analogous to the old "lamboot" command to initialise
>>>> Open MPI on all the nodes; I've seen no documentation anywhere
>>>> that says I should.
>>>>
>>>> Cheers,
>>>>
>>>> Hugh
>>>>
>>>> On 28 Apr 2009, at 15:28, jody wrote:
>>>>
>>>> Hi Hugh
>>>>
>>>> Again, just to make sure, are the hostnames in your host file
>>>> well-known?
>>>> I.e. when you say you can do
>>>> ssh nodename uptime
>>>> do you use exactly the same nodename in your host file?
>>>> (I'm trying to eliminate all non-Open-MPI error sources,
>>>> because with your setup it should basically work.)
>>>>
>>>> One more point to consider is updating to Open MPI 1.3.
>>>> I don't think your Open MPI version is the cause of your trouble,
>>>> but there have been quite a few changes since v1.2.5.
>>>>
>>>> Jody
>>>>
>>>> On Tue, Apr 28, 2009 at 3:22 PM, Hugh Dickinson
>>>> <h.j.dickinson_at_[hidden]> wrote:
>>>>
>>>> Hi Jody,
>>>>
>>>> Indeed, all the nodes are running the same version of Open MPI.
>>>> Perhaps I was incorrect to describe the cluster as heterogeneous.
>>>> In fact, all the nodes run the same operating system (Scientific
>>>> Linux 5.2); it's only the hardware that's different, and even then
>>>> they're all i386 or i686. I'm also attaching the output of
>>>> ompi_info --all, as I've seen it's suggested in the mailing list
>>>> instructions.
>>>>
>>>> Cheers,
>>>>
>>>> Hugh
>>>>
>>>> Hi Hugh
>>>>
>>>> Just to make sure:
>>>> You have installed Open MPI on all your nodes?
>>>> Same version everywhere?
>>>>
>>>> Jody
>>>>
>>>> On Tue, Apr 28, 2009 at 12:57 PM, Hugh Dickinson
>>>> <h.j.dickinson_at_[hidden]> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> First of all, let me make it perfectly clear that I'm a complete
>>>> beginner as far as MPI is concerned, so this may well be a trivial
>>>> problem!
>>>>
>>>> I've tried to set up Open MPI to use SSH to communicate between
>>>> nodes on a heterogeneous cluster. I've set up passwordless SSH and
>>>> it seems to be working fine. For example, by hand I can do:
>>>>
>>>> ssh nodename uptime
>>>>
>>>> and it returns the appropriate information for each node.
>>>> I then tried running a non-MPI program on all the nodes at the same
>>>> time:
>>>>
>>>> mpirun -np 10 --hostfile hostfile uptime
>>>>
>>>> where hostfile is a list of the 10 cluster node names with
>>>> slots=1 after each one, i.e.:
>>>>
>>>> nodename1 slots=1
>>>> nodename2 slots=1
>>>> etc...
>>>>
>>>> Nothing happens! The process just seems to hang. If I interrupt
>>>> the process with Ctrl-C, I get:
>>>>
>>>> "
>>>>
>>>> mpirun: killing job...
>>>>
>>>> [gamma2.phyastcl.dur.ac.uk:18124] [0,0,0] ORTE_ERROR_LOG: Timeout in
>>>> file base/pls_base_orted_cmds.c at line 275
>>>> [gamma2.phyastcl.dur.ac.uk:18124] [0,0,0] ORTE_ERROR_LOG: Timeout in
>>>> file pls_rsh_module.c at line 1166
>>>>
>>>>
>>>> --------------------------------------------------------------------------
>>>> WARNING: mpirun has exited before it received notification that all
>>>> started processes had terminated. You should double check and ensure
>>>> that there are no runaway processes still executing.
>>>>
>>>>
>>>> --------------------------------------------------------------------------
>>>>
>>>> "
>>>>
>>>> If, instead of using the hostfile, I specify on the command line
>>>> the host from which I'm running mpirun, e.g.:
>>>>
>>>> mpirun -np 1 --host nodename uptime
>>>>
>>>> then it works (i.e., if it doesn't need to communicate with other
>>>> nodes). Do I need to tell Open MPI it should be using SSH to
>>>> communicate? If so, how do I do this? To be honest, I think it's
>>>> trying to do so, because before I set up passwordless SSH it
>>>> challenged me for lots of passwords.
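>>>>
>>>> For reference, passwordless SSH is typically set up with something
>>>> like the following (assuming RSA keys and that ssh-copy-id is
>>>> available), repeated for each node:
>>>>
>>>>   ssh-keygen -t rsa
>>>>   ssh-copy-id nodename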
>>>>
>>>> I'm running Open MPI 1.2.5 installed with Scientific Linux 5.2.
>>>> Let me reiterate, it's very likely that I've done something
>>>> stupid, so all suggestions are welcome.
>>>>
>>>> Cheers,
>>>>
>>>> Hugh
>>>>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users