Subject: Re: [OMPI users] ORTE_ERROR_LOG: Timeout in file
From: Hugh Dickinson (h.j.dickinson_at_[hidden])
Date: 2009-04-29 04:50:41


The remote node starts the following process when mpirun is executed
on the local node:

25734 ? Ss 0:00 /usr/lib/openmpi/1.2.5-gcc/bin/orted --bootproxy 1 --

I checked and it was not running before mpirun was executed.

I'll look into installing a more recent version of Open MPI.

Hugh

On 29 Apr 2009, at 00:11, Ralph Castain wrote:

> Best I can tell, the remote orted never got executed - it looks to
> me like there is something that blocks the ssh from working. Can you
> get into another window and ssh to the remote node? If so, can you
> do a ps and verify that the orted is actually running there?
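>
> For example, from another window (nodename being a placeholder):
>
> ssh nodename ps ax | grep orted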
>
> mpirun is using the same shell on the remote end as you are using
> when you start it. One strange thing I see is that your entire
> environment is being sent along - I'll have to look at the 1.2.x
> code, as I didn't think we were doing that (it's been a long time
> since I looked, though).
>
>
> On Apr 28, 2009, at 4:57 PM, Hugh Dickinson wrote:
>
>> As far as I can tell, both the PATH and LD_LIBRARY_PATH are set
>> correctly. I've tried with the full path to the mpirun executable
>> and using the --prefix command line option. Neither works. The
>> debug output seems to contain a lot of system-specific information
>> (IPs, usernames and such), which I'm a little reluctant to share on
>> an open mailing list, so I've censored that information. Hopefully
>> the rest is still of use. One thing I did notice is that Open MPI
>> seems to want to use sh instead of bash (which is the shell I use).
>> Is that what's meant by the following lines?
>>
>> [gamma2.censored_domain:22554] pls:rsh: local csh: 0, local sh: 1
>> [gamma2.censored_domain:22554] pls:rsh: assuming same remote shell
>> as local shell
>> [gamma2.censored_domain:22554] pls:rsh: remote csh: 0, remote sh: 1
>>
>> If so is there a way to make it use bash?
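>>
>> (One way to check the remote login shell would be something like
>>
>> ssh nodename 'echo $SHELL'
>>
>> with nodename a placeholder for one of the remote nodes.)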
>>
>> Cheers,
>>
>> Hugh
>> <debug_output>
>>
>> On 28 Apr 2009, at 22:30, Ralph Castain wrote:
>>
>>> Okay, that's one small step forward. You can lock that in by
>>> setting the appropriate MCA parameter in one of two ways:
>>>
>>> 1. add the following to your default mca parameter file: btl =
>>> tcp,sm,self (I added the shared memory subsystem as this will help
>>> with performance). You can see how to do this here:
>>>
>>> http://www.open-mpi.org/faq/?category=tuning#setting-mca-params
>>>
>>> 2. add OMPI_MCA_btl=tcp,sm,self to the environment in your .cshrc
>>> (or equivalent) file
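>>>
>>> For example, a minimal sketch of both options (assuming the usual
>>> per-user file $HOME/.openmpi/mca-params.conf, and showing both
>>> bash and csh syntax for the environment variable):
>>>
>>> # option 1: per-user MCA parameter file
>>> mkdir -p $HOME/.openmpi
>>> echo "btl = tcp,sm,self" >> $HOME/.openmpi/mca-params.conf
>>>
>>> # option 2: environment variable
>>> export OMPI_MCA_btl=tcp,sm,self    # bash
>>> setenv OMPI_MCA_btl tcp,sm,self    # csh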
>>>
>>> Next, have you looked at the following FAQ:
>>>
>>> http://www.open-mpi.org/faq/?category=running#missing-prereqs
>>>
>>> Are those things all okay? Have you tried providing a complete
>>> absolute path when running mpirun (e.g., using
>>> /usr/local/openmpi/bin/mpirun instead of just mpirun on the cmd
>>> line)?
>>>
>>> Another thing to try: add --debug-devel to the cmd line and send
>>> us the (probably verbose) output.
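>>>
>>> For example, reusing your earlier command line:
>>>
>>> mpirun --debug-devel -np 10 --hostfile hostfile uptime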
>>>
>>>
>>> On Apr 28, 2009, at 3:19 PM, Hugh Dickinson wrote:
>>>
>>>> Hi,
>>>>
>>>> Yes, I'm using Ethernet connections. Doing as you suggest removes
>>>> the errors generated by running the small test program, but still
>>>> doesn't allow programs (including the small test program) to
>>>> execute on any node other than the one launching mpirun. If I try
>>>> to do that, the command hangs until I interrupt it, whereupon it
>>>> gives the same timeout errors. It seems that there must be some
>>>> problem with the setup of my Open MPI installation. Do you agree,
>>>> and do you have any idea what it is? Also, is there a global
>>>> settings file so I can instruct Open MPI to always try only
>>>> Ethernet?
>>>>
>>>> Cheers,
>>>>
>>>> Hugh
>>>>
>>>> On 28 Apr 2009, at 20:12, Ralph Castain wrote:
>>>>
>>>>> In this instance, OMPI is complaining that you are attempting to
>>>>> use Infiniband, but no suitable devices are found.
>>>>>
>>>>> I assume you have Ethernet between your nodes? Can you run this
>>>>> with the following added to your mpirun cmd line:
>>>>>
>>>>> -mca btl tcp,self
>>>>>
>>>>> That will cause OMPI to ignore the Infiniband subsystem and
>>>>> attempt to run via TCP over any available Ethernet.
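>>>>>
>>>>> For example, with the hostfile from your first mail:
>>>>>
>>>>> mpirun -mca btl tcp,self -np 10 --hostfile hostfile uptime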
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Apr 28, 2009 at 12:16 PM, Hugh Dickinson
>>>>> <h.j.dickinson_at_[hidden]> wrote:
>>>>> Many thanks for your help nonetheless.
>>>>>
>>>>> Hugh
>>>>>
>>>>>
>>>>> On 28 Apr 2009, at 17:23, jody wrote:
>>>>>
>>>>> Hi Hugh
>>>>>
>>>>> I'm sorry, but I must admit that I have never encountered these
>>>>> messages, and I don't know exactly what causes them.
>>>>>
>>>>> Perhaps one of the developers can give an explanation?
>>>>>
>>>>> Jody
>>>>>
>>>>> On Tue, Apr 28, 2009 at 5:52 PM, Hugh Dickinson
>>>>> <h.j.dickinson_at_[hidden]> wrote:
>>>>> Hi again,
>>>>>
>>>>> I tried a simple MPI C++ program:
>>>>>
>>>>> --
>>>>> #include <iostream>
>>>>> #include <mpi.h>
>>>>>
>>>>> using namespace MPI;
>>>>> using namespace std;
>>>>>
>>>>> int main(int argc, char* argv[]) {
>>>>>     int rank, size;
>>>>>     Init(argc, argv);
>>>>>     rank = COMM_WORLD.Get_rank();
>>>>>     size = COMM_WORLD.Get_size();
>>>>>     cout << "P:" << rank << " out of " << size << endl;
>>>>>     Finalize();
>>>>>     return 0;
>>>>> }
>>>>> --
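>>>>>
>>>>> (Built with something like
>>>>>
>>>>> mpic++ test.cpp -o test
>>>>>
>>>>> where test.cpp is just a placeholder name for the source file,
>>>>> and launched with mpirun as before.)
>>>>>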
>>>>> It didn't work over all the nodes - again the same problem: the
>>>>> system seems to hang. However, by forcing mpirun to use only the
>>>>> node on which I'm launching mpirun, I get some more error messages:
>>>>>
>>>>> --
>>>>> libibverbs: Fatal: couldn't read uverbs ABI version.
>>>>> libibverbs: Fatal: couldn't read uverbs ABI version.
>>>>> --------------------------------------------------------------------------
>>>>> [0,1,0]: OpenIB on host gamma2 was unable to find any HCAs.
>>>>> Another transport will be used instead, although this may result in
>>>>> lower performance.
>>>>> --------------------------------------------------------------------------
>>>>> --------------------------------------------------------------------------
>>>>> [0,1,1]: OpenIB on host gamma2 was unable to find any HCAs.
>>>>> Another transport will be used instead, although this may result in
>>>>> lower performance.
>>>>> --------------------------------------------------------------------------
>>>>> --------------------------------------------------------------------------
>>>>> [0,1,1]: uDAPL on host gamma2 was unable to find any NICs.
>>>>> Another transport will be used instead, although this may result in
>>>>> lower performance.
>>>>> --------------------------------------------------------------------------
>>>>> --------------------------------------------------------------------------
>>>>> [0,1,0]: uDAPL on host gamma2 was unable to find any NICs.
>>>>> Another transport will be used instead, although this may result in
>>>>> lower performance.
>>>>> --------------------------------------------------------------------------
>>>>> --
>>>>>
>>>>> However, as before the program does work in this special case,
>>>>> and I get:
>>>>> --
>>>>> P:0 out of 2
>>>>> P:1 out of 2
>>>>> --
>>>>>
>>>>> Do these errors indicate a problem with the Open MPI installation?
>>>>>
>>>>> Hugh
>>>>>
>>>>> On 28 Apr 2009, at 16:36, Hugh Dickinson wrote:
>>>>>
>>>>> Hi Jody,
>>>>>
>>>>> I can passwordlessly ssh between all nodes (to and from).
>>>>> Almost none of these mpirun commands work. The only working case
>>>>> is if nodenameX is the node from which you are running the
>>>>> command. I don't know if this gives you extra diagnostic
>>>>> information, but if I explicitly set the wrong prefix (using
>>>>> --prefix), then I get errors from all the nodes telling me the
>>>>> daemon would not start. I don't get these errors normally. It
>>>>> seems to me that the communication is working okay, at least in
>>>>> the outwards direction (and from all nodes). Could this be a
>>>>> problem with forwarding of standard output? If I were to try a
>>>>> simple hello world program, is it more likely to work, or am I
>>>>> just adding another layer of complexity?
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Hugh
>>>>>
>>>>> On 28 Apr 2009, at 15:55, jody wrote:
>>>>>
>>>>> Hi Hugh
>>>>> You're right, there is no initialization command (like lamboot)
>>>>> that you have to call.
>>>>>
>>>>> I don't really know why your setup doesn't work, so I'm making
>>>>> some more "blind shots":
>>>>>
>>>>> Can you do passwordless ssh between any two of your nodes?
>>>>>
>>>>> Does
>>>>> mpirun -np 1 --host nodenameX uptime
>>>>> work for every X when called from any of your nodes?
>>>>>
>>>>> Have you tried
>>>>> mpirun -np 2 --host nodename1,nodename2 uptime
>>>>> (i.e. not using the host file)?
>>>>>
>>>>> Jody
>>>>>
>>>>> On Tue, Apr 28, 2009 at 4:37 PM, Hugh Dickinson
>>>>> <h.j.dickinson_at_[hidden]> wrote:
>>>>>
>>>>> Hi Jody,
>>>>>
>>>>> The node names are exactly the same. I wanted to avoid updating
>>>>> the version because I'm not the system administrator, and it
>>>>> could take some time before it gets done. If it's likely to fix
>>>>> the problem, though, I'll try it. I'm assuming that I don't have
>>>>> to do something analogous to the old "lamboot" command to
>>>>> initialise Open MPI on all the nodes. I've seen no documentation
>>>>> anywhere that says I should.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Hugh
>>>>>
>>>>> On 28 Apr 2009, at 15:28, jody wrote:
>>>>>
>>>>> Hi Hugh
>>>>>
>>>>> Again, just to make sure, are the hostnames in your host file
>>>>> well-known?
>>>>> I.e. when you say you can do
>>>>> ssh nodename uptime
>>>>> do you use exactly the same nodename in your host file?
>>>>> (I'm trying to eliminate all non-Open-MPI error sources,
>>>>> because with your setup it should basically work.)
>>>>>
>>>>> One more point to consider is updating to Open MPI 1.3.
>>>>> I don't think your Open MPI version is the cause of your trouble,
>>>>> but there have been quite a few changes since v1.2.5.
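>>>>>
>>>>> (If you build from source, the usual recipe is something like
>>>>>
>>>>> ./configure --prefix=$HOME/openmpi-1.3
>>>>> make all install
>>>>>
>>>>> with the prefix a placeholder; your sysadmin may prefer packages.)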
>>>>>
>>>>> Jody
>>>>>
>>>>> On Tue, Apr 28, 2009 at 3:22 PM, Hugh Dickinson
>>>>> <h.j.dickinson_at_[hidden]> wrote:
>>>>>
>>>>> Hi Jody,
>>>>>
>>>>> Indeed, all the nodes are running the same version of Open MPI.
>>>>> Perhaps I was incorrect to describe the cluster as heterogeneous.
>>>>> In fact, all the nodes run the same operating system (Scientific
>>>>> Linux 5.2); it's only the hardware that's different, and even
>>>>> then they're all i386 or i686. I'm also attaching the output of
>>>>> ompi_info --all, as I've seen it's suggested in the mailing list
>>>>> instructions.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Hugh
>>>>>
>>>>> Hi Hugh
>>>>>
>>>>> Just to make sure:
>>>>> You have installed Open-MPI on all your nodes?
>>>>> Same version everywhere?
>>>>>
>>>>> Jody
>>>>>
>>>>> On Tue, Apr 28, 2009 at 12:57 PM, Hugh Dickinson
>>>>> <h.j.dickinson_at_[hidden]> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> First of all, let me make it perfectly clear that I'm a complete
>>>>> beginner as far as MPI is concerned, so this may well be a
>>>>> trivial problem!
>>>>>
>>>>> I've tried to set up Open MPI to use SSH to communicate between
>>>>> nodes on a heterogeneous cluster. I've set up passwordless SSH
>>>>> and it seems to be working fine. For example, by hand I can do:
>>>>>
>>>>> ssh nodename uptime
>>>>>
>>>>> and it returns the appropriate information for each node.
>>>>> I then tried running a non-MPI program on all the nodes at the
>>>>> same time:
>>>>>
>>>>> mpirun -np 10 --hostfile hostfile uptime
>>>>>
>>>>> where hostfile is a list of the 10 cluster node names with
>>>>> slots=1 after each one, i.e.
>>>>>
>>>>> nodename1 slots=1
>>>>> nodename2 slots=1
>>>>> etc...
>>>>>
>>>>> Nothing happens! The process just seems to hang. If I interrupt
>>>>> the process with Ctrl-C I get:
>>>>>
>>>>> "
>>>>>
>>>>> mpirun: killing job...
>>>>>
>>>>> [gamma2.phyastcl.dur.ac.uk:18124] [0,0,0] ORTE_ERROR_LOG: Timeout
>>>>> in file base/pls_base_orted_cmds.c at line 275
>>>>> [gamma2.phyastcl.dur.ac.uk:18124] [0,0,0] ORTE_ERROR_LOG: Timeout
>>>>> in file pls_rsh_module.c at line 1166
>>>>>
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> WARNING: mpirun has exited before it received notification that all
>>>>> started processes had terminated. You should double check and ensure
>>>>> that there are no runaway processes still executing.
>>>>>
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> "
>>>>>
>>>>> If, instead of using the hostfile, I specify on the command line
>>>>> the host from which I'm running mpirun, e.g.:
>>>>>
>>>>> mpirun -np 1 --host nodename uptime
>>>>>
>>>>> then it works (i.e. if it doesn't need to communicate with other
>>>>> nodes). Do I need to tell Open MPI it should be using SSH to
>>>>> communicate? If so, how do I do this? To be honest I think it's
>>>>> trying to do so, because before I set up passwordless SSH it
>>>>> challenged me for lots of passwords.
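>>>>>
>>>>> (Is there perhaps an MCA parameter for this? Something like
>>>>>
>>>>> mpirun -mca pls_rsh_agent ssh -np 10 --hostfile hostfile uptime
>>>>>
>>>>> assuming pls_rsh_agent is the right parameter name for 1.2.x?)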
>>>>>
>>>>> I'm running Open MPI 1.2.5 installed with Scientific Linux 5.2.
>>>>> Let me reiterate: it's very likely that I've done something
>>>>> stupid, so all suggestions are welcome.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Hugh
>>>>>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users