Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] ORTE_ERROR_LOG: Timeout in file
From: Hugh Dickinson (h.j.dickinson_at_[hidden])
Date: 2009-04-28 14:16:58


Many thanks for your help nonetheless.

Hugh

On 28 Apr 2009, at 17:23, jody wrote:

> Hi Hugh
>
> I'm sorry, but I must admit that I have never encountered these
> messages, and I don't know exactly what causes them.
>
> Perhaps one of the developers can give an explanation?
>
> Jody
>
> On Tue, Apr 28, 2009 at 5:52 PM, Hugh Dickinson
> <h.j.dickinson_at_[hidden]> wrote:
>> Hi again,
>>
>> I tried a simple MPI C++ program:
>>
>> --
>> #include <iostream>
>> #include <mpi.h>
>>
>> using namespace MPI;
>> using namespace std;
>>
>> int main(int argc, char* argv[]) {
>>     int rank, size;
>>     Init(argc, argv);                // start up the MPI runtime
>>     rank = COMM_WORLD.Get_rank();    // this process's rank
>>     size = COMM_WORLD.Get_size();    // total number of processes
>>     cout << "P:" << rank << " out of " << size << endl;
>>     Finalize();                      // shut down the MPI runtime
>>     return 0;
>> }
>> --
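>>
>> (For completeness, it was built with the Open MPI wrapper compiler
>> and launched in the usual way, along the lines of:
>>
>> mpic++ hello.cpp -o hello
>> mpirun -np 2 --hostfile hostfile ./hello
>>
>> where hello.cpp is just an illustrative name for the file above.)
>>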
>> It didn't work across all the nodes - same problem again: the
>> system seems to hang. However, by forcing mpirun to use only the
>> node on which I'm launching mpirun, I get some more error messages:
>>
>> --
>> libibverbs: Fatal: couldn't read uverbs ABI version.
>> libibverbs: Fatal: couldn't read uverbs ABI version.
>> --------------------------------------------------------------------------
>> [0,1,0]: OpenIB on host gamma2 was unable to find any HCAs.
>> Another transport will be used instead, although this may result in
>> lower performance.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> [0,1,1]: OpenIB on host gamma2 was unable to find any HCAs.
>> Another transport will be used instead, although this may result in
>> lower performance.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> [0,1,1]: uDAPL on host gamma2 was unable to find any NICs.
>> Another transport will be used instead, although this may result in
>> lower performance.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> [0,1,0]: uDAPL on host gamma2 was unable to find any NICs.
>> Another transport will be used instead, although this may result in
>> lower performance.
>> --------------------------------------------------------------------------
>> --
>>
>> However, as before, the program does work in this special case, and
>> I get:
>> --
>> P:0 out of 2
>> P:1 out of 2
>> --
>>
>> Do these errors indicate a problem with the Open MPI installation?
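>>
>> (As an aside, if I understand these warnings correctly they just
>> mean that no InfiniBand/uDAPL hardware was found, and Open MPI falls
>> back to TCP. Something along the lines of
>>
>> mpirun --mca btl tcp,self -np 2 ./hello
>>
>> should restrict it to the TCP and self transports up front and
>> silence them, though I haven't verified that.)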
>>
>> Hugh
>>
>> On 28 Apr 2009, at 16:36, Hugh Dickinson wrote:
>>
>>> Hi Jody,
>>>
>>> I can passwordlessly ssh between all nodes (to and from).
>>> Almost none of these mpirun commands work. The only working case
>>> is if
>>> nodenameX is the node from which you are running the command. I
>>> don't know
>>> if this gives you extra diagnostic information, but if I
>>> explicitly set the
>>> wrong prefix (using --prefix), then I get errors from all the
>>> nodes telling
>>> me the daemon would not start. I don't get these errors normally.
>>> It seems
>>> to me that the communication is working okay, at least in the
>>> outwards
>>> direction (and from all nodes). Could this be a problem with
>>> forwarding of
>>> standard output? If I were to try a simple hello world program, is
>>> this more
>>> likely to work, or am I just adding another layer of complexity?
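>>>
>>> (For reference, the wrong-prefix test was of the form
>>>
>>> mpirun --prefix /some/wrong/path -np 2 --hostfile hostfile uptime
>>>
>>> with the path deliberately not pointing at an Open MPI
>>> installation; the exact path isn't important.)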
>>>
>>> Cheers,
>>>
>>> Hugh
>>>
>>> On 28 Apr 2009, at 15:55, jody wrote:
>>>
>>>> Hi Hugh
>>>> You're right, there is no initialization command (like lamboot)
>>>> that you have to call.
>>>>
>>>> I don't really know why your setup doesn't work, so I'm taking a
>>>> few more shots in the dark:
>>>>
>>>> Can you do passwordless ssh between any two of your nodes?
>>>>
>>>> Does
>>>> mpirun -np 1 --host nodenameX uptime
>>>> work for every X when called from any of your nodes?
>>>>
>>>> Have you tried
>>>> mpirun -np 2 --host nodename1,nodename2 uptime
>>>> (i.e. not using the host file)?
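>>>>
>>>> To run that check against every node in one go, a quick shell loop
>>>> over the host file should do (assuming each line starts with the
>>>> node name, as in your example):
>>>>
>>>> for h in $(awk '{print $1}' hostfile); do
>>>>   mpirun -np 1 --host $h uptime
>>>> done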
>>>>
>>>> Jody
>>>>
>>>> On Tue, Apr 28, 2009 at 4:37 PM, Hugh Dickinson
>>>> <h.j.dickinson_at_[hidden]> wrote:
>>>>>
>>>>> Hi Jody,
>>>>>
>>>>> The node names are exactly the same. I wanted to avoid updating
>>>>> the
>>>>> version
>>>>> because I'm not the system administrator, and it could take some
>>>>> time
>>>>> before
>>>>> it gets done. If it's likely to fix the problem, though, I'll try
>>>>> it. I'm
>>>>> assuming that I don't have to do something analogous to the old
>>>>> "lamboot"
>>>>> command to initialise Open MPI on all the nodes. I've seen no
>>>>> documentation
>>>>> anywhere that says I should.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Hugh
>>>>>
>>>>> On 28 Apr 2009, at 15:28, jody wrote:
>>>>>
>>>>>> Hi Hugh
>>>>>>
>>>>>> Again, just to make sure: are the hostnames in your host file
>>>>>> properly resolvable?
>>>>>> I.e. when you say you can do
>>>>>> ssh nodename uptime
>>>>>> do you use exactly the same nodename in your host file?
>>>>>> (I'm trying to eliminate all non-Open-MPI error sources,
>>>>>> because with your setup it should basically work.)
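>>>>>>
>>>>>> A quick sanity check with standard tools (nothing Open-MPI
>>>>>> specific) would be to run e.g.
>>>>>>
>>>>>> getent hosts nodenameX
>>>>>>
>>>>>> on every node, and confirm that each name resolves to the same
>>>>>> address everywhere.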
>>>>>>
>>>>>> One more point to consider is updating to Open MPI 1.3.
>>>>>> I don't think your Open MPI version is the cause of your trouble,
>>>>>> but there have been quite a few changes since v1.2.5.
>>>>>>
>>>>>> Jody
>>>>>>
>>>>>> On Tue, Apr 28, 2009 at 3:22 PM, Hugh Dickinson
>>>>>> <h.j.dickinson_at_[hidden]> wrote:
>>>>>>>
>>>>>>> Hi Jody,
>>>>>>>
>>>>>>> Indeed, all the nodes are running the same version of Open MPI.
>>>>>>> Perhaps I
>>>>>>> was incorrect to describe the cluster as heterogeneous. In
>>>>>>> fact, all
>>>>>>> the
>>>>>>> nodes run the same operating system (Scientific Linux 5.2),
>>>>>>> it's only
>>>>>>> the
>>>>>>> hardware that's different and even then they're all i386 or
>>>>>>> i686. I'm
>>>>>>> also
>>>>>>> attaching the output of ompi_info --all as I've seen it's
>>>>>>> suggested in
>>>>>>> the
>>>>>>> mailing list instructions.
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Hugh
>>>>>>>
>>>>>>> Hi Hugh
>>>>>>>
>>>>>>> Just to make sure:
>>>>>>> You have installed Open-MPI on all your nodes?
>>>>>>> Same version everywhere?
>>>>>>>
>>>>>>> Jody
>>>>>>>
>>>>>>> On Tue, Apr 28, 2009 at 12:57 PM, Hugh Dickinson
>>>>>>> <h.j.dickinson_at_[hidden]> wrote:
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> First of all let me make it perfectly clear that I'm a complete
>>>>>>>> beginner
>>>>>>>> as
>>>>>>>> far as MPI is concerned, so this may well be a trivial problem!
>>>>>>>>
>>>>>>>> I've tried to set up Open MPI to use SSH to communicate
>>>>>>>> between nodes
>>>>>>>> on
>>>>>>>> a
>>>>>>>> heterogeneous cluster. I've set up passwordless SSH and it
>>>>>>>> seems to
>>>>>>>> be
>>>>>>>> working fine. For example by hand I can do:
>>>>>>>>
>>>>>>>> ssh nodename uptime
>>>>>>>>
>>>>>>>> and it returns the appropriate information for each node.
>>>>>>>> I then tried running a non-MPI program on all the nodes at
>>>>>>>> the same
>>>>>>>> time:
>>>>>>>>
>>>>>>>> mpirun -np 10 --hostfile hostfile uptime
>>>>>>>>
>>>>>>>> where hostfile is a list of the 10 cluster node names, with
>>>>>>>> slots=1 after each one, i.e.:
>>>>>>>>
>>>>>>>> nodename1 slots=1
>>>>>>>> nodename2 slots=1
>>>>>>>> etc...
>>>>>>>>
>>>>>>>> Nothing happens! The process just seems to hang. If I
>>>>>>>> interrupt the process with Ctrl-C I get:
>>>>>>>>
>>>>>>>> "
>>>>>>>>
>>>>>>>> mpirun: killing job...
>>>>>>>>
>>>>>>>> [gamma2.phyastcl.dur.ac.uk:18124] [0,0,0] ORTE_ERROR_LOG:
>>>>>>>> Timeout in
>>>>>>>> file
>>>>>>>> base/pls_base_orted_cmds.c at line 275
>>>>>>>> [gamma2.phyastcl.dur.ac.uk:18124] [0,0,0] ORTE_ERROR_LOG:
>>>>>>>> Timeout in
>>>>>>>> file
>>>>>>>> pls_rsh_module.c at line 1166
>>>>>>>>
>>>>>>>>
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> WARNING: mpirun has exited before it received notification
>>>>>>>> that all
>>>>>>>> started processes had terminated. You should double check
>>>>>>>> and ensure
>>>>>>>> that there are no runaway processes still executing.
>>>>>>>>
>>>>>>>>
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>
>>>>>>>> "
>>>>>>>>
>>>>>>>> If, instead of using the hostfile, I specify on the command
>>>>>>>> line the host from which I'm running mpirun, e.g.:
>>>>>>>>
>>>>>>>> mpirun -np 1 --host nodename uptime
>>>>>>>>
>>>>>>>> then it works (i.e. if it doesn't need to communicate with
>>>>>>>> other nodes). Do I need to tell Open MPI it should be using
>>>>>>>> SSH to communicate? If so, how do I do this? To be honest I
>>>>>>>> think it's trying to do so, because before I set up
>>>>>>>> passwordless SSH it challenged me for lots of passwords.
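>>>>>>>>
>>>>>>>> (From what I can find, the rsh/ssh launcher can apparently be
>>>>>>>> pointed explicitly at ssh with an MCA parameter, something
>>>>>>>> like
>>>>>>>>
>>>>>>>> mpirun --mca pls_rsh_agent ssh -np 10 --hostfile hostfile uptime
>>>>>>>>
>>>>>>>> in the 1.2 series, but I may well have the parameter name
>>>>>>>> wrong.)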
>>>>>>>>
>>>>>>>> I'm running Open MPI 1.2.5 as installed with Scientific Linux
>>>>>>>> 5.2. Let me reiterate: it's very likely that I've done
>>>>>>>> something stupid, so all suggestions are welcome.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Hugh
>>>>>>>>