Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] ORTE_ERROR_LOG: Timeout in file
From: jody (jody.xha_at_[hidden])
Date: 2009-04-28 12:23:00


Hi Hugh

I'm sorry, but I must admit that I have never encountered these messages,
and I don't know exactly what causes them.

Perhaps one of the developers can give an explanation?

Jody

On Tue, Apr 28, 2009 at 5:52 PM, Hugh Dickinson
<h.j.dickinson_at_[hidden]> wrote:
> Hi again,
>
> I tried a simple MPI C++ program:
>
> --
> #include <iostream>
> #include <mpi.h>
>
> using namespace MPI;
> using namespace std;
>
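> int main(int argc, char* argv[]) {
>   int rank, size;
>   Init(argc, argv);                // start the MPI runtime
>   rank = COMM_WORLD.Get_rank();    // index of this process
>   size = COMM_WORLD.Get_size();    // total number of processes
>   cout << "P:" << rank << " out of " << size << endl;
>   Finalize();                      // shut MPI down cleanly
>   return 0;
> }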
> --
> It didn't work across all the nodes; it's the same problem again, and the
> system seems to hang. However, when I force mpirun to use only the node
> from which I'm launching it, I get some more error messages:
>
> --
> libibverbs: Fatal: couldn't read uverbs ABI version.
> libibverbs: Fatal: couldn't read uverbs ABI version.
> --------------------------------------------------------------------------
> [0,1,0]: OpenIB on host gamma2 was unable to find any HCAs.
> Another transport will be used instead, although this may result in
> lower performance.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> [0,1,1]: OpenIB on host gamma2 was unable to find any HCAs.
> Another transport will be used instead, although this may result in
> lower performance.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> [0,1,1]: uDAPL on host gamma2 was unable to find any NICs.
> Another transport will be used instead, although this may result in
> lower performance.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> [0,1,0]: uDAPL on host gamma2 was unable to find any NICs.
> Another transport will be used instead, although this may result in
> lower performance.
> --------------------------------------------------------------------------
> --
>
> However, as before the program does work in this special case, and I get:
> --
> P:0 out of 2
> P:1 out of 2
> --
>
> Do these errors indicate a problem with the Open MPI installation?
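
The messages above say that no InfiniBand (OpenIB) or uDAPL hardware was found and that another transport is used instead. One way to check whether they matter, assuming the nodes are linked by ordinary Ethernet, is to request only the TCP and self (loopback) transports through the btl MCA parameter, so that the other transports should no longer be probed. This is only a sketch: ./hello is a hypothetical name for the compiled test program, and the RUN variable is an illustrative convention that makes the script print the command instead of executing it.

```shell
#!/bin/sh
# Sketch: launch with only the TCP and loopback ("self") transports.
# Assumption: ./hello is a hypothetical name for the compiled test program.
# By default the command is only printed; run the script with RUN= (empty)
# in the environment to execute it for real.
RUN=${RUN-echo}

run_tcp_only() {
    # $1 = number of processes, $2 = program to launch
    $RUN mpirun --mca btl tcp,self -np "$1" "$2"
}

run_tcp_only 2 ./hello
```

If the program prints the same output and the warnings are gone, the warnings were most likely harmless notices about absent hardware rather than a sign of a broken installation.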
>
> Hugh
>
> On 28 Apr 2009, at 16:36, Hugh Dickinson wrote:
>
>> Hi Jody,
>>
>> I can ssh without a password between all nodes (to and from).
>> Almost none of these mpirun commands work. The only working case is if
>> nodenameX is the node from which you are running the command. I don't know
>> if this gives you extra diagnostic information, but if I explicitly set the
>> wrong prefix (using --prefix), then I get errors from all the nodes telling
>> me the daemon would not start. I don't get these errors normally. It seems
>> to me that the communication is working okay, at least in the outwards
>> direction (and from all nodes). Could this be a problem with forwarding of
>> standard output? If I were to try a simple hello world program, is this more
>> likely to work, or am I just adding another layer of complexity?
>>
>> Cheers,
>>
>> Hugh
>>
>> On 28 Apr 2009, at 15:55, jody wrote:
>>
>>> Hi Hugh
>>> You're right, there is no initialization command (like lamboot) that
>>> you have to call.
>>>
>>> I don't really know why your setup doesn't work, so I'm taking some
>>> more "blind shots":
>>>
>>> Can you do passwordless ssh between any two of your nodes?
>>>
>>> Does
>>>  mpirun -np 1 --host nodenameX uptime
>>> work for every X when called from any of your nodes?
>>>
>>> Have you tried
>>>   mpirun -np 2 --host nodename1,nodename2  uptime
>>> (i.e. not using the host file)?
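
The checks above can be sketched as a small script: one single-host launch per node, then one two-host launch per pair, all without a host file. The node names in HOSTS are placeholders, and the RUN variable is an illustrative convention that makes the script print each command instead of executing it.

```shell
#!/bin/sh
# Sketch of the checks above. HOSTS holds placeholder node names.
# By default commands are only printed; run the script with RUN= (empty)
# in the environment to execute them for real.
RUN=${RUN-echo}
HOSTS=${HOSTS:-nodename1 nodename2 nodename3}

check_single() {
    # $HOSTS is deliberately unquoted so it splits into one name per word.
    for h in $HOSTS; do
        $RUN mpirun -np 1 --host "$h" uptime
    done
}

check_pairs() {
    for a in $HOSTS; do
        for b in $HOSTS; do
            # Skip pairing a host with itself.
            [ "$a" = "$b" ] && continue
            $RUN mpirun -np 2 --host "$a,$b" uptime
        done
    done
}

check_single
check_pairs
```

To cover "when called from any of your nodes", the script would have to be run once on each node. The first command that hangs or fails narrows down which host or pair is the problem.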
>>>
>>> Jody
>>>
>>> On Tue, Apr 28, 2009 at 4:37 PM, Hugh Dickinson
>>> <h.j.dickinson_at_[hidden]> wrote:
>>>>
>>>> Hi Jody,
>>>>
>>>> The node names are exactly the same. I wanted to avoid updating the
>>>> version because I'm not the system administrator, and it could take some
>>>> time before it gets done. If it's likely to fix the problem though I'll
>>>> try it. I'm assuming that I don't have to do something analogous to the
>>>> old "lamboot" command to initialise Open MPI on all the nodes. I've seen
>>>> no documentation anywhere that says I should.
>>>>
>>>> Cheers,
>>>>
>>>> Hugh
>>>>
>>>> On 28 Apr 2009, at 15:28, jody wrote:
>>>>
>>>>> Hi Hugh
>>>>>
>>>>> Again, just to make sure, are the hostnames in your host file
>>>>> well-known?
>>>>> I.e. when you say you can do
>>>>>  ssh nodename uptime
>>>>> do you use exactly the same nodename in your host file?
>>>>> (I'm trying to eliminate all non-Open-MPI error sources,
>>>>> because with your setup it should basically work.)
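
A quick way to test exactly the names in the host file is to read them from the file and try each one over ssh. The sketch below assumes the file is called hostfile and has one node per line with the name in the first column; the RUN variable is an illustrative convention that makes the script print each command instead of executing it.

```shell
#!/bin/sh
# Sketch: try "ssh <name> uptime" for exactly the names in the host file.
# Assumptions: the file is called "hostfile", one node per line, with the
# node name in the first column (e.g. "nodename1 slots=1").
# By default commands are only printed; run the script with RUN= (empty)
# in the environment to execute them for real.
RUN=${RUN-echo}

hostfile_names() {
    # Print the first field of every non-empty, non-comment line.
    awk 'NF > 0 && $1 !~ /^#/ { print $1 }' "$1"
}

check_ssh() {
    for h in $(hostfile_names "$1"); do
        # BatchMode=yes makes ssh fail instead of prompting for a password.
        $RUN ssh -o BatchMode=yes "$h" uptime
    done
}

if [ -r hostfile ]; then
    check_ssh hostfile
fi
```

Any name that fails here, even though a slightly different spelling works by hand, points to a problem outside Open MPI.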
>>>>>
>>>>> One more point to consider is updating to Open MPI 1.3.
>>>>> I don't think your Open MPI version is the cause of your trouble,
>>>>> but there have been quite a few changes since v1.2.5.
>>>>>
>>>>> Jody
>>>>>
>>>>> On Tue, Apr 28, 2009 at 3:22 PM, Hugh Dickinson
>>>>> <h.j.dickinson_at_[hidden]> wrote:
>>>>>>
>>>>>> Hi Jody,
>>>>>>
>>>>>> Indeed, all the nodes are running the same version of Open MPI.
>>>>>> Perhaps I was incorrect to describe the cluster as heterogeneous. In
>>>>>> fact, all the nodes run the same operating system (Scientific Linux
>>>>>> 5.2); it's only the hardware that's different, and even then they're
>>>>>> all i386 or i686. I'm also attaching the output of ompi_info --all, as
>>>>>> I've seen it suggested in the mailing list instructions.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Hugh
>>>>>>
>>>>>> Hi Hugh
>>>>>>
>>>>>> Just to make sure:
>>>>>> You have installed Open-MPI on all your nodes?
>>>>>> Same version everywhere?
>>>>>>
>>>>>> Jody
>>>>>>
>>>>>> On Tue, Apr 28, 2009 at 12:57 PM, Hugh Dickinson
>>>>>> <h.j.dickinson_at_[hidden]> wrote:
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> First of all let me make it perfectly clear that I'm a complete
>>>>>>> beginner as far as MPI is concerned, so this may well be a trivial
>>>>>>> problem!
>>>>>>>
>>>>>>> I've tried to set up Open MPI to use SSH to communicate between nodes
>>>>>>> on a heterogeneous cluster. I've set up passwordless SSH and it seems
>>>>>>> to be working fine. For example by hand I can do:
>>>>>>>
>>>>>>> ssh nodename uptime
>>>>>>>
>>>>>>> and it returns the appropriate information for each node.
>>>>>>> I then tried running a non-MPI program on all the nodes at the same
>>>>>>> time:
>>>>>>>
>>>>>>> mpirun -np 10 --hostfile hostfile uptime
>>>>>>>
>>>>>>> Where hostfile is a list of the 10 cluster node names with slots=1
>>>>>>> after each one, i.e.
>>>>>>>
>>>>>>> nodename1 slots=1
>>>>>>> nodename2 slots=1
>>>>>>> etc...
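
A host file of the kind described above, with one slot per node, can be generated rather than typed by hand. This is a minimal sketch; the node names are placeholders and make_hostfile is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: print a host file with one slot per node.
# Assumption: the node names passed in are placeholders.
make_hostfile() {
    for h in "$@"; do
        printf '%s slots=1\n' "$h"
    done
}

# Example: print a three-node host file to standard output.
make_hostfile nodename1 nodename2 nodename3
```

Redirecting the output to a file (for example `> hostfile`) gives something that can be passed to `mpirun --hostfile hostfile`.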
>>>>>>>
>>>>>>> Nothing happens! The process just seems to hang. If I interrupt the
>>>>>>> process with Ctrl-C I get:
>>>>>>>
>>>>>>> "
>>>>>>>
>>>>>>> mpirun: killing job...
>>>>>>>
>>>>>>> [gamma2.phyastcl.dur.ac.uk:18124] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
>>>>>>> [gamma2.phyastcl.dur.ac.uk:18124] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1166
>>>>>>>
>>>>>>>
>>>>>>> --------------------------------------------------------------------------
>>>>>>> WARNING: mpirun has exited before it received notification that all
>>>>>>> started processes had terminated.  You should double check and ensure
>>>>>>> that there are no runaway processes still executing.
>>>>>>>
>>>>>>>
>>>>>>> --------------------------------------------------------------------------
>>>>>>>
>>>>>>> "
>>>>>>>
>>>>>>> If, instead of using the hostfile, I specify on the command line the
>>>>>>> host from which I'm running mpirun, e.g.:
>>>>>>>
>>>>>>> mpirun -np 1 --host nodename uptime
>>>>>>>
>>>>>>> then it works (i.e. if it doesn't need to communicate with other
>>>>>>> nodes). Do I need to tell Open MPI it should be using SSH to
>>>>>>> communicate? If so, how do I do this? To be honest I think it's trying
>>>>>>> to do so, because before I set up passwordless SSH it challenged me for
>>>>>>> lots of passwords.
>>>>>>>
>>>>>>> I'm running Open MPI 1.2.5 installed with Scientific Linux 5.2. Let me
>>>>>>> reiterate, it's very likely that I've done something stupid, so all
>>>>>>> suggestions are welcome.
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Hugh
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> users_at_[hidden]
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users