Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] ORTE_ERROR_LOG: Timeout in file
From: Hugh Dickinson (h.j.dickinson_at_[hidden])
Date: 2009-04-28 11:52:49


Hi again,

I tried a simple MPI C++ program:

--
#include <iostream>
#include <mpi.h>

using namespace MPI;
using namespace std;

int main(int argc, char* argv[]) {
   int rank, size;
   Init(argc, argv);               // start up the MPI environment
   rank = COMM_WORLD.Get_rank();   // this process's rank
   size = COMM_WORLD.Get_size();   // total number of processes
   cout << "P:" << rank << " out of " << size << endl;
   Finalize();                     // shut down MPI before exiting
   return 0;
}
--
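(For reference, I built and launched it roughly as follows - the exact
compiler wrapper may be mpic++ or mpiCC depending on the installation,
and ./hello is just what I called the binary:)
--
# compile with the Open MPI C++ wrapper
mpic++ hello.cpp -o hello
# run over all the nodes listed in the hostfile (this is the case that hangs)
mpirun -np 10 --hostfile hostfile ./hello
# run restricted to the local node only (this is the case that works)
mpirun -np 2 --host gamma2 ./hello
--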
It didn't work over all the nodes; the same problem again, with the
system seeming to hang. However, by forcing mpirun to use only the node
on which I'm launching mpirun, I get some more error messages:
--
libibverbs: Fatal: couldn't read uverbs ABI version.
libibverbs: Fatal: couldn't read uverbs ABI version.
--------------------------------------------------------------------------
[0,1,0]: OpenIB on host gamma2 was unable to find any HCAs.
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
[0,1,1]: OpenIB on host gamma2 was unable to find any HCAs.
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
[0,1,1]: uDAPL on host gamma2 was unable to find any NICs.
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
[0,1,0]: uDAPL on host gamma2 was unable to find any NICs.
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
--
However, as before, the program does work in this special case, and I get:
--
P:0 out of 2
P:1 out of 2
--
Do these errors indicate a problem with the Open MPI installation?
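(In case it's useful: since the messages above say another transport
will be used anyway, I believe the warnings themselves can be silenced
by restricting Open MPI to the TCP, shared-memory and self transports
via the btl MCA parameter, e.g.:
--
mpirun --mca btl tcp,sm,self -np 2 --host gamma2 ./hello
--
though I haven't checked whether that has any bearing on the hang when
running across nodes.)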
Hugh
On 28 Apr 2009, at 16:36, Hugh Dickinson wrote:
> Hi Jody,
>
> I can passwordlessly ssh between all nodes (to and from).
> Almost none of these mpirun commands work. The only working case is  
> if nodenameX is the node from which you are running the command. I  
> don't know if this gives you extra diagnostic information, but if I  
> explicitly set the wrong prefix (using --prefix), then I get errors  
> from all the nodes telling me the daemon would not start. I don't  
> get these errors normally. It seems to me that the communication is  
> working okay, at least in the outwards direction (and from all  
> nodes). Could this be a problem with forwarding of standard output?  
> If I were to try a simple hello world program, is this more likely  
> to work, or am I just adding another layer of complexity?
>
> Cheers,
>
> Hugh
>
> On 28 Apr 2009, at 15:55, jody wrote:
>
>> Hi Hugh
>> You're right, there is no initialization command (like lamboot)  you
>> have to call.
>>
>> I don't really know why your setup doesn't work, so I'm making some
>> more "blind shots"
>>
>> can you do passwordless ssh between any two of your nodes?
>>
>> does
>>  mpirun -np 1 --host nodenameX uptime
>> work for every X when called from any of your nodes?
>>
>> Have you tried
>>    mpirun -np 2 --host nodename1,nodename2 uptime
>> (i.e. not using the host file)
>>
>> Jody
>>
>> On Tue, Apr 28, 2009 at 4:37 PM, Hugh Dickinson
>> <h.j.dickinson_at_[hidden]> wrote:
>>> Hi Jody,
>>>
>>> The node names are exactly the same. I wanted to avoid updating the
>>> version because I'm not the system administrator, and it could take
>>> some time before it gets done. If it's likely to fix the problem
>>> though I'll try it. I'm assuming that I don't have to do something
>>> analogous to the old "lamboot" command to initialise Open MPI on all
>>> the nodes. I've seen no documentation anywhere that says I should.
>>>
>>> Cheers,
>>>
>>> Hugh
>>>
>>> On 28 Apr 2009, at 15:28, jody wrote:
>>>
>>>> Hi Hugh
>>>>
>>>> Again, just to make sure, are the hostnames in your host file  
>>>> well-known?
>>>> I.e. when you say you can do
>>>>  ssh nodename uptime
>>>> do you use exactly the same nodename in your host file?
>>>> (I'm trying to eliminate all non-Open-MPI error sources,
>>>> because with your setup it should basically work.)
>>>>
>>>> One more point to consider is to update to Open-MPI 1.3.
>>>> I don't think your Open-MPI version is the cause of your trouble,
>>>> but there have been quite a few changes since v1.2.5.
>>>>
>>>> Jody
>>>>
>>>> On Tue, Apr 28, 2009 at 3:22 PM, Hugh Dickinson
>>>> <h.j.dickinson_at_[hidden]> wrote:
>>>>>
>>>>> Hi Jody,
>>>>>
>>>>> Indeed, all the nodes are running the same version of Open MPI.
>>>>> Perhaps I was incorrect to describe the cluster as heterogeneous.
>>>>> In fact, all the nodes run the same operating system (Scientific
>>>>> Linux 5.2); it's only the hardware that's different, and even then
>>>>> they're all i386 or i686. I'm also attaching the output of
>>>>> ompi_info --all, as I've seen it's suggested in the mailing list
>>>>> instructions.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Hugh
>>>>>
>>>>> Hi Hugh
>>>>>
>>>>> Just to make sure:
>>>>> You have installed Open-MPI on all your nodes?
>>>>> Same version everywhere?
>>>>>
>>>>> Jody
>>>>>
>>>>> On Tue, Apr 28, 2009 at 12:57 PM, Hugh Dickinson
>>>>> <h.j.dickinson_at_[hidden]> wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> First of all let me make it perfectly clear that I'm a complete
>>>>>> beginner as far as MPI is concerned, so this may well be a
>>>>>> trivial problem!
>>>>>>
>>>>>> I've tried to set up Open MPI to use SSH to communicate between
>>>>>> nodes on a heterogeneous cluster. I've set up passwordless SSH
>>>>>> and it seems to be working fine. For example, by hand I can do:
>>>>>>
>>>>>> ssh nodename uptime
>>>>>>
>>>>>> and it returns the appropriate information for each node.
>>>>>> I then tried running a non-MPI program on all the nodes at the
>>>>>> same time:
>>>>>>
>>>>>> mpirun -np 10 --hostfile hostfile uptime
>>>>>>
>>>>>> where hostfile is a list of the 10 cluster node names with
>>>>>> slots=1 after each one, i.e.
>>>>>>
>>>>>> nodename1 slots=1
>>>>>> nodename2 slots=1
>>>>>> etc...
>>>>>>
>>>>>> Nothing happens! The process just seems to hang. If I interrupt
>>>>>> the process with Ctrl-C I get:
>>>>>>
>>>>>> "
>>>>>>
>>>>>> mpirun: killing job...
>>>>>>
>>>>>> [gamma2.phyastcl.dur.ac.uk:18124] [0,0,0] ORTE_ERROR_LOG: Timeout
>>>>>> in file base/pls_base_orted_cmds.c at line 275
>>>>>> [gamma2.phyastcl.dur.ac.uk:18124] [0,0,0] ORTE_ERROR_LOG: Timeout
>>>>>> in file pls_rsh_module.c at line 1166
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> WARNING: mpirun has exited before it received notification that
>>>>>> all started processes had terminated.  You should double check
>>>>>> and ensure that there are no runaway processes still executing.
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>>
>>>>>> "
>>>>>>
>>>>>> If, instead of using the hostfile, I specify on the command line
>>>>>> the host from which I'm running mpirun, e.g.:
>>>>>>
>>>>>> mpirun -np 1 --host nodename uptime
>>>>>>
>>>>>> then it works (i.e. if it doesn't need to communicate with other
>>>>>> nodes). Do I need to tell Open MPI it should be using SSH to
>>>>>> communicate? If so, how do I do this? To be honest I think it's
>>>>>> trying to do so, because before I set up passwordless SSH it
>>>>>> challenged me for lots of passwords.
>>>>>>
>>>>>> I'm running Open MPI 1.2.5 installed with Scientific Linux 5.2.
>>>>>> Let me reiterate, it's very likely that I've done something
>>>>>> stupid, so all suggestions are welcome.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Hugh
>>>>>>