
Subject: Re: [OMPI users] ORTE_ERROR_LOG: Timeout in file
From: Hugh Dickinson (h.j.dickinson_at_[hidden])
Date: 2009-04-28 11:52:49


Hi again,

I tried a simple MPI C++ program:

--
#include <iostream>
#include <mpi.h>
using namespace MPI;
using namespace std;

int main(int argc, char* argv[]) {
   int rank, size;
   Init(argc, argv);               // start the MPI runtime
   rank = COMM_WORLD.Get_rank();   // this process's rank
   size = COMM_WORLD.Get_size();   // total number of processes
   cout << "P:" << rank << " out of " << size << endl;
   Finalize();                     // shut the MPI runtime down
   return 0;
}
--
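
For reference, a sketch of the compile-and-launch steps (the file
names are placeholders; mpicxx is Open MPI's C++ compiler wrapper):
--
mpicxx hello.cpp -o hello
mpirun -np 2 --hostfile hostfile ./hello
--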
It didn't work across all the nodes - again the same problem, the
system seems to hang. However, by forcing mpirun to use only the node
on which I'm launching mpirun, I get some more error messages:
--
libibverbs: Fatal: couldn't read uverbs ABI version.
libibverbs: Fatal: couldn't read uverbs ABI version.
--------------------------------------------------------------------------
[0,1,0]: OpenIB on host gamma2 was unable to find any HCAs.
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
[0,1,1]: OpenIB on host gamma2 was unable to find any HCAs.
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
[0,1,1]: uDAPL on host gamma2 was unable to find any NICs.
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
[0,1,0]: uDAPL on host gamma2 was unable to find any NICs.
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
--
However, as before, the program does work in this special case, and I get:
--
P:0 out of 2
P:1 out of 2
--
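
These warnings mean Open MPI looked for InfiniBand (OpenIB) and uDAPL
hardware on gamma2, found none, and fell back to another transport. A
sketch of avoiding them by restricting the run to the TCP and
shared-memory transports (assuming the cluster is plain Ethernet):
--
mpirun --mca btl tcp,sm,self -np 2 ./hello
--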
Do these errors indicate a problem with the Open MPI installation?
Hugh
On 28 Apr 2009, at 16:36, Hugh Dickinson wrote:
> Hi Jody,
>
> I can passwordlessly ssh between all nodes (to and from).
> Almost none of these mpirun commands work. The only working case is  
> if nodenameX is the node from which you are running the command. I  
> don't know if this gives you extra diagnostic information, but if I  
> explicitly set the wrong prefix (using --prefix), then I get errors  
> from all the nodes telling me the daemon would not start. I don't  
> get these errors normally. It seems to me that the communication is  
> working okay, at least in the outwards direction (and from all  
> nodes). Could this be a problem with forwarding of standard output?  
> If I were to try a simple hello world program, is this more likely  
> to work, or am I just adding another layer of complexity?
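>
> (One way to check the daemon/stdout path - a sketch using mpirun's
> daemon-debugging flag, with my hostfile name as a placeholder:
>
>    mpirun --debug-daemons -np 2 --hostfile hostfile uptime
>
> If the remote daemons report starting but no output comes back,
> stdout forwarding would be the suspect.)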
>
> Cheers,
>
> Hugh
>
> On 28 Apr 2009, at 15:55, jody wrote:
>
>> Hi Hugh
>> You're right, there is no initialization command (like lamboot) you
>> have to call.
>>
>> I don't really know why your setup doesn't work, so I'm making some
>> more "blind shots".
>>
>> Can you do passwordless ssh between any two of your nodes?
>>
>> does
>>  mpirun -np 1 --host nodenameX uptime
>> work for every X when called from any of your nodes?
>>
>> Have you tried
>>    mpirun -np 2 --host nodename1,nodename2 uptime
>> (i.e. not using the host file)?
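>>
>> A sketch of exercising every node in one go (node names are
>> placeholders):
>>
>>    for X in nodename1 nodename2 nodename3; do
>>      mpirun -np 1 --host $X uptime
>>    done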
>>
>> Jody
>>
>> On Tue, Apr 28, 2009 at 4:37 PM, Hugh Dickinson
>> <h.j.dickinson_at_[hidden]> wrote:
>>> Hi Jody,
>>>
>>> The node names are exactly the same. I wanted to avoid updating the
>>> version because I'm not the system administrator, and it could take
>>> some time before it gets done. If it's likely to fix the problem,
>>> though, I'll try it. I'm assuming that I don't have to do something
>>> analogous to the old "lamboot" command to initialise Open MPI on
>>> all the nodes. I've seen no documentation anywhere that says I
>>> should.
>>>
>>> Cheers,
>>>
>>> Hugh
>>>
>>> On 28 Apr 2009, at 15:28, jody wrote:
>>>
>>>> Hi Hugh
>>>>
>>>> Again, just to make sure, are the hostnames in your host file  
>>>> well-known?
>>>> I.e. when you say you can do
>>>>  ssh nodename uptime
>>>> do you use exactly the same nodename in your host file?
>>>> (I'm trying to eliminate all non-Open-MPI error sources,
>>>> because with your setup it should basically work.)
>>>>
>>>> One more point to consider is updating to Open MPI 1.3.
>>>> I don't think your Open MPI version is the cause of your trouble,
>>>> but there have been quite a few changes since v1.2.5.
>>>>
>>>> Jody
>>>>
>>>> On Tue, Apr 28, 2009 at 3:22 PM, Hugh Dickinson
>>>> <h.j.dickinson_at_[hidden]> wrote:
>>>>>
>>>>> Hi Jody,
>>>>>
>>>>> Indeed, all the nodes are running the same version of Open MPI.
>>>>> Perhaps I was incorrect to describe the cluster as heterogeneous.
>>>>> In fact, all the nodes run the same operating system (Scientific
>>>>> Linux 5.2); it's only the hardware that's different, and even then
>>>>> they're all i386 or i686. I'm also attaching the output of
>>>>> ompi_info --all, as suggested in the mailing list instructions.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Hugh
>>>>>
>>>>> Hi Hugh
>>>>>
>>>>> Just to make sure:
>>>>> You have installed Open-MPI on all your nodes?
>>>>> Same version everywhere?
>>>>>
>>>>> Jody
>>>>>
>>>>> On Tue, Apr 28, 2009 at 12:57 PM, Hugh Dickinson
>>>>> <h.j.dickinson_at_[hidden]> wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> First of all, let me make it perfectly clear that I'm a complete
>>>>>> beginner as far as MPI is concerned, so this may well be a
>>>>>> trivial problem!
>>>>>>
>>>>>> I've tried to set up Open MPI to use SSH to communicate between
>>>>>> nodes on a heterogeneous cluster. I've set up passwordless SSH
>>>>>> and it seems to be working fine. For example, by hand I can do:
>>>>>>
>>>>>> ssh nodename uptime
>>>>>>
>>>>>> and it returns the appropriate information for each node.
>>>>>> I then tried running a non-MPI program on all the nodes at the
>>>>>> same time:
>>>>>>
>>>>>> mpirun -np 10 --hostfile hostfile uptime
>>>>>>
>>>>>> where hostfile is a list of the 10 cluster node names, with
>>>>>> slots=1 after each one, i.e.:
>>>>>>
>>>>>> nodename1 slots=1
>>>>>> nodename2 slots=1
>>>>>> etc...
>>>>>>
>>>>>> Nothing happens! The process just seems to hang. If I interrupt
>>>>>> the process with Ctrl-C I get:
>>>>>> "
>>>>>>
>>>>>> mpirun: killing job...
>>>>>>
>>>>>> [gamma2.phyastcl.dur.ac.uk:18124] [0,0,0] ORTE_ERROR_LOG: Timeout
>>>>>> in file base/pls_base_orted_cmds.c at line 275
>>>>>> [gamma2.phyastcl.dur.ac.uk:18124] [0,0,0] ORTE_ERROR_LOG: Timeout
>>>>>> in file pls_rsh_module.c at line 1166
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> WARNING: mpirun has exited before it received notification that
>>>>>> all started processes had terminated. You should double check and
>>>>>> ensure that there are no runaway processes still executing.
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>>
>>>>>> "
>>>>>>
>>>>>> If, instead of using the hostfile, I specify on the command line
>>>>>> the host from which I'm running mpirun, e.g.:
>>>>>>
>>>>>> mpirun -np 1 --host nodename uptime
>>>>>>
>>>>>> then it works (i.e. if it doesn't need to communicate with other
>>>>>> nodes). Do I need to tell Open MPI it should be using SSH to
>>>>>> communicate? If so, how do I do this? To be honest I think it's
>>>>>> trying to do so, because before I set up passwordless SSH it
>>>>>> challenged me for lots of passwords.
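>>>>>>
>>>>>> A sketch of pointing it at ssh explicitly, for what it's worth -
>>>>>> in the 1.2 series the rsh/ssh launcher is the default, and its
>>>>>> agent can be set with an MCA parameter:
>>>>>>
>>>>>> mpirun --mca pls_rsh_agent ssh -np 10 --hostfile hostfile uptime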
>>>>>>
>>>>>> I'm running Open MPI 1.2.5 installed with Scientific Linux 5.2.
>>>>>> Let me reiterate, it's very likely that I've done something
>>>>>> stupid, so all suggestions are welcome.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Hugh
>>>>>>