Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] ORTE_ERROR_LOG: Timeout in file
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-04-28 15:12:30


In this instance, OMPI is complaining that you are attempting to use
InfiniBand, but no suitable devices are found.

I assume you have Ethernet between your nodes? Can you run this with the
following added to your mpirun command line:

-mca btl tcp,self

That will cause OMPI to ignore the InfiniBand subsystem and attempt to run
via TCP over any available Ethernet.
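
For illustration, here is a rough sketch of what the full command might look
like. It assumes the hostfile name (hostfile), process count (10), and test
command (uptime) from the original report quoted below; substitute your own
values. The same MCA parameter can also be supplied through the OMPI_MCA_btl
environment variable rather than on the command line.

```shell
# Assumed values, taken from the original report: adjust for your cluster.
HOSTFILE=hostfile
NP=10

# Restrict Open MPI to the TCP and self (loopback) BTLs, skipping InfiniBand.
CMD="mpirun -mca btl tcp,self -np $NP --hostfile $HOSTFILE uptime"
echo "$CMD"

# Alternative: the same setting through the environment.
export OMPI_MCA_btl=tcp,self

# Only launch if mpirun and the hostfile are actually present.
if command -v mpirun >/dev/null 2>&1 && [ -f "$HOSTFILE" ]; then
    $CMD
fi
```

If the job still hangs with only tcp and self selected, the interconnect is
probably not the cause and the daemon launch itself is the next thing to check.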

On Tue, Apr 28, 2009 at 12:16 PM, Hugh Dickinson <h.j.dickinson_at_[hidden]> wrote:

> Many thanks for your help nonetheless.
>
> Hugh
>
>
> On 28 Apr 2009, at 17:23, jody wrote:
>
>> Hi Hugh
>>
>> I'm sorry, but I must admit that I have never encountered these messages,
>> and I don't know exactly what causes them.
>>
>> Perhaps one of the developers can give an explanation?
>>
>> Jody
>>
>> On Tue, Apr 28, 2009 at 5:52 PM, Hugh Dickinson
>> <h.j.dickinson_at_[hidden]> wrote:
>>
>>> Hi again,
>>>
>>> I tried a simple mpi c++ program:
>>>
>>> --
>>> #include <iostream>
>>> #include <mpi.h>
>>>
>>> using namespace MPI;
>>> using namespace std;
>>>
>>> int main(int argc, char* argv[]) {
>>>     int rank, size;
>>>     Init(argc, argv);               // start MPI (C++ bindings)
>>>     rank = COMM_WORLD.Get_rank();   // this process's rank
>>>     size = COMM_WORLD.Get_size();   // total number of processes
>>>     cout << "P:" << rank << " out of " << size << endl;
>>>     Finalize();                     // shut MPI down cleanly
>>>     return 0;
>>> }
>>> --
>>> It didn't work over all the nodes, again same problem - the system seems
>>> to hang. However, by forcing mpirun to use only the node on which I'm
>>> launching mpirun I get some more error messages
>>>
>>> --
>>> libibverbs: Fatal: couldn't read uverbs ABI version.
>>> libibverbs: Fatal: couldn't read uverbs ABI version.
>>>
>>> --------------------------------------------------------------------------
>>> [0,1,0]: OpenIB on host gamma2 was unable to find any HCAs.
>>> Another transport will be used instead, although this may result in
>>> lower performance.
>>>
>>> --------------------------------------------------------------------------
>>>
>>> --------------------------------------------------------------------------
>>> [0,1,1]: OpenIB on host gamma2 was unable to find any HCAs.
>>> Another transport will be used instead, although this may result in
>>> lower performance.
>>>
>>> --------------------------------------------------------------------------
>>>
>>> --------------------------------------------------------------------------
>>> [0,1,1]: uDAPL on host gamma2 was unable to find any NICs.
>>> Another transport will be used instead, although this may result in
>>> lower performance.
>>>
>>> --------------------------------------------------------------------------
>>>
>>> --------------------------------------------------------------------------
>>> [0,1,0]: uDAPL on host gamma2 was unable to find any NICs.
>>> Another transport will be used instead, although this may result in
>>> lower performance.
>>>
>>> --------------------------------------------------------------------------
>>> --
>>>
>>> However, as before the program does work in this special case, and I get:
>>> --
>>> P:0 out of 2
>>> P:1 out of 2
>>> --
>>>
>>> Do these errors indicate a problem with the Open MPI installation?
>>>
>>> Hugh
>>>
>>> On 28 Apr 2009, at 16:36, Hugh Dickinson wrote:
>>>
>>>> Hi Jody,
>>>>
>>>> I can passwordlessly ssh between all nodes (to and from).
>>>> Almost none of these mpirun commands work. The only working case is if
>>>> nodenameX is the node from which you are running the command. I don't
>>>> know if this gives you extra diagnostic information, but if I explicitly
>>>> set the wrong prefix (using --prefix), then I get errors from all the
>>>> nodes telling me the daemon would not start. I don't get these errors
>>>> normally. It seems to me that the communication is working okay, at least
>>>> in the outwards direction (and from all nodes). Could this be a problem
>>>> with forwarding of standard output? If I were to try a simple hello world
>>>> program, is this more likely to work, or am I just adding another layer
>>>> of complexity?
>>>>
>>>> Cheers,
>>>>
>>>> Hugh
>>>>
>>>> On 28 Apr 2009, at 15:55, jody wrote:
>>>>
>>>>> Hi Hugh
>>>>>
>>>>> You're right, there is no initialization command (like lamboot) you
>>>>> have to call.
>>>>>
>>>>> I don't really know why your setup doesn't work, so I'm taking some
>>>>> more "blind shots":
>>>>>
>>>>> Can you do passwordless ssh between any two of your nodes?
>>>>>
>>>>> Does
>>>>> mpirun -np 1 --host nodenameX uptime
>>>>> work for every X when called from any of your nodes?
>>>>>
>>>>> Have you tried
>>>>> mpirun -np 2 --host nodename1,nodename2 uptime
>>>>> (i.e. not using the host file)?
>>>>>
>>>>> Jody
>>>>>
>>>>> On Tue, Apr 28, 2009 at 4:37 PM, Hugh Dickinson
>>>>> <h.j.dickinson_at_[hidden]> wrote:
>>>>>
>>>>>>
>>>>>> Hi Jody,
>>>>>>
>>>>>> The node names are exactly the same. I wanted to avoid updating the
>>>>>> version because I'm not the system administrator, and it could take some
>>>>>> time before it gets done. If it's likely to fix the problem though I'll
>>>>>> try it. I'm assuming that I don't have to do something analogous to the
>>>>>> old "lamboot" command to initialise Open MPI on all the nodes. I've seen
>>>>>> no documentation anywhere that says I should.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Hugh
>>>>>>
>>>>>> On 28 Apr 2009, at 15:28, jody wrote:
>>>>>>
>>>>>>> Hi Hugh
>>>>>>>
>>>>>>> Again, just to make sure, are the hostnames in your host file
>>>>>>> well-known?
>>>>>>> I.e. when you say you can do
>>>>>>> ssh nodename uptime
>>>>>>> do you use exactly the same nodename in your host file?
>>>>>>> (I'm trying to eliminate all non-Open-MPI error sources,
>>>>>>> because with your setup it should basically work.)
>>>>>>>
>>>>>>> One more point to consider is to update to Open-MPI 1.3.
>>>>>>> I don't think your Open-MPI version is the cause of your trouble,
>>>>>>> but there have been quite a few changes since v1.2.5.
>>>>>>> Jody
>>>>>>>
>>>>>>> On Tue, Apr 28, 2009 at 3:22 PM, Hugh Dickinson
>>>>>>> <h.j.dickinson_at_[hidden]> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> Hi Jody,
>>>>>>>>
>>>>>>>> Indeed, all the nodes are running the same version of Open MPI.
>>>>>>>> Perhaps I was incorrect to describe the cluster as heterogeneous. In
>>>>>>>> fact, all the nodes run the same operating system (Scientific Linux
>>>>>>>> 5.2), it's only the hardware that's different and even then they're
>>>>>>>> all i386 or i686. I'm also attaching the output of ompi_info --all as
>>>>>>>> I've seen it's suggested in the mailing list instructions.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Hugh
>>>>>>>>
>>>>>>>> Hi Hugh
>>>>>>>>
>>>>>>>> Just to make sure:
>>>>>>>> You have installed Open-MPI on all your nodes?
>>>>>>>> Same version everywhere?
>>>>>>>>
>>>>>>>> Jody
>>>>>>>>
>>>>>>>> On Tue, Apr 28, 2009 at 12:57 PM, Hugh Dickinson
>>>>>>>> <h.j.dickinson_at_[hidden]> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> First of all let me make it perfectly clear that I'm a complete
>>>>>>>>> beginner as far as MPI is concerned, so this may well be a trivial
>>>>>>>>> problem!
>>>>>>>>>
>>>>>>>>> I've tried to set up Open MPI to use SSH to communicate between nodes
>>>>>>>>> on a heterogeneous cluster. I've set up passwordless SSH and it seems
>>>>>>>>> to be working fine. For example by hand I can do:
>>>>>>>>>
>>>>>>>>> ssh nodename uptime
>>>>>>>>>
>>>>>>>>> and it returns the appropriate information for each node.
>>>>>>>>> I then tried running a non-MPI program on all the nodes at the same time:
>>>>>>>>>
>>>>>>>>> mpirun -np 10 --hostfile hostfile uptime
>>>>>>>>>
>>>>>>>>> where hostfile is a list of the 10 cluster node names with slots=1
>>>>>>>>> after each one, i.e.
>>>>>>>>>
>>>>>>>>> nodename1 slots=1
>>>>>>>>> nodename2 slots=1
>>>>>>>>> etc...
>>>>>>>>>
>>>>>>>>> Nothing happens! The process just seems to hang. If I interrupt the
>>>>>>>>> process with Ctrl-C I get:
>>>>>>>>>
>>>>>>>>> "
>>>>>>>>>
>>>>>>>>> mpirun: killing job...
>>>>>>>>>
>>>>>>>>> [gamma2.phyastcl.dur.ac.uk:18124] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
>>>>>>>>> [gamma2.phyastcl.dur.ac.uk:18124] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1166
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> WARNING: mpirun has exited before it received notification that all
>>>>>>>>> started processes had terminated. You should double check and ensure
>>>>>>>>> that there are no runaway processes still executing.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>> "
>>>>>>>>>
>>>>>>>>> If, instead of using the hostfile, I specify on the command line the
>>>>>>>>> host from which I'm running mpirun, e.g.:
>>>>>>>>>
>>>>>>>>> mpirun -np 1 --host nodename uptime
>>>>>>>>>
>>>>>>>>> then it works (i.e. if it doesn't need to communicate with other
>>>>>>>>> nodes). Do I need to tell Open MPI it should be using SSH to
>>>>>>>>> communicate? If so, how do I do this? To be honest I think it's trying
>>>>>>>>> to do so, because before I set up passwordless SSH it challenged me for
>>>>>>>>> lots of passwords.
>>>>>>>>>
>>>>>>>>> I'm running Open MPI 1.2.5 installed with Scientific Linux 5.2. Let me
>>>>>>>>> reiterate, it's very likely that I've done something stupid, so all
>>>>>>>>> suggestions are welcome.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>
>>>>>>>>> Hugh
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> users mailing list
>>>>>>>>> users_at_[hidden]
>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users