Okay, that's one small step forward. You can lock that in by setting the appropriate MCA parameter in one of two ways:

1. Add the following line to your default MCA parameter file:  btl = tcp,sm,self (I added the shared memory subsystem, sm, as this will help performance between processes on the same node). You can see how to do this here:

http://www.open-mpi.org/faq/?category=tuning#setting-mca-params

2. Add OMPI_MCA_btl=tcp,sm,self to the environment in your .cshrc (or equivalent) file.
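
As a rough sketch of the two options above: this assumes a Bourne-style shell and the per-user parameter file at $HOME/.openmpi/mca-params.conf, which is the location the FAQ describes. The helper name set_btl_default is made up for illustration and is not part of Open MPI.

```shell
#!/bin/sh
# Sketch only. set_btl_default is a hypothetical helper, not part of Open MPI.

# Option 1: record the setting in the per-user MCA parameter file.
set_btl_default() {
    conf_dir="$HOME/.openmpi"
    conf_file="$conf_dir/mca-params.conf"
    mkdir -p "$conf_dir" || return 1
    # Append only if the file has no btl line yet, so reruns do not duplicate it.
    if ! grep -q '^btl[[:space:]]*=' "$conf_file" 2>/dev/null; then
        echo "btl = tcp,sm,self" >> "$conf_file"
    fi
}
# Example (not run here): set_btl_default

# Option 2: set it through the environment instead (sh/bash syntax shown;
# the csh/tcsh form for a .cshrc would be: setenv OMPI_MCA_btl tcp,sm,self).
export OMPI_MCA_btl=tcp,sm,self
```

Either option should be enough on its own. The environment variable has to be visible in the shell that launches mpirun.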

Next, have you looked at the following FAQ:

http://www.open-mpi.org/faq/?category=running#missing-prereqs

Are those things all okay? Have you tried providing a complete absolute path when running mpirun (e.g., using /usr/local/openmpi/bin/mpirun instead of just mpirun on the cmd line)?
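
One way to check those prerequisites is to look at what a non-interactive ssh login actually sees on each node, since that is the environment the remote daemon starts in. The sketch below is only an illustration: remote_env_report is a hypothetical helper, the node names are placeholders, and it uses which and printenv so that it does not depend on the remote login shell being sh or csh.

```shell
#!/bin/sh
# Sketch only: show what a non-interactive ssh login sees on each node.
# remote_env_report is a hypothetical helper; pass it your own node names.
remote_env_report() {
    for node in "$@"; do
        echo "== $node =="
        # The quoted command runs on the remote node, in its login shell.
        ssh "$node" 'which mpirun orted; printenv PATH; printenv LD_LIBRARY_PATH'
    done
}

# Example (not run here): remote_env_report nodename1 nodename2
# Absolute-path variant of the launch, using the path from the message above:
#   /usr/local/openmpi/bin/mpirun -np 2 --host nodename1,nodename2 uptime
```

If mpirun or orted is not found on a node, or LD_LIBRARY_PATH there lacks the Open MPI lib directory, that points at the shell startup files rather than at Open MPI itself.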

Another thing to try: add --debug-devel to the cmd line and send us the (probably verbose) output.


On Apr 28, 2009, at 3:19 PM, Hugh Dickinson wrote:

Hi,

Yes I'm using ethernet connections. Doing as you suggest removes the errors generated by running the small test program, but still doesn't allow programs (including the small test program) to execute on any node other than the one launching mpirun. If I try to do that, the command hangs until I interrupt it, whereupon it gives the same timeout errors. It seems that there must be some problem with the setup of my Open MPI installation. Do you agree, and do you have any idea what it is? Also, is there a global settings file so I can instruct Open MPI to always only try ethernet?

Cheers,

Hugh

On 28 Apr 2009, at 20:12, Ralph Castain wrote:

In this instance, OMPI is complaining that you are attempting to use Infiniband, but no suitable devices are found.

I assume you have Ethernet between your nodes? Can you run this with the following added to your mpirun cmd line:

-mca btl tcp,self

That will cause OMPI to ignore the Infiniband subsystem and attempt to run via TCP over any available Ethernet.



On Tue, Apr 28, 2009 at 12:16 PM, Hugh Dickinson <h.j.dickinson@durham.ac.uk> wrote:
Many thanks for your help nonetheless.

Hugh


On 28 Apr 2009, at 17:23, jody wrote:

Hi Hugh

I'm sorry, but I must admit that I have never encountered these messages,
and I don't know exactly what causes them.

Perhaps one of the developers can give an explanation?

Jody

On Tue, Apr 28, 2009 at 5:52 PM, Hugh Dickinson
<h.j.dickinson@durham.ac.uk> wrote:
Hi again,

I tried a simple MPI C++ program:

--
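#include <iostream>
#include <mpi.h>

using namespace MPI;
using namespace std;

int main(int argc, char* argv[]) {
    int rank, size;
    Init(argc, argv);                  // start MPI (C++ bindings, namespace MPI)
    rank = COMM_WORLD.Get_rank();      // this process's rank
    size = COMM_WORLD.Get_size();      // total number of processes
    cout << "P:" << rank << " out of " << size << endl;
    Finalize();                        // shut MPI down cleanly
    return 0;
}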
--
It didn't work over all the nodes; again the same problem: the system seems to
hang. However, by forcing mpirun to use only the node on which I'm
launching mpirun, I get some more error messages:

--
libibverbs: Fatal: couldn't read uverbs ABI version.
libibverbs: Fatal: couldn't read uverbs ABI version.
--------------------------------------------------------------------------
[0,1,0]: OpenIB on host gamma2 was unable to find any HCAs.
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
[0,1,1]: OpenIB on host gamma2 was unable to find any HCAs.
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
[0,1,1]: uDAPL on host gamma2 was unable to find any NICs.
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
[0,1,0]: uDAPL on host gamma2 was unable to find any NICs.
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
--

However, as before the program does work in this special case, and I get:
--
P:0 out of 2
P:1 out of 2
--

Do these errors indicate a problem with the Open MPI installation?

Hugh

On 28 Apr 2009, at 16:36, Hugh Dickinson wrote:

Hi Jody,

I can passwordlessly ssh between all nodes (to and from).
Almost none of these mpirun commands work. The only working case is if
nodenameX is the node from which you are running the command. I don't know
if this gives you extra diagnostic information, but if I explicitly set the
wrong prefix (using --prefix), then I get errors from all the nodes telling
me the daemon would not start. I don't get these errors normally. It seems
to me that the communication is working okay, at least in the outwards
direction (and from all nodes). Could this be a problem with forwarding of
standard output? If I were to try a simple hello world program, is this more
likely to work, or am I just adding another layer of complexity?

Cheers,

Hugh

On 28 Apr 2009, at 15:55, jody wrote:

Hi Hugh
You're right, there is no initialization command (like lamboot) that you
have to call.

I don't really know why your setup doesn't work, so I'm taking some
more "blind shots".

Can you do passwordless ssh between any two of your nodes?

Does
 mpirun -np 1 --host nodenameX uptime
work for every X when called from any of your nodes?

Have you tried
 mpirun -np 2 --host nodename1,nodename2 uptime
(i.e. not using the host file)?
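
The per-node check above could be scripted roughly as follows. This is a sketch under assumptions: probe_hosts and the MPIRUN override are hypothetical conveniences, the host file is taken to have one host name as the first field of each line, and GNU timeout is assumed to be installed (older distributions may not ship it).

```shell
#!/bin/sh
# Sketch only: run "uptime" through mpirun on each host, one at a time.
# probe_hosts and the MPIRUN override are hypothetical conveniences.
MPIRUN="${MPIRUN:-mpirun}"

probe_hosts() {
    hostfile="$1"
    # Take the first field of each non-empty, non-comment line as the host name.
    # A for loop is used so that mpirun cannot swallow the rest of the host list
    # from stdin, as it could inside a "while read" loop.
    for host in $(awk 'NF && $1 !~ /^#/ { print $1 }' "$hostfile"); do
        # timeout (GNU coreutils) stops a hung launch after 30 seconds.
        if timeout 30 "$MPIRUN" -np 1 --host "$host" uptime >/dev/null 2>&1; then
            echo "$host: ok"
        else
            echo "$host: FAILED or timed out"
        fi
    done
}

# Example (not run here): probe_hosts hostfile
```

Running it from several different nodes shows whether the failure depends on the launching node or on the target node.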

Jody

On Tue, Apr 28, 2009 at 4:37 PM, Hugh Dickinson
<h.j.dickinson@durham.ac.uk> wrote:

Hi Jody,

The node names are exactly the same. I wanted to avoid updating the version
because I'm not the system administrator, and it could take some time before
it gets done. If it's likely to fix the problem though I'll try it. I'm
assuming that I don't have to do something analogous to the old "lamboot"
command to initialise Open MPI on all the nodes. I've seen no documentation
anywhere that says I should.

Cheers,

Hugh

On 28 Apr 2009, at 15:28, jody wrote:

Hi Hugh

Again, just to make sure, are the hostnames in your host file well-known?
I.e. when you say you can do
 ssh nodename uptime
do you use exactly the same nodename in your host file?
(I'm trying to eliminate all non-Open-MPI error sources,
because with your setup it should basically work.)

One more point to consider is to update to Open-MPI 1.3.
I don't think your Open-MPI version is the cause of your trouble,
but there have been quite a few changes since v1.2.5.

Jody

On Tue, Apr 28, 2009 at 3:22 PM, Hugh Dickinson
<h.j.dickinson@durham.ac.uk> wrote:

Hi Jody,

Indeed, all the nodes are running the same version of Open MPI. Perhaps I
was incorrect to describe the cluster as heterogeneous. In fact, all the
nodes run the same operating system (Scientific Linux 5.2), it's only the
hardware that's different and even then they're all i386 or i686. I'm also
attaching the output of ompi_info --all as I've seen it's suggested in the
mailing list instructions.

Cheers,

Hugh

Hi Hugh

Just to make sure:
You have installed Open-MPI on all your nodes?
Same version everywhere?

Jody

On Tue, Apr 28, 2009 at 12:57 PM, Hugh Dickinson
<h.j.dickinson_at_[hidden]> wrote:

Hi all,

First of all let me make it perfectly clear that I'm a complete beginner as
far as MPI is concerned, so this may well be a trivial problem!

I've tried to set up Open MPI to use SSH to communicate between nodes on a
heterogeneous cluster. I've set up passwordless SSH and it seems to be
working fine. For example by hand I can do:

ssh nodename uptime

and it returns the appropriate information for each node.
I then tried running a non-MPI program on all the nodes at the same time:

mpirun -np 10 --hostfile hostfile uptime

Where hostfile is a list of the 10 cluster node names with slots=1 after
each one, i.e.

nodename1 slots=1
nodename2 slots=1
etc...

Nothing happens! The process just seems to hang. If I interrupt the process
with Ctrl-C I get:

"

mpirun: killing job...

[gamma2.phyastcl.dur.ac.uk:18124] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[gamma2.phyastcl.dur.ac.uk:18124] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1166
--------------------------------------------------------------------------
WARNING: mpirun has exited before it received notification that all
started processes had terminated.  You should double check and ensure
that there are no runaway processes still executing.
--------------------------------------------------------------------------

"

If, instead of using the hostfile, I specify on the command line the host
from which I'm running mpirun, e.g.:

mpirun -np 1 --host nodename uptime

then it works (i.e. if it doesn't need to communicate with other nodes). Do
I need to tell Open MPI it should be using SSH to communicate? If so, how do
I do this? To be honest I think it's trying to do so, because before I set
up passwordless SSH it challenged me for lots of passwords.

I'm running Open MPI 1.2.5 installed with Scientific Linux 5.2. Let me
reiterate, it's very likely that I've done something stupid, so all
suggestions are welcome.

Cheers,

Hugh

_______________________________________________
users mailing list
users_at_[hidden]
http://www.open-mpi.org/mailman/listinfo.cgi/users

