Open MPI User's Mailing List Archives

Subject: [OMPI users] mpirun hangs when launching job on remote node
From: Ron Babich (rbabich_at_[hidden])
Date: 2009-03-17 15:16:53


Hi Everyone,

I'm having a very basic problem getting an MPI job to run on multiple nodes.
My setup consists of two identically configured nodes, node01 and node02,
connected via Ethernet and InfiniBand. Both run CentOS 5.2 and the bundled
Open MPI (OMPI), version 1.2.5. I've attached the output of "ompi_info
--all", which is the same on both nodes.

The problem is that if I run any of the following (on node01), mpirun simply
hangs:

mpirun -np 2 -host node01,node02 uname
mpirun -host node02 uname
mpirun -host node02 -mca btl tcp,self uname
mpirun -host node02 -mca btl tcp,self,^openib uname

Before falling back to "uname" as a test, I had been trying a simple MPI
program, with the same result. At this point, to keep things simple, I'm not too
worried about getting InfiniBand working. I even went so far as to unload
the InfiniBand kernel modules (via "/etc/init.d/openibd stop" on both nodes) to
make sure OMPI was using Ethernet only.

As a sanity check, each of the following works fine:

node01:~ % mpirun uname
Linux
node01:~ % mpirun -np 2 uname
Linux
Linux
node01:~ % ssh node02 uname
Linux
node01:~ % ssh node02 mpirun -np 2 uname
Linux
Linux
node01:~ % ssh node02 echo \$PATH
/usr/lib64/openmpi/1.2.5-gcc/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/lib64/openmpi/1.2.5-gcc/bin:/home/rbabich/bin:.
node01:~ % ssh node02 echo \$LD_LIBRARY_PATH
/usr/lib64/openmpi/1.2.5-gcc/lib:/usr/local/cuda/lib

Both $PATH and $LD_LIBRARY_PATH seem to be set correctly. There is no firewall
running on either of the nodes, and everything I've said holds true if I
reverse the roles of node01 and node02. In particular, I can ssh both ways.
The local network is specified with a simple /etc/hosts:

127.0.0.1 localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6

192.168.0.1 frontend
192.168.0.101 node01
192.168.0.102 node02
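
In case it is useful for comparing the two nodes, here is a minimal Python sketch that parses text in /etc/hosts format and reports the address each name maps to. The function name parse_hosts and the embedded HOSTS string are only illustrative (the string is a copy of the file above); nothing here is part of Open MPI.

```python
def parse_hosts(text):
    """Map each hostname or alias to the first address listed for it."""
    mapping = {}
    for line in text.splitlines():
        # Drop comments and surrounding whitespace; skip blank lines.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            # Keep the first match, as a top-to-bottom lookup would.
            mapping.setdefault(name, address)
    return mapping


# Illustrative copy of the /etc/hosts contents shown above.
HOSTS = """\
127.0.0.1 localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6

192.168.0.1 frontend
192.168.0.101 node01
192.168.0.102 node02
"""

table = parse_hosts(HOSTS)
for name in ("node01", "node02"):
    print(name, table[name])
```

To check a live machine, the same function could be given the contents of /etc/hosts read on each node.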

When I try any of the failing mpirun commands listed at the top, orted on
node02 seems to start successfully, but then nothing further happens. For
example, if I run the following on node01:

node01:~ % mpirun -host node02 uname

it hangs, and on node02 I get:

node02:~ % ps aux | grep orted
rbabich 7741 0.0 0.0 75656 1868 ? Ss 14:53 0:00
/usr/lib64/openmpi/1.2.5-gcc/bin/orted --bootproxy 1 --name 0.0.1 --num_procs 2
--vpid_start 0 --nodename node02 --universe
rbabich@node01:default-universe-8105 --nsreplica
0.0.0;tcp://192.168.0.101:52342 --gprreplica 0.0.0;tcp://192.168.0.101:52342
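
The --nsreplica and --gprreplica arguments above carry the address orted was given for contacting mpirun, here 192.168.0.101 (node01 in /etc/hosts) on port 52342. As a rough sketch, assuming the argument always has the "tcp://HOST:PORT" form seen in this one example, the following Python pulls that endpoint out of the command line. The helper name replica_endpoint is made up for illustration, and the command line is abbreviated from the ps output.

```python
import re


def replica_endpoint(cmdline, flag="--nsreplica"):
    """Return (host, port) from '<flag> <name>;tcp://HOST:PORT', or None."""
    # Assumption: an IPv4 address and numeric port, as in the ps output above.
    pattern = re.escape(flag) + r"\s+\S*?tcp://([0-9.]+):(\d+)"
    match = re.search(pattern, cmdline)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


# Abbreviated copy of the orted command line reported by ps on node02.
cmdline = (
    "orted --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0 "
    "--nodename node02 --nsreplica 0.0.0;tcp://192.168.0.101:52342 "
    "--gprreplica 0.0.0;tcp://192.168.0.101:52342"
)

print(replica_endpoint(cmdline))
print(replica_endpoint(cmdline, flag="--gprreplica"))
```

The port changes from run to run, so it has to be read from the orted process of the run that is currently hanging.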

Any ideas?

Thanks,
Ron