Subject: [OMPI users] mpirun hangs when launching job on remote node
From: Ron Babich (rbabich_at_[hidden])
Date: 2009-03-17 15:16:53

Hi Everyone,

I'm having a very basic problem getting an MPI job to run on multiple nodes.
My setup consists of two identically configured nodes, called node01 and
node02, connected via ethernet and infiniband. They are running CentOS 5.2 and
the bundled OMPI, version 1.2.5. I've attached the output of "ompi_info
--all", which is the same for both nodes.

The problem is that if I run any of the following (on node01), mpirun simply
hangs:

mpirun -np 2 -host node01,node02 uname
mpirun -host node02 uname
mpirun -host node02 -mca btl tcp,self uname
mpirun -host node02 -mca btl tcp,self,^openib uname

Of course, before running "uname" as a test, I had been trying out a simple MPI
code with the same result. At this point, to keep things simple, I'm not too
worried about getting the infiniband working. I even went so far as to unload
the infiniband kernel modules (via "/etc/init.d/openibd stop" on both nodes) to
make sure OMPI was using ethernet only.

As a sanity check, each of the following works fine:

node01:~ % mpirun uname
node01:~ % mpirun -np 2 uname
node01:~ % ssh node02 uname
node01:~ % ssh node02 mpirun -np 2 uname
node01:~ % ssh node02 echo \$PATH
node01:~ % ssh node02 echo \$LD_LIBRARY_PATH

Both $PATH and $LD_LIBRARY_PATH seem to be set correctly. There is no firewall
running on either of the nodes, and everything I've said holds true if I
reverse the roles of node01 and node02. In particular, I can ssh both ways.
The local network is specified with a simple /etc/hosts:

localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6
frontend
node01
node02
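One more check that may be worth running on each node is whether the node
names resolve to a routable address rather than a loopback one. The sketch
below is only an illustration: it assumes getent and awk are available, uses
the node names from this post, and is_loopback is a hypothetical helper that
is not part of Open MPI.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given address is a loopback address.
is_loopback() {
    case "$1" in
        127.*|::1) return 0 ;;
        *) return 1 ;;
    esac
}

# Node names taken from this post; adjust for your own cluster.
for h in node01 node02; do
    # getent resolves names the way the system resolver would
    # (typically /etc/hosts first, then DNS).
    addr=$(getent hosts "$h" 2>/dev/null | awk '{print $1; exit}')
    if [ -z "$addr" ]; then
        echo "$h: does not resolve"
    elif is_loopback "$addr"; then
        echo "$h: resolves to loopback address $addr"
    else
        echo "$h: resolves to $addr"
    fi
done
```

Running this on both node01 and node02 shows what each node thinks the other's
address is, which is not necessarily what ssh alone reveals.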

When I try any of the above mpirun commands, orted on node02 seems to start
successfully, but nothing happens. For example, if I run the following on
node01:

node01:~ % mpirun -host node02 uname

it hangs, and on node02 I get:

node02:~ % ps aux | grep orted
rbabich 7741 0.0 0.0 75656 1868 ? Ss 14:53 0:00
/usr/lib64/openmpi/1.2.5-gcc/bin/orted --bootproxy 1 --name 0.0.1 --num_procs 2
--vpid_start 0 --nodename node02 --universe
rbabich_at_node01:default-universe-8105 --nsreplica
0.0.0;tcp:// --gprreplica 0.0.0;tcp://

Any ideas?