I'm running on a Fedora 9 Linux cluster with the Open MPI installation and
home directories mounted on the compute nodes via NFS. Since the executables
are on a remote server, I have configured Open MPI with --disable-dlopen and
have even gone as far as enabling static libraries and disabling shared
ones. In the process of trying to work around this problem, I upgraded from
Open MPI 1.3.3 to 1.4.1. Also, the binaries were compiled with GCC 4.3.0,
and the interconnect is SSH over Ethernet.
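For reference, the build was configured along these lines (a sketch; the
install prefix is a placeholder for my NFS-exported directory, but the
flags are the ones I mean):

# prefix below is a placeholder; --disable-dlopen, --enable-static, and
# --disable-shared are the actual configure options
./configure --prefix=/opt/openmpi --disable-dlopen --enable-static --disable-shared
make all install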
Running from the fileserver, which is practically identical to the
compute nodes, I can run the C++ hello world (examples/hello_cxx.cc) on
up to three machines including the fileserver, but on only two if the
fileserver is not in the host list. In other words, either

mpirun -H filesrv,node1,node2 cpphello

or

mpirun -H node1,node2 cpphello

functions correctly for any number of processes. However, beyond that
two-remote-node limit, the application just hangs: orted shows up on
the remote systems, but nothing happens. Additionally, if I launch from
any of the compute nodes, any run that involves a remote node hangs in
the same way. Incidentally, this behavior is not limited to hello
world; it also occurs with non-MPI programs such as hostname. When I
run the C hello world (examples/hello_c.c), I get the same hanging
behavior, but I also get mca_btl_tcp_endpoint_complete_connect "no
route to host" errors, even though the processes appear to complete
successfully. Even so, I need to kill the overall mpirun process via
Ctrl-C.
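For illustration, a launch like this (node names stand in for my real
hosts) is exactly what hangs; orted appears on each remote node but
nothing ever comes back:

mpirun -H node1,node2,node3 hostname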
As a further note, while testing this I also ran the boost::mpi tests
and noticed that the all_gather_test process would eventually start
remotely, but it would peg the processor and never return. I have not
seen that happen with the hello world programs.
Since jobs run better when launched from the fileserver, I suspect this
has something to do with the NFS mount, but I'm not sure how to test
that conclusively, or what to do about it.
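The only check I can think of, and I don't know whether it's
conclusive, would be to take NFS out of the picture by copying the
binary to a local directory on each node and launching that copy
instead, roughly:

# /tmp paths are placeholders; assumes each node has local scratch space
for h in node1 node2 node3; do scp cpphello $h:/tmp/; done
mpirun -H node1,node2,node3 /tmp/cpphello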
Any help would be greatly appreciated.