My apologies for not sending this with the initial email. I've enclosed
our cluster setup: the ifconfig output, LD_LIBRARY_PATH, PATH, and
fstab for the three compute nodes and for the file server itself. The
LD_LIBRARY_PATH, PATH, and fstab settings are identical across the
compute nodes, so they are included only once. I've also included the
config.log, which shows that both f77 and f90 support are disabled.
That was done so threading could be enabled, in an attempt to combat
the mca_btl_tcp_endpoint_complete_connect problem, but it did not help.
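For reference, the build described above corresponds to a configure line
roughly like the following sketch. The install prefix is a placeholder,
and --enable-mpi-threads is the 1.4-series threading option; this is an
approximation, not a verbatim copy of the actual invocation:

```shell
# Sketch of the Open MPI 1.4.1 configure described above.
# The prefix is a placeholder; --enable-mpi-threads is assumed to be
# the threading flag used, not copied from the actual build log.
./configure --prefix=/opt/openmpi-1.4.1 \
    --disable-dlopen \
    --enable-static --disable-shared \
    --enable-mpi-threads \
    --disable-mpi-f77 --disable-mpi-f90
```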
Robert Collyer wrote:
> I'm running on a Fedora Core 9 Linux cluster with the mpi and home
> directories mounted on the compute nodes via NFS. Since the
> executables are on a remote server, I have configured Open MPI with
> --disable-dlopen and have even gone as far as enabling static
> libraries and disabling shared ones. In the process of trying to work around this
> problem, I upgraded from openmpi 1.3.3 to 1.4.1. Also, the binaries
> were compiled with gcc 4.3.0, and the interconnect is ssh over ethernet.
> Running from the fileserver, which is practically identical to the
> compute nodes, I can run the c++ hello world (examples/hello_cxx.cc)
> on up to three machines including the fileserver, but only two if the
> fileserver is not included in the hostname list. In other words, either
> mpirun -H filesrv,node1,node2 cpphello
> or
> mpirun -H node1,node2 cpphello
> works correctly for any number of processes. However, beyond the
> two remote node limit, the application just hangs. orted shows up on
> the remote systems, but nothing happens. Additionally, if I launch
> from any of the compute nodes, any run involving a remote node hangs
> in the same way. Incidentally, this behavior is not limited to hello
> world; it also occurs with non-MPI programs such as hostname.
> Similarly, when I run the C hello world (examples/hello_c.c), I get
> the same hanging behavior, but I also get
> mca_btl_tcp_endpoint_complete_connect "no route to host" errors, even
> though the processes appear to complete successfully; I then have to
> kill the overall mpirun process with Ctrl-C.
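Since the "no route to host" errors come from the TCP BTL opening direct
node-to-node connections, an all-pairs reachability check between the
compute nodes can narrow this down. A sketch, assuming the hostnames
node1 through node3 resolve on every node (adjust to the real names):

```shell
# All-pairs reachability sketch for the TCP BTL (hostnames are
# placeholders). MPI processes connect directly to one another, so
# every node must be able to reach every other node, not just the
# launch host.
for src in node1 node2 node3; do
  for dst in node1 node2 node3; do
    [ "$src" = "$dst" ] && continue
    if ssh "$src" "ping -c 1 -W 2 $dst > /dev/null 2>&1"; then
      echo "$src -> $dst ok"
    else
      echo "$src -> $dst UNREACHABLE"
    fi
  done
done
```

Any UNREACHABLE pair would explain the hang once a third node joins the
job, independent of NFS.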
> As a further note, when testing this I also ran the boost::mpi tests,
> and I noticed that the all_gather_test process would eventually start
> remotely, but would peg the processor and never return. I have not
> noticed this occur with the hello world programs.
> Since they run better from the fileserver, I suspect it has something
> to do with the NFS mount, but I have no idea how to test that or
> what to do about it.
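One way to test the NFS theory is to take the mount out of the picture
entirely: copy the binary to node-local disk on every node and launch it
by absolute path. A sketch, with hostnames and paths as placeholders:

```shell
# Sketch: rule NFS in or out by running from local disk instead of the
# NFS-mounted home directory. Hostnames and paths are assumptions.
for n in node1 node2 node3; do
  scp cpphello "$n:/tmp/cpphello"
done
mpirun -H node1,node2,node3 /tmp/cpphello
```

If this still hangs with three nodes, the NFS mount is probably not the
culprit and the problem is more likely in the node-to-node networking.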
> Any help would be greatly appreciated.
> users mailing list