I'm transitioning from LAM-MPI to OpenMPI and have just compiled OMPI
1.0.2 on OS X Server 10.4.6. I'm using gcc 3.3 and XLF (both f77 and
f90), and I'm using ssh to run the jobs. My cluster consists entirely
of dual G5 2GHz+ Xserves, and I am using both Ethernet ports for
communication: one for NFS and the other for MPI.

I've had few problems over the past year running this configuration
with LAM-MPI (latest release), but what worked before doesn't work
with OpenMPI 1.0.2.
When I run any parallel job that spans multiple machines, the process
runs indefinitely. I've checked this using the BLACS and PBLAS test
routines, the HPL benchmark, and even a simple mpi-pong program. All
of them start and print some initial output, then produce nothing
further while consuming 100% of the CPU on every node they are
launched on. In contrast, all of these programs finish in a few
seconds on a single node (two processors), with up to -np 8. When I
press Ctrl-C to stop the program, OpenMPI cleanly stops all the
processes, no matter how many machines were used.

I noticed a couple of postings from the past few months that seem
related, but they don't describe quite the same symptoms. Any ideas
what could be going on?

OpenMPI is a really great project, and the quality of the software
development that has gone into it is obvious. I appreciate all your
help. My config.log and omni-info.out files are attached.

Aerospace Engineering Sciences
University of Colorado