I have used OpenMPI before without any trouble, and I have configured
MPICH, MPICH2 and OpenMPI on many different machines, but we recently
upgraded the OS to Fedora 17, and now I'm having trouble running an MPI
code on two of our machines connected via a switch.
I thought perhaps the old installation was causing problems, so I
reinstalled OpenMPI (1.6.4), and I have no trouble running a parallel
code on just one node. I also have no trouble ssh'ing (without a
password) between these machines, but when I try to run a parallel job
spanning both machines, I get a hung mpiexec process on the submitting
machine and an "orted" process on the other machine, and nothing
progresses.
I guess it is an issue with libraries and/or different MPI versions (the
machines have other site-wide MPI libraries installed), but I'm not sure
how to debug the issue. I looked in the FAQ, but I didn't find anything
that matches my case, since I don't get any warnings or errors when
running; all processes are simply stuck.
Is there any way to dump details of what OpenMPI is trying to do on each
node, so I can see whether it is looking for different libraries on each
node, or something similar?
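
To make the question concrete, this is the kind of check I have in mind:
a minimal Python sketch, not a definitive tool, that runs a few probes on
each node through a non-interactive ssh shell (the kind of shell used
when orted is started remotely) and reports which probes give different
output. The host names passed on the command line, the probe list, and
the helper names are all my own assumptions for illustration.

```python
import subprocess
import sys

# Probes run on every node; the list is an assumption, extend as needed.
PROBES = [
    "command -v mpiexec",
    "command -v orted",
    "echo $PATH",
    "echo $LD_LIBRARY_PATH",
]


def build_ssh_command(host, probe):
    """Return the argv list that runs one probe on one host over ssh."""
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    return ["ssh", "-o", "BatchMode=yes", host, probe]


def run_probe(host, probe, timeout=15):
    """Run a probe and return its stripped stdout, or a marker string."""
    try:
        result = subprocess.run(
            build_ssh_command(host, probe),
            capture_output=True, text=True, timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return "<failed: %s>" % exc
    return result.stdout.strip() or "<empty>"


def differing_probes(report):
    """Return the probes whose output is not identical on all hosts."""
    return [probe for probe, by_host in report.items()
            if len(set(by_host.values())) > 1]


def main(hosts):
    if not hosts:
        print("usage: compare_nodes.py HOST [HOST ...]")
        return
    report = {p: {h: run_probe(h, p) for h in hosts} for p in PROBES}
    for probe, by_host in report.items():
        print(probe)
        for host, output in by_host.items():
            print("  %s: %s" % (host, output))
    print("probes that differ:", differing_probes(report))


if __name__ == "__main__":
    main(sys.argv[1:])
```

Run with the two host names as arguments (the script name
compare_nodes.py is hypothetical). A difference in the mpiexec or orted
path, or in LD_LIBRARY_PATH, would point at one of the other site-wide
MPI installations being picked up on one of the nodes.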
Ángel de Vicente