On May 4, 2013, at 4:54 PM, Angel de Vicente <angelv_at_[hidden]> wrote:
> I have used OpenMPI before without any troubles, and configured MPICH,
> MPICH2 and OpenMPI in many different machines before, but recently we
> upgraded the OS to Fedora 17, and now I'm having trouble running an MPI
> code in two of our machines connected via a switch.
> I thought perhaps the old installation was giving problems, so I
> reinstalled OpenMPI (1.6.4) and I have no trouble when running a
> parallel code in just one node. I also don't have any trouble ssh'ing
> (without need for password) between these machines, but when I try to
> run a parallel job spanning both machines, I get a hung mpiexec
> process in the submitting machine, and an "orted" process in the other
> machine, but nothing moves.
> I guess it is an issue with libraries and/or different MPI versions (the
> machines have other site-wide MPI libraries installed), but I'm not sure
> how to debug the issue. I looked in the FAQ, but I didn't find anything
> relevant. Issue
> http://www.open-mpi.org/faq/?category=running#intel-compilers-static is
> different, since I don't get any warning or errors when running, just
> all processes stuck.
> Is there any way to dump details of what OpenMPI is trying to do in each
> node, so I can see if it is looking for different libraries in each
> node, or something similar?
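What I do is simply run "ssh <node> ompi_info -V" against each remote node and compare the results; you should get the same answer everywhere. Because the command goes through a non-interactive ssh login, it also shows which installation the remote PATH actually picks up.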
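If there are more than a couple of nodes, that comparison can be scripted. The following is only a minimal sketch, assuming passwordless ssh already works and that ompi_info is on the PATH of a non-interactive shell on every node. The function names and the "node1"/"node2" hostnames are made up for illustration.

```python
import subprocess
from collections import defaultdict


def ompi_version_on(host, runner=subprocess.run):
    """Return what `ompi_info -V` prints when run on `host` over ssh.

    BatchMode=yes makes ssh fail instead of prompting for a password,
    so a misconfigured node shows up as an error rather than a hang.
    """
    result = runner(
        ["ssh", "-o", "BatchMode=yes", host, "ompi_info", "-V"],
        capture_output=True,
        text=True,
        timeout=30,
    )
    if result.returncode != 0:
        # e.g. ompi_info not found in the remote non-interactive PATH
        return "ERROR: " + result.stderr.strip()
    return result.stdout.strip()


def group_hosts_by_version(hosts, runner=subprocess.run):
    """Map each distinct `ompi_info -V` output to the hosts that reported it."""
    groups = defaultdict(list)
    for host in hosts:
        groups[ompi_version_on(host, runner)].append(host)
    return dict(groups)


# Hypothetical usage (hostnames are placeholders):
#   groups = group_hosts_by_version(["node1", "node2"])
#   for version, hosts in groups.items():
#       print(hosts, "->", version)
# More than one key in `groups` means the nodes disagree.
```

A single group means every node resolves the same Open MPI; more than one group, or an "ERROR:" entry, points at the node whose environment differs.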
Another option in these situations is to configure Open MPI with --enable-orterun-prefix-by-default. If you install in the same location on each node (e.g., on an NFS mount), this ensures that every node uses that same library.
> Ángel de Vicente