
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Help diagnosing problem: not being able to run MPI code across computers
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-05-04 20:29:21

On May 4, 2013, at 4:54 PM, Angel de Vicente <angelv_at_[hidden]> wrote:

> Hi,
> I have used Open MPI before without any trouble, and have configured MPICH,
> MPICH2 and Open MPI on many different machines, but recently we
> upgraded the OS to Fedora 17, and now I'm having trouble running an MPI
> code on two of our machines connected via a switch.
> I thought perhaps the old installation was causing problems, so I
> reinstalled Open MPI (1.6.4), and I have no trouble running a
> parallel code on just one node. I also have no trouble ssh'ing
> (without a password) between these machines, but when I try to
> run a parallel job spanning both machines, I get a hung mpiexec
> process on the submitting machine and an "orted" process on the other
> machine, but nothing moves.
> I guess it is an issue with libraries and/or different MPI versions (the
> machines have other site-wide MPI libraries installed), but I'm not sure
> how to debug the issue. I looked in the FAQ, but I didn't find anything
> relevant. The closest issue there is different, since I don't get any
> warnings or errors when running; all the processes are just stuck.
> Is there any way to dump details of what OpenMPI is trying to do in each
> node, so I can see if it is looking for different libraries in each
> node, or something similar?

What I do is simply run "ssh <node> ompi_info -V" for each remote node and compare the results - you should get the same answer everywhere.
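A quick sketch of that check, as a loop over your nodes ("node1" and "node2" are placeholder hostnames; the get_version function here simulates the ssh output so the comparison logic can be shown self-contained - in real use it would just run ssh):

```shell
# Compare the Open MPI version reported on each node.
get_version() {
  # Real use:  ssh "$1" ompi_info -V | head -1
  # Simulated output for illustration:
  case "$1" in
    node1) echo "Open MPI v1.6.4" ;;
    node2) echo "Open MPI v1.6.4" ;;
  esac
}

# Collect one version line per node, deduplicated.
versions=$(for node in node1 node2; do get_version "$node"; done | sort -u)
count=$(printf '%s\n' "$versions" | wc -l)

if [ "$count" -eq 1 ]; then
  echo "versions match: $versions"
else
  echo "version mismatch:"
  printf '%s\n' "$versions"
fi
```

If more than one distinct line comes back, the nodes are picking up different installations.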

Another option in these situations is to configure with --enable-orterun-prefix-by-default. If you install in the same location on each node (e.g., on an NFS mount), this ensures every node uses that same library.
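For example, a rebuild along these lines (the install prefix here is just an example path; adjust to your site):

```shell
# Rebuild Open MPI so orterun bakes in its own install prefix on remote nodes.
./configure --prefix=/opt/openmpi-1.6.4 --enable-orterun-prefix-by-default
make all install
```

With that flag, mpirun passes its prefix to the remote orted daemons, so each node resolves the same Open MPI installation.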

> Thanks,
> --
> Ángel de Vicente
> _______________________________________________
> users mailing list
> users_at_[hidden]