Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Test Program works on 1, 2 or 3 nodes. Hangs on 4 or more nodes.
From: Prentice Bisbal (prentice_at_[hidden])
Date: 2010-09-21 09:53:58

Ethan Deneault wrote:
> All,
> I am running Scientific Linux 5.5, with OpenMPI 1.4 installed into the
> /usr/lib/openmpi/1.4-gcc/ directory. I know this is typically
> /opt/openmpi, but Red Hat does things differently. I have my PATH and
> LD_LIBRARY_PATH set correctly, because the test program does compile and
> run.
> The cluster consists of 10 Intel Pentium 4 diskless nodes. The master is
> an AMD x86_64 machine which serves the diskless node images and /home as
> an NFS mount. I compile all of my programs as 32-bit.
> My code is a simple hello world:
> $ more test.f
>       program test
>       include 'mpif.h'
>       integer rank, size, ierror, tag, status(MPI_STATUS_SIZE)
>       call MPI_INIT(ierror)
>       call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
>       call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
>       print*, 'node', rank, ': Hello world'
>       call MPI_FINALIZE(ierror)
>       end
> If I run this program with:
> $ mpirun --machinefile testfile ./test.out
> node 0 : Hello world
> node 2 : Hello world
> node 1 : Hello world
> This is the expected output. Here, testfile contains the master node:
> 'pleiades', and two slave nodes: 'taygeta' and 'm43'
> If I add another machine to testfile, say 'asterope', it hangs until I
> ctrl-c it. I have tried every machine, and as long as I do not include
> more than 3 hosts, the program will not hang.
> I have run the debug-daemons flag with it as well, and I don't see what
> is wrong specifically.

I'm assuming you already tested ssh connectivity and verified everything
is working as it should. (You did test all that, right?)
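
In case it helps, here is a rough sketch of how that ssh check could be scripted. The function names list_hosts and check_ssh are made up for this example, and it assumes one host per line in the machinefile and passwordless (key-based) ssh. It only tests logins from the machine you run it on, so it would not catch every possible problem.

```shell
#!/bin/sh
# list_hosts FILE: print the first field (the host name) of each
# machinefile line, skipping blank lines and '#' comment lines.
list_hosts() {
    awk '$1 != "" && $1 !~ /^#/ { print $1 }' "$1"
}

# check_ssh FILE: attempt a non-interactive ssh login to every host.
# BatchMode=yes makes ssh fail instead of prompting for a password;
# ConnectTimeout=5 gives up on an unreachable host after 5 seconds.
check_ssh() {
    for host in $(list_hosts "$1"); do
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
        fi
    done
}
```

After sourcing the file, "check_ssh testfile" prints one line per host. Repeating it from a couple of the compute nodes, not just the master, would also show whether node-to-node logins work.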

This sounds like a configuration problem on one of the nodes, or a problem
with ssh. I suspect the problem is not the number of processes; rather,
whichever node is fourth in your machinefile has a connectivity or
configuration issue.

I would try the following:

1. Reorder the list of hosts in your machine file.

2. Run the mpirun command from a different host. I'd try running it from
several different hosts.

3. Change your machinefile to include 4 completely different hosts.
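
Steps 1 and 3 can be automated to some degree. The following is only a sketch: rotate_hosts and try_rotations are hypothetical helper names, machinefile.try is a scratch file name chosen for the example, and ./test.out is the binary from your message. It rotates the host list one position at a time and runs the test on the first few hosts of each rotation, so every host ends up in every position.

```shell
#!/bin/sh
# rotate_hosts FILE N: print the hosts from FILE with the first N
# moved to the end (blank lines and '#' comments are ignored).
rotate_hosts() {
    awk -v n="$2" '$1 != "" && $1 !~ /^#/ { h[++c] = $1 }
        END { for (i = 0; i < c; i++) print h[(i + n) % c + 1] }' "$1"
}

# try_rotations FILE NP: for each rotation of the host list, write the
# first NP hosts to machinefile.try and run the test program on them.
try_rotations() {
    total=$(rotate_hosts "$1" 0 | awk 'END { print NR }')
    i=0
    while [ "$i" -lt "$total" ]; do
        rotate_hosts "$1" "$i" | head -n "$2" > machinefile.try
        echo "== attempt $i: $(tr '\n' ' ' < machinefile.try)"
        mpirun -np "$2" --machinefile machinefile.try ./test.out
        i=$((i + 1))
    done
}
```

For example, "try_rotations testfile 4" runs one 4-host job per rotation. A run that hangs still has to be interrupted with ctrl-c; the host list echoed just before it shows which combination was in use.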

I think someone else recommended that you specify the number of
processes with -np. I second that.

If the above fails, you might want to post the machine file you're using.