Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Test Program works on 1, 2 or 3 nodes. Hangs on 4 or more nodes.
From: Gus Correa (gus_at_[hidden])
Date: 2010-09-21 10:18:30

Prentice Bisbal wrote:
> Ethan Deneault wrote:
>> All,
>> I am running Scientific Linux 5.5, with OpenMPI 1.4 installed into the
>> /usr/lib/openmpi/1.4-gcc/ directory. I know this is typically
>> /opt/openmpi, but Red Hat does things differently. I have my PATH and
>> LD_LIBRARY_PATH set correctly, because the test program does compile and
>> run.
>> The cluster consists of 10 Intel Pentium 4 diskless nodes. The master is
>> an AMD x86_64 machine which serves the diskless node images and /home as
>> an NFS mount. I compile all of my programs as 32-bit.
>> My code is a simple hello world:
>> $ more test.f
>>       program test
>>       include 'mpif.h'
>>       integer rank, size, ierror, tag, status(MPI_STATUS_SIZE)
>>       call MPI_INIT(ierror)
>>       call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
>>       call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
>>       print*, 'node', rank, ': Hello world'
>>       call MPI_FINALIZE(ierror)
>>       end
>> If I run this program with:
>> $ mpirun --machinefile testfile ./test.out
>> node 0 : Hello world
>> node 2 : Hello world
>> node 1 : Hello world
>> This is the expected output. Here, testfile contains the master node:
>> 'pleiades', and two slave nodes: 'taygeta' and 'm43'
>> If I add another machine to testfile, say 'asterope', it hangs until I
>> ctrl-c it. I have tried every machine, and as long as I do not include
>> more than 3 hosts, the program will not hang.
>> I have run the debug-daemons flag with it as well, and I don't see what
>> is wrong specifically.
> I'm assuming you already tested ssh connectivity and verified everything
> is working as it should. (You did test all that, right?)
> This sounds like a configuration problem on one of the nodes, or a problem
> with ssh. I suspect it's not a problem with the number of processes, but
> whichever node is the 4th in your machinefile has a connectivity or
> configuration issue.
> I would try the following:
> 1. reorder the list of hosts in your machine file.
> 2. Run the mpirun command from a different host. I'd try running it from
> several different hosts.
> 3. Change your machinefile to include 4 completely different hosts.
> I think someone else recommended that you should be specifying the
> number of processes with -np. I second that.
> If the above fails, you might want to post the machine file you're using.

Hi Ethan

What your program prints is the process number (the MPI rank), not the host name.
To make sure all nodes are responding, you can try this:
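
As a minimal sketch of one such check (an assumption, not a command taken
from this thread), mpirun can launch the ordinary hostname program instead
of the MPI test, so that each process prints the name of the machine it
runs on. The process count of 4 and the PATH guard are illustrative;
testfile is the machinefile from the original post.

```shell
# Assumption: 'testfile' is the machinefile from the original post,
# extended to list four hosts.
MACHINEFILE=testfile
NP=4   # illustrative: one process per listed host

# hostname is not an MPI program; mpirun simply starts NP copies of it
# on the hosts named in the machinefile.
if command -v mpirun >/dev/null 2>&1; then
    mpirun --machinefile "$MACHINEFILE" -np "$NP" hostname
else
    echo "mpirun not found in PATH" >&2
fi
```

If this also hangs, or fewer than four host names come back, that would
point to a problem launching processes on the nodes rather than to the
Fortran program itself.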

For the hostfile/machinefile structure,
including the number of slots/cores/processors, see "man mpiexec".
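
A minimal sketch of a machinefile with explicit slot counts, using the host
names from the original post. The file name testfile4 and the value slots=1
are assumptions for illustration; slots should match the number of
processes each node is meant to run.

```shell
# Write a four-host machinefile to a hypothetical file named testfile4.
# Host names come from the original post; slots=1 is an assumed value.
cat > testfile4 <<'EOF'
pleiades slots=1
taygeta slots=1
m43 slots=1
asterope slots=1
EOF
```

The file could then be passed to mpirun with --machinefile testfile4 -np 4.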

The OpenMPI FAQ has answers for many of these initial setup questions.
Worth taking a look.

I hope it helps,
Gus Correa