Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Test Program works on 1, 2 or 3 nodes. Hangs on 4 or more nodes.
From: Gus Correa (gus_at_[hidden])
Date: 2010-09-21 10:18:30


Prentice Bisbal wrote:
> Ethan Deneault wrote:
>> All,
>>
>> I am running Scientific Linux 5.5, with OpenMPI 1.4 installed into the
>> /usr/lib/openmpi/1.4-gcc/ directory. I know this is typically
>> /opt/openmpi, but Red Hat does things differently. My PATH and
>> LD_LIBRARY_PATH must be set correctly, because the test program does
>> compile and run.
>>
>> The cluster consists of 10 Intel Pentium 4 diskless nodes. The master is
>> an AMD x86_64 machine which serves the diskless node images and /home as
>> an NFS mount. I compile all of my programs as 32-bit.
>>
>> My code is a simple hello world:
>> $ more test.f
>> program test
>>
>> include 'mpif.h'
>> integer rank, size, ierror, tag, status(MPI_STATUS_SIZE)
>>
>> call MPI_INIT(ierror)
>> call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
>> call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
>> print*, 'node', rank, ': Hello world'
>> call MPI_FINALIZE(ierror)
>> end
>>
>> If I run this program with:
>>
>> $ mpirun --machinefile testfile ./test.out
>> node 0 : Hello world
>> node 2 : Hello world
>> node 1 : Hello world
>>
>> This is the expected output. Here, testfile contains the master node:
>> 'pleiades', and two slave nodes: 'taygeta' and 'm43'
>>
>> If I add another machine to testfile, say 'asterope', it hangs until I
>> ctrl-c it. I have tried every machine, and as long as I do not include
>> more than 3 hosts, the program will not hang.
>>
>> I have run the debug-daemons flag with it as well, and I don't see what
>> is wrong specifically.
>>
>
> I'm assuming you already tested ssh connectivity and verified everything
> is working as it should. (You did test all that, right?)
>
> This sounds like a configuration problem on one of the nodes, or a problem
> with ssh. I suspect it's not a problem with the number of processes, but
> that whichever node is the 4th in your machinefile has a connectivity or
> configuration issue.
>
> I would try the following:
>
> 1. reorder the list of hosts in your machine file.
>
> 2. Run the mpirun command from a different host. I'd try running it from
> several different hosts.
>
> 3. Change your machinefile to include 4 completely different hosts.
>
> I think someone else recommended that you should be specifying the
> number of processes with -np. I second that.
>
> If the above fails, you might want to post the machine file you're using.
>

Hi Ethan

What your program prints is process number, not the host name.
To make sure all nodes are responding, you can try this:

http://www.open-mpi.org/faq/?category=running#mpirun-host
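
Another way to see which hosts actually respond is to have each process
print the name of the machine it runs on. The following is only a minimal
sketch based on your test.f, not a tested fix: the program name hostcheck
and the variables hostname and namelen are my own additions, and it assumes
the same fixed-form source and mpif.h that you already use.

```fortran
      program hostcheck
!     Variant of test.f that reports the host name of each rank.
      include 'mpif.h'
      integer rank, size, ierror, namelen
      character*(MPI_MAX_PROCESSOR_NAME) hostname

      call MPI_INIT(ierror)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
!     Ask MPI for the name of the machine this process runs on.
      call MPI_GET_PROCESSOR_NAME(hostname, namelen, ierror)
      print*, 'rank', rank, 'of', size, 'on ', hostname(1:namelen)
      call MPI_FINALIZE(ierror)
      end
```

If every machine in the machinefile is reachable, each rank should report
its host name, so a missing or repeated host stands out at once.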

For the hostfile/machinefile structure,
including the number of slots/cores/processors, see "man mpiexec".
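
As an illustration only, a hostfile for your four machines might look like
the lines below. The host names are the ones from your message; the value
slots=1 is my assumption of one process per Pentium 4 node, so adjust it to
your hardware.

```
# one line per host; slots = number of processes to place on that host
pleiades slots=1
taygeta  slots=1
m43      slots=1
asterope slots=1
```

With such a file, the total number of processes can then be given
explicitly with -np, as Prentice suggested.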

The OpenMPI FAQ has answers for many of these initial setup questions.
It is worth taking a look.

I hope it helps,
Gus Correa