Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Test Program works on 1, 2 or 3 nodes. Hangs on 4 or more nodes.
From: Ethan Deneault (edeneault_at_[hidden])
Date: 2010-09-21 14:41:53


Prentice Bisbal wrote:

>
> I'm assuming you already tested ssh connectivity and verified everything
> is working as it should. (You did test all that, right?)

Yes. I can log in without a password from the master to every node, and from each node to every
other node. Each node mounts the same /home directory from the master, so they all share the same
copy of the ssh and rsh keys.
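
An all-pairs passwordless login check of this kind could be scripted roughly as follows. This is
only a sketch, not the commands actually run: the host list is assumed from names mentioned in
this thread, and check_pairs and the SSH variable are hypothetical names introduced for
illustration. BatchMode=yes makes ssh fail instead of prompting for a password, and
ConnectTimeout keeps an unreachable node from stalling the loop.

```shell
#!/bin/sh
# Host names are assumed from this thread; adjust to the actual cluster.
HOSTS="${HOSTS:-pleiades merope electra atlas asterope taygeta}"
# SSH is a hypothetical hook: set SSH=echo to print the commands instead of running them.
SSH="${SSH:-ssh}"

# check_pairs: from every host, try a non-interactive ssh to every other host.
check_pairs() {
    for src in $HOSTS; do
        for dst in $HOSTS; do
            [ "$src" = "$dst" ] && continue
            # BatchMode=yes fails rather than asking for a password.
            if $SSH -o BatchMode=yes -o ConnectTimeout=5 "$src" \
                   ssh -o BatchMode=yes -o ConnectTimeout=5 "$dst" true </dev/null
            then
                echo "ok   $src -> $dst"
            else
                echo "FAIL $src -> $dst"
            fi
        done
    done
}
```

Calling check_pairs from the master prints one line per ordered pair of hosts; any FAIL line
points at a pair that cannot log in non-interactively.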

> This sounds like a configuration problem on one of the nodes, or a problem
> with ssh. I suspect it's not a problem with the number of processes, but
> whichever node is the 4th in your machinefile has a connectivity or
> configuration issue:
>
> I would try the following:
>
> 1. reorder the list of hosts in your machine file.
> 3. Change your machinefile to include 4 completely different hosts.

Neither of these seems to help.

Run from the master (pleiades), the test program hangs during communication with any combination
of 3 other nodes. This happens whether I use --machinefile or -host; e.g.

$ mpirun -host merope,electra,atlas -np 4 ./test.out (hangs)
$ mpirun -host merope,electra,atlas -np 3 ./test.out (hangs)
$ mpirun -host merope,electra -np 3 ./test.out
  node 1 : Hello world
  node 0 : Hello world
  node 2 : Hello world
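
Because a hung mpirun has to be killed by hand, runs like these could be wrapped in timeout so
that each host list and process count reports its own result. This is a minimal sketch, assuming
GNU coreutils timeout is installed; try_run, MPIRUN, and LIMIT are hypothetical names, not part
of Open MPI, and the 30-second limit is an arbitrary choice.

```shell
#!/bin/sh
# MPIRUN is a hypothetical override hook (e.g. MPIRUN=echo for a dry run).
MPIRUN="${MPIRUN:-mpirun}"
# Seconds to wait before declaring a hang; 30 is an arbitrary assumption.
LIMIT="${LIMIT:-30}"

# try_run HOSTLIST NP: run ./test.out and report whether it finished, hung, or failed.
try_run() {
    hosts=$1
    np=$2
    timeout "$LIMIT" $MPIRUN -host "$hosts" -np "$np" ./test.out
    status=$?
    if [ "$status" -eq 0 ]; then
        echo "finished: -host $hosts -np $np"
    elif [ "$status" -eq 124 ]; then
        # GNU timeout exits with 124 when it had to kill the command.
        echo "hung:     -host $hosts -np $np"
    else
        echo "failed ($status): -host $hosts -np $np"
    fi
}
```

For example, `try_run merope,electra,atlas 4` followed by `try_run merope,electra 3` would
repeat the first and third runs above without needing Ctrl-C on the one that hangs.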

> 2. Run the mpirun command from a different host. I'd try running it from
> several different hosts.

The mpirun command does not seem to work when launched from one of the nodes. As an example:

Running on node asterope:

asterope$ mpirun -debug-daemons -host atlas,electra -np 4 ./test.out

Daemon was launched on atlas - beginning to initialize
Daemon was launched on electra - beginning to initialize
Daemon [[54956,0],1] checking in as pid 2716 on host atlas
Daemon [[54956,0],1] not using static ports
Daemon [[54956,0],2] checking in as pid 2741 on host electra
Daemon [[54956,0],2] not using static ports

(hangs)

> I think someone else recommended that you should be specifying the
> number of processes with -np. I second that.
>
> If the above fails, you might want to post the machine file you're using.

The machine file is a simple list of hostnames, as an example:

m43
taygeta
asterope

Cheers,
Ethan

-- 
Dr. Ethan Deneault
Assistant Professor of Physics
SC-234
University of Tampa
Tampa, FL 33615
Office: (813) 257-3555