Prentice Bisbal wrote:
> I'm assuming you already tested ssh connectivity and verified everything
> is working as it should. (You did test all that, right?)
Yes. I can log in without a password from the master to every node, and from each node to every
other node. Every node mounts the same /home directory from the master, so they all share the same
ssh and rsh keys.
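
A minimal sketch of how such an all-pairs passwordless check could be scripted. The function names
(host_pairs, check_pair, check_all) are hypothetical, and the sketch assumes OpenSSH, where
BatchMode=yes makes ssh fail instead of prompting for a password. It is an illustration, not the
exact commands that were run.

```shell
# Print every ordered pair "src dst" of distinct hosts from the argument list.
host_pairs() {
    for src in "$@"; do
        for dst in "$@"; do
            if [ "$src" != "$dst" ]; then
                echo "$src $dst"
            fi
        done
    done
}

# Log in to $1, and from there to $2, without ever prompting for a password.
# stdin is redirected so ssh does not swallow the caller's input.
check_pair() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" \
        ssh -o BatchMode=yes -o ConnectTimeout=5 "$2" true </dev/null
}

# Report the result for every ordered pair of the given hosts.
check_all() {
    host_pairs "$@" | while read -r src dst; do
        if check_pair "$src" "$dst"; then
            echo "ok   $src -> $dst"
        else
            echo "FAIL $src -> $dst"
        fi
    done
}
```

Calling check_all with the cluster's hostnames (for example: check_all pleiades merope electra atlas)
would print one line per ordered pair, so a single broken direction stands out.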
> This sounds like a configuration problem on one of the nodes, or a problem
> with ssh. I suspect it's not a problem with the number of processes, but
> whichever node is the 4th in your machinefile has a connectivity or
> configuration issue:
> I would try the following:
> 1. reorder the list of hosts in your machine file.
> 3. Change your machinefile to include 4 completely different hosts.
Neither of these has any beneficial effect.
The test program, run from the master (pleiades) with any combination of 3 other nodes, hangs during
communication. The same happens when I use -host instead of --machinefile; e.g.
$ mpirun -host merope,electra,atlas -np 4 ./test.out (hangs)
$ mpirun -host merope,electra,atlas -np 3 ./test.out (hangs)
$ mpirun -host merope,electra -np 3 ./test.out
node 1 : Hello world
node 0 : Hello world
node 2 : Hello world
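
To try many host combinations without a hang blocking the terminal, each run could be wrapped in a
time limit. This is only a sketch: it assumes GNU coreutils timeout (which exits with status 124
when the limit is reached), and the names classify_status, try_hosts and LIMIT are hypothetical.

```shell
# Map the exit status of a `timeout`-wrapped command to a short label.
classify_status() {
    case "$1" in
        0)   echo "ok" ;;
        124) echo "hang" ;;   # GNU timeout: the time limit was reached
        *)   echo "fail" ;;
    esac
}

# try_hosts HOSTLIST NP
# Run ./test.out on the given comma-separated hosts under a time limit
# (LIMIT seconds, default 30) and print a one-line summary.
try_hosts() {
    status=0
    timeout "${LIMIT:-30}" mpirun -host "$1" -np "$2" ./test.out \
        >/dev/null 2>&1 || status=$?
    echo "$1 (np=$2): $(classify_status "$status")"
}
```

For example, try_hosts merope,electra,atlas 3 followed by try_hosts merope,electra 3 would give one
summary line per run, labelled ok, hang or fail.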
> 2. Run the mpirun command from a different host. I'd try running it from
> several different hosts.
The mpirun command does not seem to work when launched from one of the nodes. As an example:
Running on node asterope:
asterope$ mpirun -debug-daemons -host atlas,electra -np 4 ./test.out
Daemon was launched on atlas - beginning to initialize
Daemon was launched on electra - beginning to initialize
Daemon [[54956,0],1] checking in as pid 2716 on host atlas
Daemon [[54956,0],1] not using static ports
Daemon [[54956,0],2] checking in as pid 2741 on host electra
Daemon [[54956,0],2] not using static ports
> I think someone else recommended that you should be specifying the
> number of processes with -np. I second that.
> If the above fails, you might want to post the machine file you're using.
The machine file is a simple list of hostnames, as an example:
Dr. Ethan Deneault
Assistant Professor of Physics
University of Tampa
Tampa, FL 33615
Office: (813) 257-3555