Dear Open-MPI list:
I'm trying to run two (soon to be three) dual opteron machines as a
cluster (network of workstations - they each have a disk and OS). I
can ssh between machines with no password. My open-mpi code compiled
fine and works great as an SMP program (using both processors on one
machine). However, I am not able to run my open-mpi program in parallel
across the two computers.
For SMP work I use:
mpirun -np 2 myprogram inputfile >outputfile
For cluster work I have tried:
mpirun --hostfile myhostfile -np 4 myprogram inputfile >outputfile
which does not write to the output file.
I have also tried:
mpirun --hostfile myhostfile -np 4 `myprogram inputfile >outputfile`
which just ran serially on the initial machine.
The open-mpi executable and libraries are on the head node, NFS-shared
to the slave node. Both computers can run the open-mpi application as
an SMP program with no problems. When I try to run the open-mpi
program on both computers, I use a directory that is NFS-shared with
the other computer.
I am running OpenSUSE 10.2 on both machines. I compiled with gcc 4.1 /
ifort 9.1.
I am using a gigabit network.
My hostfile specifies slots=2 max-slots=2 for each computer. The
computers are identified in the hostfile using the /etc/hosts alias.
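For illustration, a hostfile of that shape would look like the sketch
below. The names node1 and node2 are placeholders standing in for the
actual /etc/hosts aliases, which are not reproduced here:
# hypothetical hostfile sketch; node1/node2 are placeholder aliases
node1 slots=2 max-slots=2
node2 slots=2 max-slots=2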
The only config.log that I found was in the directory I used to make
open-mpi; since everything works as SMP, I am not including that file
with this initial message.
What should I be trying to do next to remedy this issue?
Any help would be appreciated.
Thanks,
Mark Kosmowski