This is my first attempt at configuring a Beowulf cluster running MPI. ALL
of the nodes are PS3s running Yellow Dog Linux 6.2 and the host (server) is
a Dell i686 Quad-core running Fedora Core 12. Thanks to a couple of members
on this forum (in a previous question), I learned that I needed to download
the openmpi code, configure, compile and install it on each of my machines.
I downloaded v1.4.1. I configured Open MPI for non-heterogeneous operation,
then compiled and installed it individually on each node and the server. I
have an NFS shared directory on the host where the application resides after
building. All nodes have access to the shared volume and they can see any
files in the shared volume. SSH is configured and the server can remote
into each node without using a password and vice versa. The built-in
firewalls (iptables and ip6tables) are disabled.
I downloaded and modified a very simple master/slave framework application
where the slave does a simple computation and gets the processor name. The
slave returns both pieces of information to the master who then simply
displays it in the terminal window. The master farms out 1024 such tasks to
the slaves and, after finalizing, the master exits.
I run the application in one of three ways:
1. mpirun -np 2 host_application - launched and run locally on the
server and uses one of its remaining 3 cores as a slave
2. mpirun -np 1 node_application - launched and run locally on the node
and uses the second slot as a slave
3. mpirun -np 1 --host host_name host_application : -np 1 --host
hostfile node_application - runs host_application as master on the Dell
server and runs node_application as a slave (rank=1) on the first PS3.
host_application and node_application are identical but compiled on their
respective machines to create loadable executables for that machine.
OK, so methods 1 and 2 run fine and the master farms out 1024 tasks to the
slave. The return values look like I expect. However, when I run method 3,
the application hangs - no error messages, nothing.
What I have discovered through rudimentary debugging (using files) is that
the master (Dell) initiates the MPI_Init call and node_application is
launched on the slave (PS3). The slave recognizes itself as rank 1 and
enters the slave code, which is to wait for the first message from the
master. However, the message from the master, an MPI_Send, is never
received by the slave. Since MPI_Send on the master is blocking and the
MPI_Recv on the slave is also blocking, the processing simply stalls.
This appears to be some kind of configuration issue between Fedora and YDL,
or I have not set something up properly.
Please keep in mind that when the applications are running locally, they are
performing the same Init, Send and Recv calls as when farming out to the
cluster, just without going off board, so to speak. Compiling and running
the application on the native hardware works perfectly (i.e., compiled and run
on the PS3 or compiled and run on the Dell). So I know that the code is
written properly and executes properly locally.
Has anyone else experienced this kind of behavior? Were you able to solve
it? Anyone have any suggestions as to where I might look to resolve this