Open MPI User's Mailing List Archives

Subject: [OMPI users] Cluster Communications Issues
From: Lee Manko (lmanko_at_[hidden])
Date: 2010-02-02 12:02:10


This is my first attempt at configuring a Beowulf cluster running MPI. All
of the nodes in the cluster are PS3s running Yellow Dog Linux 6.2, and the
host (server) is a Dell i686 quad-core running Fedora Core 12. The cluster
is running Open MPI v1.4.1, configured (non-homogeneous), compiled and
installed individually on each node and on the server. I have an NFS-shared
directory on the host where the application resides after building. All
nodes have access to the shared volume and can see all of its files. SSH is
configured so that the server can log into each node without a password,
and vice versa. The built-in firewalls (iptables and ip6tables) are
disabled. The server has a dual Ethernet card: the first interface, eth1,
is used for cluster communications and has the static IP address
192.168.0.1; the second, eth2, communicates with the outside world and is
connected to a corporate network with a DHCP-assigned IP address.

I have a very simple master/slave framework application in which the slave
does a simple computation and returns the result along with the processor
name. The master farms out 1024 such tasks to the slaves and, after
finalizing, exits.
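For context, the master/slave pattern described above can be sketched roughly as follows. This is not the original code: the task payload, tags, and the stand-in computation are all assumptions, and it assumes there are more tasks than slaves so every slave eventually receives a stop message.

```c
/* Hypothetical sketch of a master/slave MPI framework, not the poster's
   actual code. Build with mpicc, launch with mpirun. */
#include <mpi.h>
#include <stdio.h>

#define NUM_TASKS 1024
#define TAG_WORK  1
#define TAG_STOP  2

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                        /* master */
        int sent = 0, done = 0;
        /* Prime each slave with one task. */
        for (int r = 1; r < size && sent < NUM_TASKS; r++, sent++)
            MPI_Send(&sent, 1, MPI_INT, r, TAG_WORK, MPI_COMM_WORLD);
        /* Collect results; hand out remaining work or a stop message. */
        while (done < sent) {
            double result;
            MPI_Status st;
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            done++;
            if (sent < NUM_TASKS) {
                MPI_Send(&sent, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                sent++;
            } else {
                MPI_Send(&sent, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
            }
        }
    } else {                                /* slave */
        for (;;) {
            int task;
            MPI_Status st;
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP)
                break;
            double result = task * 2.0;     /* stand-in computation */
            char name[MPI_MAX_PROCESSOR_NAME];
            int len;
            MPI_Get_processor_name(name, &len);
            printf("task %d done on %s\n", task, name);
            MPI_Send(&result, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}
```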

When I run the code locally on the multiple cores of either the server or
a PS3, it executes and completes as expected. However, when I have mpirun
spread the work across the nodes, the process hangs waiting for messages
to be passed between the server and the nodes. What I have discovered is
that if I unplug the second NIC (the one using DHCP), the process executes
fine.
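This is not from the original post, but for reference: Open MPI can be told which interfaces its TCP transport may use via MCA parameters, which is one way to keep traffic off the corporate-facing NIC instead of unplugging it. A sketch of the invocation (eth1 comes from the description above; the hostfile and application names are placeholders):

```shell
# Restrict Open MPI's TCP byte-transfer layer and out-of-band channel to
# the cluster interface so the DHCP-facing NIC is never tried.
mpirun --mca btl_tcp_if_include eth1 \
       --mca oob_tcp_if_include eth1 \
       -np 8 --hostfile my_hosts ./my_app
```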

I have requested a static IP address from the network admin, but I am
curious whether anyone else has run into this when using DHCP.

Thanks.

Lee Manko