On Tue, Jun 19, 2007 at 03:40:36PM -0400, Jeff Squyres wrote:
> On Jun 19, 2007, at 2:24 PM, George Bosilca wrote:
> > 1. I don't believe the OS releases the binding when we close the
> > socket. As an example, on Linux the kernel sockets are released at a
> > later moment. That means the socket might still be in use for the
> > next run.
> So...? If you define a large enough range, it's not a big deal --
> if you use port N for one run and start another run right after the
> first one finishes, you'll use port N+1.
This is indeed the assumption I am working under.
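To make that assumption concrete, here is a minimal sketch of "take the
first free port in a fixed range". It is only an illustration in Python,
not how Open MPI actually selects its ports; the function name
bind_first_free and the example range are made up for this mail.

```python
import socket


def bind_first_free(first_port, last_port, host="127.0.0.1"):
    """Return a listening TCP socket on the first free port in the range.

    Hypothetical helper: it scans [first_port, last_port] in order and
    skips any port the kernel refuses to bind (for example because a
    socket from an earlier run has not been released yet).
    """
    for port in range(first_port, last_port + 1):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
        except OSError:
            # Port is still held; close this socket and try the next port.
            sock.close()
            continue
        sock.listen(1)
        return sock
    raise RuntimeError(
        "no free port between %d and %d" % (first_port, last_port)
    )


if __name__ == "__main__":
    # Two sockets opened back to back end up on different ports.
    first = bind_first_free(45000, 45050)
    second = bind_first_free(45000, 45050)
    print(first.getsockname()[1], second.getsockname()[1])
    first.close()
    second.close()
```

With a range much larger than the number of processes, a port that is
still held from the previous run just gets skipped.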
> That being said, I am equally dubious about restricting to specific
> port ranges, but for different reasons:
> 1. If you're trying to go through firewalls, this isn't enough.
> You'll also need "external" IP addresses for each internal IP
> address. This alone is such a hassle that it really makes the
> concept not worth it (and no competent network/firewall admin would
> agree to do it ;-) ). Instead, you'd want a *single* punch-through
> in the firewall to communicate between processes in front of and
> behind the firewall, and then have some MPI-level routing to
> multiplex all relevant MPI communication through that single pinhole.
Very true. As I intimated in my previous reply, these are per-machine
> 2. If your range is small enough and you execute lots and lots of
> short jobs on the same nodes, you could run out of available ports in
> the range while the kernel is shutting down the sockets from the
> previous runs.
The job in question here takes many hours/days to run, and nProc << nPorts, so
this shouldn't be an issue.
> This is why I asked about the network topology in my previous mail.
OK, now time to report the results of my recent set of tests...
Machines in cluster   Processes per node   Stuck in MPI_Barrier?
-------------------   ------------------   ---------------------
01 - 10               2                    yes
01 - 10               1                    yes
01 - 03               1                    no
01 - 10               1                    yes
01 - 09               1                    no
01 - 04, 06 - 10      1                    yes
01 - 09               2                    no
Using all 10 machines invariably causes the job to get stuck in MPI_Barrier.
Reducing the number of machines to 9 by dropping machine 10 lets the job
continue. The number of machines is ruled out as a factor by changing which 9
machines are used: the run on machines 01 - 04 and 06 - 10 still gets stuck.
It appears that including machine 10 in the cluster is what causes the job to
get stuck. Machine 10 is the machine that is in a different location in the
building. Cutting this machine out and increasing the number of processes per
node also seems to work.
Machine 10 is the key...
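The same elimination can be written down as a small set calculation. This
is just a sketch in Python over the runs in the table above; the names
RUNS and suspect_machines are invented for illustration. A machine is a
suspect if it took part in every stuck run and in no run that completed.

```python
def suspect_machines(runs):
    """Return machines present in every stuck run and in no passing run.

    `runs` is a list of (machines, stuck) pairs, where `machines` is a
    set of machine numbers and `stuck` is True if the job hung in
    MPI_Barrier.
    """
    failing = [machines for machines, stuck in runs if stuck]
    passing = [machines for machines, stuck in runs if not stuck]
    if not failing:
        return set()
    in_every_failure = set.intersection(*failing)
    seen_working = set().union(*passing)
    return in_every_failure - seen_working


# The seven runs from the table, in order (processes per node omitted,
# since it did not change the outcome for a given set of machines).
RUNS = [
    (set(range(1, 11)), True),          # 01 - 10
    (set(range(1, 11)), True),          # 01 - 10
    (set(range(1, 4)), False),          # 01 - 03
    (set(range(1, 11)), True),          # 01 - 10
    (set(range(1, 10)), False),         # 01 - 09
    (set(range(1, 11)) - {5}, True),    # 01 - 04, 06 - 10
    (set(range(1, 10)), False),         # 01 - 09
]

if __name__ == "__main__":
    print(sorted(suspect_machines(RUNS)))
```

This only shows which machine is consistent with the results so far; it
says nothing about why that machine causes the hang.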
The fact that I can work around this problem means that finding the solution
is not quite so pressing for me now; however, I'm still curious as to what the
underlying problem is. I've spoken to the network admin and he confirms my
understanding of the network layout. My next test should be to move a second
machine to the alternative location and see if this affects the results
(perhaps there is something special about the setup of machine 10?). This will
have to wait until these jobs complete.
If you OpenMPI folks are still interested in helping me trace the problem then
I will gladly accept your help. If not, then I'll make do and fade into the
background until I need to call upon your wisdom again! :-p
Thanks for all your help,