Open MPI User's Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2006-02-12 17:23:50


On Feb 10, 2006, at 12:18 PM, James Conway wrote:

>> Open MPI uses random port numbers for all its communication.
>> (etc)
>
> Thanks for the explanation. I will live with the open firewall, and
> look at the ipfw docs for writing a script.

That may be somewhat difficult. We previously looked into making LAM/
MPI work behind firewalls and ran into some unexpected issues -- the
short version was that, at least for the way LAM was set up, even if
you could restrict the port numbers that LAM would choose for its TCP
communications, you had to have a virtual host out in front of the
firewall to relay the traffic to the appropriate internal host.
Specifically, you needed an IP address in front of the firewall for
each host so that traffic would route to the correct back-end
instance of the MPI application.

The real solution here is to have Open MPI be able to route its TCP
communications through multiple hosts instead of assuming that it is
always talking directly to the target host. (LAM actually had the
run-time layer version of that implemented eons ago, but we've never
used it -- and more changes would be needed up at the TCP layer to do
the same thing.)

We have not yet added any TCP routing capabilities in Open MPI. It's
on the long-range to-do list (meaning: several of us have talked
about it and agree that it's a good idea, but no one has committed to
any timeframe as to when it would be done). Contributions from the
community would be greatly appreciated. :-)

> Now I have a more "core" Open MPI problem, which may just be
> unfamiliarity on my part. I seem to have the environment variables
> set up all right, though -- the code runs but doesn't complete.
>
> I have copied the "MPI Tutorial: The canonical ring program" from
> <http://www.lam-mpi.org/tutorials/>. It compiles and runs fine on the
> localhost (one CPU, one or more MPI processes). If I copy it to a
> remote host, it does one round of passing the 'tag' then stalls. I
> modified the print statements a bit to see where in the code it
> stalls, but the loop hasn't changed. This is what I see happening:
> 1. Process 0 successfully kicks off the pass-around by sending the
> tag to the next process (1), and then enters the loop where it waits
> for the tag to come back.
> 2. Process 1 enters the loop, receives the tag and passes it on (back
> to process 0 since this is a ring of 2 players only).
> 3. Process 0 successfully receives the tag, decrements it, and calls
> the next send (MPI_Send), but it doesn't return from this. I have a
> print statement right after (with fflush) but there is no output. The
> CPU is maxed out on both the local and remote hosts -- I assume some
> kind of polling.
> 4. Needless to say, Process 1 never reports receipt of the tag.
>
> Output (with a little re-ordering to make sense) is:
> mpirun --hostfile my_mpi_hosts --np 2 mpi_test1
> Process rank 0: size = 2
> Process rank 1: size = 2
> Enter the number of times around the ring: 5
>
> Process 0 doing first send of '4' to 1
> Process 0 finished sending, now entering loop
>
> Process 0 waiting to receive from 1
>
> Process 1 waiting to receive from 0
> Process 1 received '4' from 0
> Process 1 sending '4' to 0
> Process 1 finished sending
> Process 1 waiting to receive from 0
>
> Process 0 received '4' from 1
>>> Process 0 decremented num
> Process 0 sending '3' to 1
> !---- nothing more - hangs at 100% cpu until ctrl-
> !---- should see "Process 0 finished sending"
>
> Since process 0 succeeds in calling MPI_Send before the loop, and in
> calling MPI_Recv at the start of the loop, the communications appear
> to be working. Likewise, process 1 succeeds in receiving and sending
> within the loop. However, if it's significant, these calls work one
> time for each process -- the second time MPI_Send is called by
> process 0, there is a hang.
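
For readers following along, here is a minimal sketch of the ring
program being described. It is reconstructed from the description and
output above (and from the shape of the LAM/MPI tutorial example), so
it is not necessarily James's exact code:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size, next, prev, num;
        int tag = 201;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("Process rank %d: size = %d\n", rank, size);

        next = (rank + 1) % size;        /* neighbor we send to */
        prev = (rank + size - 1) % size; /* neighbor we receive from */

        /* Rank 0 kicks off the pass-around. */
        if (rank == 0) {
            printf("Enter the number of times around the ring: ");
            fflush(stdout);
            scanf("%d", &num);
            num--;  /* matches the "first send of '4'" for input 5 */
            printf("Process 0 doing first send of '%d' to %d\n",
                   num, next);
            MPI_Send(&num, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
        }

        /* Everyone passes the value around the ring; rank 0
           decrements it once per lap. */
        while (1) {
            MPI_Recv(&num, 1, MPI_INT, prev, tag, MPI_COMM_WORLD,
                     &status);
            if (rank == 0)
                num--;
            MPI_Send(&num, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
            if (num == 0)
                break;
        }

        /* Rank 0 absorbs the final zero so the last process's
           send is matched. */
        if (rank == 0)
            MPI_Recv(&num, 1, MPI_INT, prev, tag, MPI_COMM_WORLD,
                     &status);

        MPI_Finalize();
        return 0;
    }

The hang described above corresponds to the MPI_Send inside the loop
on rank 0's second lap.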

Well, that is definitely odd. The fact that the first send finishes
and the second does not is quite fishy. A few questions:

- Have you absolutely, entirely disabled all firewalling between the
two hosts?
- Do you have only one TCP interface on both machines? If you have
more than one, we can try telling Open MPI to ignore one of them (see
the example below).
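
If it does turn out that there are multiple interfaces, the relevant
knobs are the btl_tcp_if_include / btl_tcp_if_exclude MCA parameters.
A sketch, with assumed interface names (substitute whatever ifconfig
reports on your machines):

    # Restrict Open MPI's TCP traffic to one interface (interface
    # names here are assumptions -- e.g. en0 on Mac OS X, eth0 on
    # Linux):
    mpirun --mca btl_tcp_if_include en0 \
        --hostfile my_mpi_hosts --np 2 mpi_test1

    # Or exclude the loopback and any extra interface instead:
    mpirun --mca btl_tcp_if_exclude lo0,en1 \
        --hostfile my_mpi_hosts --np 2 mpi_test1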

-- 
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/