I'm not 100% sure, but I think I might know what's wrong. I can reproduce
something similar (oddly, it does not happen all the time) if I activate
my firewall and let all the traffic through (i.e. accept all connections).
In short, I think the firewall (even when disabled) introduces some
delay in the setup stage of the TCP connection and we "kind of" lose one
of the messages. Let me find a high-delay cluster and I will take a look.
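
For reference, the send/receive pattern described in the quoted message
below looks roughly like the following. This is only a minimal sketch of
a ring program, not the tutorial's actual code (which is not reproduced
in this mail); the lap count, the tag value and the variable names are
my own assumptions.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;
    int num = 5;          /* laps around the ring; arbitrary value (assumption) */
    const int tag = 201;  /* message tag; arbitrary value (assumption) */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Neighbours in the ring; with 2 processes both are the other rank. */
    int next = (rank + 1) % size;
    int prev = (rank + size - 1) % size;

    /* Step 1: process 0 kicks off the pass-around. */
    if (rank == 0) {
        MPI_Send(&num, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
    }

    for (;;) {
        /* Wait for the token from the previous process. */
        MPI_Recv(&num, 1, MPI_INT, prev, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

        /* Only process 0 decrements, once per completed lap. */
        if (rank == 0) {
            --num;
        }
        printf("rank %d: token is now %d\n", rank, num);
        fflush(stdout);

        /* Pass the token on; this is the call reported to hang the
           second time process 0 reaches it. */
        MPI_Send(&num, 1, MPI_INT, next, tag, MPI_COMM_WORLD);

        /* A token of 0 is forwarded once more, then each process leaves. */
        if (num == 0) {
            break;
        }
    }

    /* Process 0 absorbs the final token sent by the last process. */
    if (rank == 0) {
        MPI_Recv(&num, 1, MPI_INT, prev, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```

Built with mpicc and started with something like "mpirun -np 2 ./ring"
(the binary name is just a placeholder), every process should print one
line per lap until the token reaches 0. In the situation described
below, I would expect the output to stop after the first lap.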
On Fri, 10 Feb 2006, James Conway wrote:
> I have copied the "MPI Tutorial: The cannonical ring program" from
> <http://www.lam-mpi.org/tutorials/>. It compiles and runs fine on the
> localhost (one CPU, one or more MPI processes). If I copy it to a
> remotehost, it does one round of passing the 'tag' then stalls. I
> modified the print statements a bit to see where in the code it
> stalls, but the loop hasn't changed. This is what I see happening:
> 1. Process 0 successfully kicks off the pass-around by sending the
> tag to the next process (1), and then enters the loop where it waits
> for the tag to come back.
> 2. Process 1 enters the loop, receives the tag and passes it on (back
> to process 0 since this is a ring of 2 players only).
> 3. Process 0 successfully receives the tag, decrements it, and calls
> the next send (MPI_Send) but it doesn't return from this. I have a
> print statement right after (with fflush) but there is no output. The
> CPU is maxed out on both the local and remote hosts, I assume some
> kind of polling.
> 4. Needless to say, Process 1 never reports receipt of the tag.
> Since process 0 succeeds in calling MPI_Send before the loop, and in
> calling MPI_Recv at the start of the loop, the communications appear
> to be working. Likewise, process 1 succeeds in receiving and sending
> within the loop. However, if it's significant, these calls work one
> time for each process - the second time MPI_Send is called by process
> 0, there is a hang.
> I am using Mac OSX 10.4.4 and gcc 4.0.1 on both systems, with OpenMPI
> 1.0.1 installed (compiled from sources). The small tutorial code is
> below (I hope it's OK to include here), with the few printf mods that
> I made.
"We must accept finite disappointment, but we must never lose infinite
hope."
Martin Luther King