I managed to reproduce the "bug" with a simple case (see the cpp file).
I am running this on 2 nodes with 8 cores each. If I run with the openib
btl, then the MPI_Ssend operations take ~1e-5 seconds for intra-node
ranks, and ~11 seconds for inter-node ranks. Note that 11 seconds is
roughly the time required to execute the loop that follows the
MPI_Recv. The average time required for the MPI_Ssend to return is 5.1
seconds.
If I run with:
mpiexec --mca btl ^openib ./test-mpi-latency.out
then intra-node communications take ~0.5-1e-5 seconds, while inter-node
communications take ~1e-6 seconds, for an average of ~5e-5 seconds.
I compiled this with gcc 4.7.2 + openmpi 1.6.3, as well as gcc 4.6.1 +
openmpi 1.4.5. Both show the same behavior.
However, on the same machine, with gcc 4.6.1 + mvapich2/1.8, the latency
is always quite low.
The fact that mvapich2 does not show this behavior points to a
problem with the openib btl within openmpi, and not with our setup.
Can anyone try to reproduce this on a different machine?
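
For anyone attempting the reproduction, here is a minimal sketch of how per-call durations like the ones quoted above could be collected. It is not the test program mentioned in this message; `time_call` and `LatencyStats` are hypothetical names introduced purely for illustration, and sorting destinations into intra-node and inter-node groups is left to the caller.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <utility>

// Wall-clock duration, in seconds, of a single call to op().
// steady_clock is used so the result is never affected by clock adjustments.
template <typename Op>
double time_call(Op&& op) {
    auto start = std::chrono::steady_clock::now();
    std::forward<Op>(op)();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Running mean of the durations recorded for one group of destination
// ranks (for example "intra-node" or "inter-node").
// Hypothetical helper, not part of MPI or of the original test program.
struct LatencyStats {
    double total = 0.0;     // sum of recorded durations, in seconds
    std::size_t count = 0;  // number of recorded durations

    void add(double seconds) {
        assert(seconds >= 0.0);  // a duration cannot be negative
        total += seconds;
        ++count;
    }

    // Mean duration in seconds; 0.0 if nothing has been recorded yet.
    double mean() const {
        return count ? total / static_cast<double>(count) : 0.0;
    }
};
```

On rank 0, each send could then be wrapped as `intra.add(time_call([&] { MPI_Ssend(buf, n, MPI_BYTE, dest, 0, MPI_COMM_WORLD); }));` for destinations on the same node, with a second `LatencyStats` for remote destinations (`buf`, `n`, `dest` and `intra` are placeholder names here).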
On 2013-02-15 14:29, Maxime Boissonneault wrote:
> Hi again,
> I found out that if I add an
> MPI_Barrier after the MPI_Recv part, then there is no minute-long
> latency.
> Is it possible that even if MPI_Recv returns, the openib btl does not
> guarantee that the acknowledgement is sent promptly? In other words,
> is it possible that the computation following the MPI_Recv delays the
> acknowledgement? If so, is this the intended behavior, and why isn't
> the same behavior observed with the tcp btl?
> Maxime Boissonneault
> On 2013-02-14 11:50, Maxime Boissonneault wrote:
>> I have a strange case here. The application is "plink"
>> (http://pngu.mgh.harvard.edu/~purcell/plink/download.shtml). The
>> computation/communication pattern of the application is the following :
>> 1- MPI_Init
>> 2- Some single rank computation
>> 3- MPI_Bcast
>> 4- Some single rank computation
>> 5- MPI_Barrier
>> 6- rank 0 sends data to each other rank with MPI_Ssend, one rank at a time
>> 6- other ranks use MPI_Recv
>> 7- Some single rank computation
>> 8- other ranks send result to rank 0 with MPI_Ssend
>> 8- rank 0 receives data with MPI_Recv
>> 9- rank 0 analyses result
>> 10- MPI_Finalize
>> The amount of data being sent is on the order of kilobytes, and
>> we have IB.
>> The problem we observe is in step 6. I printed timestamps before and
>> after each MPI operation. With the openib btl, the behavior I observe
>> is that:
>> - rank 0 starts sending
>> - rank n receives almost instantly, and MPI_Recv returns.
>> - rank 0's MPI_Ssend often returns only _minutes_ later.
>> It looks like the acknowledgement from rank n takes minutes to reach
>> rank 0.
>> Now, the tricky part is that if I disable the openib btl and use
>> tcp over IB instead, there is no such latency and the acknowledgement
>> comes back within a fraction of a second. Also, if rank 0 and rank n
>> are on the same node, the acknowledgement is also quasi-instantaneous
>> (I guess it goes through the sm btl instead of openib).
>> I tried to reproduce this in a simple case, but I observed no such
>> latency. The duration that I got for the whole communication is on
>> the order of milliseconds.
>> Does anyone have an idea of what could cause such very high latencies
>> when using the openib btl?
>> Also, I tried replacing step 6 by explicitly sending a confirmation :
>> - rank 0 does MPI_Isend to rank n followed by MPI_Recv from rank n
>> - rank n does MPI_Recv from rank 0 followed by MPI_Isend to rank 0
>> In this case also, rank n's MPI_Isend executes quasi-instantaneously,
>> and rank 0's MPI_Recv only returns a few minutes later.
>> Maxime Boissonneault
Computing Analyst - Calcul Québec, Université Laval
Ph.D. in physics