I found out that if I add an
MPI_Barrier after the MPI_Recv part, then there is no minute-long latency.
Is it possible that even if MPI_Recv returns, the openib btl does not
guarantee that the acknowledgement is sent promptly? In other words, is
it possible that the computation following the MPI_Recv delays the
acknowledgement? If so, is this the expected behavior, and why isn't
the same behavior observed with the tcp btl?
On 2013-02-14 11:50, Maxime Boissonneault wrote:
> I have a strange case here. The application is "plink"
> (http://pngu.mgh.harvard.edu/~purcell/plink/download.shtml). The
> computation/communication pattern of the application is the following:
> 1- MPI_Init
> 2- Some single rank computation
> 3- MPI_Bcast
> 4- Some single rank computation
> 5- MPI_Barrier
> 6- rank 0 sends data to each other rank with MPI_Ssend, one rank at a time
> 6- other ranks use MPI_Recv
> 7- Some single rank computation
> 8- other ranks send result to rank 0 with MPI_Ssend
> 8- rank 0 receives data with MPI_Recv
> 9- rank 0 analyses result
> 10- MPI_Finalize
> The amount of data being sent is on the order of kilobytes, and we
> have InfiniBand (IB).
> The problem we observe is in step 6. I printed timestamps before and
> after each MPI operation. With the openib btl, the behavior I observe
> is:
> - rank 0 starts sending
> - rank n receives almost instantly, and MPI_Recv returns.
> - rank 0's MPI_Ssend often returns only _minutes_ later.
> It looks like the acknowledgement from rank n takes minutes to reach
> rank 0.
> Now, the tricky part is that if I disable the openib btl to use
> instead tcp over IB, there is no such latency and the acknowledgement
> comes back within a fraction of a second. Also, if rank 0 and rank n
> are on the same node, the acknowledgement is also quasi-instantaneous
> (I guess it goes through the SM btl instead of openib).
> I tried to reproduce this in a simple case, but I observed no such
> latency. The duration that I got for the whole communication is of the
> order of milliseconds.
> Does anyone have an idea of what could cause such very high latencies
> when using the OpenIB BTL?
> Also, I tried replacing step 6 by explicitly sending a confirmation:
> - rank 0 does MPI_Isend to rank n followed by MPI_Recv from rank n
> - rank n does MPI_Recv from rank 0 followed by MPI_Isend to rank 0
> In this case also, rank n's MPI_Isend executes quasi-instantaneously,
> and rank 0's MPI_Recv only returns a few minutes later.
> Maxime Boissonneault
Computing Analyst - Calcul Québec, Université Laval
Ph.D. in Physics