Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Very high latency with openib btl
From: Maxime Boissonneault (maxime.boissonneault_at_[hidden])
Date: 2013-02-15 14:29:48


Hi again,
I found out that if I add an
MPI_Barrier after the MPI_Recv part, then there is no minute-long latency.

Is it possible that even after MPI_Recv returns, the openib btl does not
guarantee that the acknowledgement is sent promptly? In other words, is
it possible that the computation following the MPI_Recv delays the
acknowledgement? If so, is this the intended behavior, and why isn't the
same behavior observed with the tcp btl?

Maxime Boissonneault

On 2013-02-14 11:50, Maxime Boissonneault wrote:
> Hi,
> I have a strange case here. The application is "plink"
> (http://pngu.mgh.harvard.edu/~purcell/plink/download.shtml). The
> computation/communication pattern of the application is the following :
>
> 1- MPI_Init
> 2- Some single rank computation
> 3- MPI_Bcast
> 4- Some single rank computation
> 5- MPI_Barrier
> 6- rank 0 sends data to each other rank with MPI_Ssend, one rank at a
> time.
> 6- other ranks use MPI_Recv
> 7- Some single rank computation
> 8- other ranks send result to rank 0 with MPI_Ssend
> 8- rank 0 receives data with MPI_Recv
> 9- rank 0 analyses result
> 10- MPI_Finalize
>
> The amount of data being sent is on the order of kilobytes, and the
> nodes are connected with InfiniBand (IB).
>
> The problem we observe is in step 6. I printed timestamps before and
> after each MPI operation. With the openib btl, the behavior I observe
> is the following:
> - rank 0 starts sending
> - rank n receives almost instantly, and MPI_Recv returns.
> - rank 0's MPI_Ssend often returns only _minutes_ later.
>
> It looks like the acknowledgement from rank n takes minutes to reach
> rank 0.
>
> Now, the tricky part is that if I disable the openib btl and use tcp
> over IB instead, there is no such latency and the acknowledgement
> comes back within a fraction of a second. Also, if rank 0 and rank n
> are on the same node, the acknowledgement is nearly instantaneous
> (I guess it goes through the SM btl instead of openib).
>
> I tried to reproduce this in a simple test case, but I observed no such
> latency. The whole communication took on the order of milliseconds.
>
> Does anyone have an idea of what could cause such high latencies
> when using the openib btl?
>
> Also, I tried replacing step 6 with an explicit confirmation:
> - rank 0 does MPI_Isend to rank n followed by MPI_Recv from rank n
> - rank n does MPI_Recv from rank 0 followed by MPI_Isend to rank 0
>
> In this case also, rank n's MPI_Isend returns almost instantly,
> and rank 0's MPI_Recv only returns a few minutes later.
>
> Thanks,
>
> Maxime Boissonneault

-- 
---------------------------------
Maxime Boissonneault
Computing Analyst - Calcul Québec, Université Laval
Ph.D. in Physics