Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] MPI_Allreduce hangs
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2012-07-02 19:29:27


On Jun 27, 2012, at 6:32 PM, Martin Siegert wrote:

> However, there is another issue that may affect the performance of the 1.6.1
> version. I see a LOT of the following messages on stderr:
>
> --------------------------------------------------------------------------
> The OpenFabrics (openib) BTL failed to register memory in the driver.
> Please check /var/log/messages or dmesg for driver specific failure
> reason.
> The failure occured here:
>
> Local host: b413
> Device: mlx4_0
> Function: openib_reg_mr()
> Errno says: Cannot allocate memory (errno=12)
>
> You may need to consult with your system administrator to get this
> problem fixed.
> --------------------------------------------------------------------------

There's been a LOT of discussion about this by the developers (both on-line and off).

We've removed that error message, so at least you won't see it ad infinitum.

What's happening is that you're getting a registered memory imbalance -- see http://blogs.cisco.com/performance/registered-memory-imbalances/ for some details.

The fix we put in handles registered memory exhaustion in most cases: when registration fails, OMPI falls back to send/recv. But because OMPI wires up connections lazily, exhaustion can still bite later (e.g., late in an application you do an MPI_SEND to a new recipient, and OMPI can't allocate a new QP because it's out of registered memory).
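
To make the failure mode above concrete, here is a toy Python model of it. This is only a sketch and is not Open MPI code: the class names, the 64 KiB per-QP cost, and the 1 MiB limit are all made up for illustration. It assumes the send/recv fallback reuses buffers that were registered earlier and so needs no new registration, while creating a QP for a new peer does need some.

```python
class RegistrationError(Exception):
    """Stands in for the driver returning ENOMEM (errno=12)."""


class ToyHca:
    """Hypothetical adapter with a hard cap on registered memory."""

    def __init__(self, limit_bytes):
        self.limit = limit_bytes
        self.used = 0

    def register(self, nbytes):
        if self.used + nbytes > self.limit:
            raise RegistrationError("Cannot allocate memory")
        self.used += nbytes


class ToyEndpoint:
    QP_COST = 64 * 1024  # made-up registered-memory cost of one new QP

    def __init__(self, hca):
        self.hca = hca
        self.connected = set()

    def send(self, peer, nbytes):
        if peer not in self.connected:
            # Lazy wire-up: the QP is created on the first send to a peer.
            # There is no fallback here, so exhaustion surfaces as an error.
            self.hca.register(self.QP_COST)
            self.connected.add(peer)
        try:
            # Preferred path: register the user buffer for this transfer.
            self.hca.register(nbytes)
            return "rdma"
        except RegistrationError:
            # Fallback path: assumed to need no new registration.
            return "send/recv"


if __name__ == "__main__":
    KIB = 1024
    ep = ToyEndpoint(ToyHca(limit_bytes=1024 * KIB))
    print(ep.send(0, 900 * KIB))  # fits under the cap
    print(ep.send(0, 512 * KIB))  # over the cap, falls back
    try:
        ep.send(1, 1 * KIB)       # new peer, QP creation fails
    except RegistrationError as exc:
        print("late wire-up failed:", exc)
```

In this model the second send to an already-connected peer degrades gracefully, but the first send to a new peer fails outright, which is the case that lazy wire-up leaves open.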

It turns out to be a rather sticky problem to solve. We're still debating. :-\

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/