Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] The option "--mca hwloc_base_mem_alloc_policy local_only" doesn't bind memory to numa node
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2012-05-29 12:11:45


On May 28, 2012, at 3:48 PM, François Guertin wrote:

> I am trying to bind both the memory and the processes on our compute
> cluster nodes, but only the process binding works. How can I also specify
> that memory be allocated on the same NUMA node as the one where the
> process is bound? I tried the option "--mca hwloc_base_mem_alloc_policy
> local_only" without any luck.
> [snip]

Thanks for your very detailed message -- it made it possible to completely understand your question, and (hopefully) answer it properly. :-)

I think the issue here is that the help message for hwloc_base_mem_alloc_policy isn't quite worded properly:

>> MCA hwloc: parameter "hwloc_base_mem_alloc_policy" (current value:
>> <none>, data source: default value)
>> Policy that determines how general memory
>> allocations are bound after MPI_INIT. A value of "none" means that no
>> memory policy is applied. A value of "local_only" means that all
>> memory allocations will be restricted to the local NUMA node where
>> each process is placed. Note that operating system paging policies
>> are unaffected by this setting. For example, if "local_only" is used
>> and local NUMA node memory is exhausted, a new memory allocation may
>> cause paging.
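
For reference, the parameter is passed on the mpirun command line together with processor binding. A sketch (the application name and rank count are placeholders; --bind-to-core is the binding flag from the Open MPI 1.5/1.6 series):

```shell
# Bind each rank to a core and have its new heap allocations placed on
# the local NUMA node (a placement policy only -- pages are not pinned):
mpirun --bind-to-core \
       --mca hwloc_base_mem_alloc_policy local_only \
       -np 8 ./my_app
```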

At issue is the fact that I probably should not have used the word "bound" in the first sentence, and I should have clarified that memory is *not* bound.

Specifically, setting hwloc_base_mem_alloc_policy to "local_only" only sets the policy for where newly malloced memory is placed. Even more specifically: it does *not* bind the memory, meaning that if your process's memory is swapped out, it could get swapped back in to a different physical location (yoinks!).

That being said, most HPC apps don't swap, so it's *usually* not an issue. But, of course, after you malloc memory (which will be physically located on your local NUMA node), you could bind it, too, if you want.

Open MPI doesn't bind user-allocated memory (except possibly buffers passed to communication functions like MPI_SEND and MPI_RECV), because doing so would mean intercepting calls like malloc, calloc, etc. And we don't really want to be in that business.

(disclaimer: we sorta do intercept malloc, calloc, etc. in some cases -- but we really don't want to, and don't do it in all cases. I can explain more if you care)

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/