Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Memory affinity
From: David Singleton (David.Singleton_at_[hidden])
Date: 2010-09-27 17:50:04


On 09/28/2010 06:52 AM, Tim Prince wrote:
> On 9/27/2010 12:21 PM, Gabriele Fatigati wrote:
>> HI Tim,
>>
>> I have read that link, but I haven't understood whether enabling
>> processor affinity also enables memory affinity, because it says:
>>
>> "Note that memory affinity support is enabled only when processor
>> affinity is enabled"
>>
>> Can I set processor affinity without memory affinity? That is my
>> question.
>>
>>
>> 2010/9/27 Tim Prince<n8tm_at_[hidden]>
>>> On 9/27/2010 9:01 AM, Gabriele Fatigati wrote:
>>>> If Open MPI is compiled with NUMA support, is memory affinity enabled
>>>> by default? I ask because I couldn't find a standalone memory affinity
>>>> (or similar) parameter to set to 1.
>>>>
>>>>
>>> The FAQ http://www.open-mpi.org/faq/?category=tuning#using-paffinity
>>> has a useful introduction to affinity. It's available in a default
>>> build, but not enabled by default.
>>>
> Memory affinity is implied by processor affinity. Your system libraries
> are set up so that any memory allocated is placed local to the
> processor, if possible. That's one of the primary benefits of processor
> affinity. Not being an expert in Open MPI, I assume, in the absence of
> further easily accessible documentation, that there's no useful explicit
> way to disable maffinity while using paffinity on platforms other than
> the specified legacy platforms.
>
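
(For reference, the tuning FAQ linked above covers enabling processor
binding via the mpi_paffinity_alone MCA parameter.) The mechanism Tim
describes is Linux's default first-touch placement: a page is physically
allocated on the NUMA node of the CPU that first writes to it, so pinning
a process pins its pages as a side effect. A minimal sketch of the effect,
assuming Linux with the libnuma headers installed:

/* first_touch.c: demonstrate Linux first-touch placement: a page lands
   on the NUMA node of the CPU that first writes it.
   Build (assumes libnuma): gcc first_touch.c -lnuma -o first_touch */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sched.h>    /* sched_setaffinity, cpu_set_t */
#include <numaif.h>   /* get_mempolicy, MPOL_F_NODE, MPOL_F_ADDR */

int main(void)
{
    /* Pin ourselves to CPU 0 so the "local" node is well defined. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    long pagesize = sysconf(_SC_PAGESIZE);
    char *buf = NULL;
    if (posix_memalign((void **)&buf, (size_t)pagesize,
                       (size_t)pagesize) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }
    buf[0] = 1;   /* first touch: physical placement happens here */

    /* Ask the kernel which node actually holds the page. */
    int node = -1;
    if (get_mempolicy(&node, NULL, 0, buf,
                      MPOL_F_NODE | MPOL_F_ADDR) != 0) {
        perror("get_mempolicy");
        return 1;
    }
    printf("page touched on CPU 0 landed on NUMA node %d\n", node);
    return 0;
}

On a multi-socket box, re-running this pinned to CPUs on different
sockets should report different nodes.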

Memory allocation policy really needs to be independent of processor
binding policy. The default memory policy (memory affinity) of "attempt
to allocate on the NUMA node of the CPU that made the allocation request,
but fall back as needed" is flawed in a number of situations. This is true
even when MPI jobs are given dedicated access to processors. A common one
is where the local NUMA node is full of pagecache pages (from the
checkpoint of the last job to complete). For sites that support
suspend/resume based scheduling, NUMA nodes will generally contain pages
from suspended jobs. Ideally, the new (suspending) job should pay a small
paging overhead up front (pushing out the suspended job's pages) to get
the right memory placement for the next six or however many hours of
execution.
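
For anyone who wants to experiment, strict placement can be requested
explicitly with set_mempolicy(2): under MPOL_BIND the kernel reclaims on
the chosen node (dropping clean pagecache, for instance) rather than
quietly falling back to a remote node. A hedged sketch, assuming Linux
and the libnuma headers:

/* strict_bind.c: request a strict MPOL_BIND policy on node 0; the kernel
   must then reclaim on node 0 (e.g. clean pagecache) instead of quietly
   spilling allocations to a remote node.
   Build (assumes libnuma): gcc strict_bind.c -lnuma -o strict_bind */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <numaif.h>   /* set_mempolicy, MPOL_BIND */

int main(void)
{
    /* Single-word nodemask: bit 0 set means "node 0 only". */
    unsigned long nodemask = 1UL;

    if (set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask) * 8) != 0) {
        perror("set_mempolicy");
        return 1;
    }

    /* From here on, this thread's allocations must be satisfied from
       node 0: each one either reclaims on node 0 or fails outright; it
       does not fall back to another node. */
    size_t sz = 64UL << 20;                /* 64 MiB */
    char *buf = malloc(sz);
    if (buf == NULL) {
        perror("malloc");
        return 1;
    }
    memset(buf, 0, sz);                    /* touch to force placement */
    printf("64 MiB placed strictly on node 0\n");
    free(buf);
    return 0;
}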

An mbind (MPOL_BIND) policy of binding to the one local NUMA node will not
work when a single process requires more memory than that local NUMA node
can provide. One scenario, sketched in code below, is a master-slave setup
where you might want:
   master (rank 0) bound to processor 0 but not memory bound
   slave (rank i) bound to processor i and memory bound to the local memory
         of processor i.

They really are independent requirements.
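
A hedged sketch of that master-slave split, assuming Linux, libnuma and a
naive rank-to-CPU mapping (rank i on CPU i; real code would take the
mapping from hwloc or the batch system):

/* rank_policy.c: bind every rank to a CPU, but apply a strict memory
   binding only to the slaves, leaving the master free to overflow.
   Build (assumes libnuma and an MPI wrapper compiler):
   mpicc rank_policy.c -lnuma -o rank_policy */
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>    /* sched_setaffinity, cpu_set_t */
#include <numa.h>     /* numa_available, numa_node_of_cpu, numa_set_membind */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (numa_available() < 0) {
        fprintf(stderr, "rank %d: no NUMA support\n", rank);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Processor binding for all ranks: rank i -> CPU i (illustrative). */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(rank, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    if (rank > 0) {
        /* Slaves: memory strictly bound to the CPU's local node. */
        struct bitmask *bm = numa_allocate_nodemask();
        numa_bitmask_setbit(bm, (unsigned int)numa_node_of_cpu(rank));
        numa_set_membind(bm);   /* strict: no fallback to remote nodes */
        numa_free_nodemask(bm);
    }
    /* Rank 0 (master) keeps the default policy: local if possible,
       free to fall back when its working set outgrows one node. */

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}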

Cheers,
David