Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] memory binding
From: David Singleton (David.Singleton_at_[hidden])
Date: 2010-12-13 16:22:33


On 12/14/2010 01:29 AM, Jeff Squyres wrote:
> On Dec 10, 2010, at 4:56 PM, David Singleton wrote:
>
>> Is there any plan to support NUMA memory binding for tasks?
>
> Yes.
>
> For some details on what we're planning for affinity, see the BOF slides that I presented at SC'10 on the OMPI web site (under "publications").
>

I didn't see memory binding in there explicitly.

>> Even with bind-to-core and memory affinity in 1.4.3 we were seeing 15-20%
>> variation in run times on a Nehalem cluster. This turned out to be mostly due
>> to bad page placement. Residual pagecache pages from the last job on a node (or
>> the memory of a suspended job in the case of preemption) could occasionally cause
>> a lot of non-local page placement. We hacked the libnuma module to MPOL_BIND
>> tasks to their local memory and eliminated the majority of this variability.
>> We are currently running with this as default behaviour since it's "the right
>> thing" for 99% of jobs (we have an environment variable to back off to affinity
>> for the rest).
>
> What OS and libnuma version are you running? It has been my experience that libnuma can lie on RHEL 5 and earlier. My (possibly flawed) understanding is that this is because of lack of proper kernel support; such "proper" kernel support was only added fairly recently (2.6.30something).

That's interesting. By "lie", do you mean processes are not really memory bound?
We're running 2.6.27.55 (and numactl 0.9.8-11.el5), and in quite a bit of testing
the placement has always looked correct.
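
The kind of check I mean is below - a minimal sketch (check_placement() is just an
illustrative name, not anything from Open MPI) that asks the kernel which NUMA node
actually backs each page of a buffer, using get_mempolicy() with
MPOL_F_NODE | MPOL_F_ADDR:

/* Minimal sketch: report the NUMA node actually backing each page.
 * Build against numactl-devel and link with -lnuma for <numaif.h>. */
#include <numaif.h>   /* get_mempolicy, MPOL_F_NODE, MPOL_F_ADDR */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void check_placement(char *buf, size_t len, long page)
{
    size_t off;
    for (off = 0; off < len; off += (size_t)page) {
        int node = -1;
        /* With MPOL_F_NODE|MPOL_F_ADDR, "node" receives the NUMA node
           currently backing the page at this address. */
        if (get_mempolicy(&node, NULL, 0, buf + off,
                          MPOL_F_NODE | MPOL_F_ADDR) == 0)
            printf("page at %p -> node %d\n", (void *)(buf + off), node);
    }
}

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t len = 64 * (size_t)page;
    char *buf = malloc(len);
    memset(buf, 0, len);          /* first touch forces allocation */
    check_placement(buf, len, page);
    free(buf);
    return 0;
}

Run under the binding in question, this shows exactly where pages landed rather than
just what the policy claims.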

>
> That aside, it's somewhat disappointing that MPOL_PREFERRED is not working well and that you had to switch to MPOL_BIND. :-(

I'm not sure it's disappointing - I think it's just to be expected. For sites that
drop caches, run a whole-node memhog, or reboot nodes between jobs, MPOL_PREFERRED
will do the right thing. For sites that are not so careful or that use suspend/resume
scheduling, memory overcommits and some amount of page reclaim or paging on job
startup will happen occasionally. Paying the extra cost of making sure that page
reclaim or paging results in ideal locality is definitely a big win for a job
overall. (Paging suspended jobs back in after they are resumed can undo some of
their ideal placement, but that can be handled.)
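
To be concrete about the difference (a rough sketch, not the actual Open MPI libnuma
maffinity code; bind_local_memory() is an illustrative name and local_node is assumed
to come from the core the rank was bound to), the whole change amounts to one mode
flag passed to set_mempolicy():

/* Rough sketch: set the task-wide memory policy for the local node,
 * either strictly (MPOL_BIND: reclaim/page on the local node rather
 * than spill to remote nodes) or preferred (MPOL_PREFERRED: fall back
 * to remote nodes when the local node is full).  Link with -lnuma. */
#include <numaif.h>   /* set_mempolicy, MPOL_BIND, MPOL_PREFERRED */
#include <stdio.h>

static int bind_local_memory(int local_node, int strict)
{
    unsigned long nodemask = 1UL << local_node;
    int mode = strict ? MPOL_BIND : MPOL_PREFERRED;

    /* maxnode = number of bits in the mask being passed */
    if (set_mempolicy(mode, &nodemask, sizeof(nodemask) * 8) != 0) {
        perror("set_mempolicy");
        return -1;
    }
    return 0;
}

Our environment variable just selects strict or not at startup, so individual jobs can
fall back to the preferred/affinity behaviour when they need it.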

>
> Should we add an MCA parameter to switch between BIND and PREFERRED, and perhaps default to BIND?

I'm not sure BIND should be the default for everyone - memory-imbalanced jobs might
page badly in that case. But, yes, we would like an MCA parameter to choose, allowing
sites to select BIND as their default if they wish. An mpirun option like --bind-to-mem
would need a preferred/affinity alternative, and I'm not sure of a nice notation/syntax
for that.

Cheers,
David