Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Mixing Linux's CPU-shielding with mpirun's bind-to-core
From: Siddhartha Jana (siddharthajana24_at_[hidden])
Date: 2013-08-18 08:51:31

Thanks for the reply,

> > My requirements:
> > 1. Avoid the OS from scheduling tasks on cores 0-7 allocated to my
> > process.
> > 2. Avoid rescheduling of processes to other cores.
> >
> > My solution: I use Linux's CPU-shielding.
> > [ Man page: ]
> > I create a cpuset called "socket1" with cores 8-15 in the dev fs. I
> > iterate through all the tasks in /dev/cpuset/tasks and copy them to
> > /dev/cpuset/socket1/tasks
> Most of these existing tasks are system tasks. Some actually *want* to
> run on specific cores outside of socket1. For instance some kernel
> threads are doing the scheduler load balancing on each core. Others are
> doing deferred work in the kernel that your application may need. I
> wonder what happens when you move them. The kernel may reject your
> request, or it may actually break things.

Yes, when I move all the system tasks, the movable kernel tasks migrate
without complaint. The ones that can't be moved return an error code, but
since their CPU usage is negligible, I decided to ignore them anyway.
Nothing really breaks.

> Also most of these tasks do nothing but sleeping 99.9% of the times
> anyway. If you're worried about having too many system tasks on your
> applications' core, just make sure you don't install useless packages
> (or disable some services at startup).
For my use case, I have ensured that the heavy tasks that I wanted to be
moved out of socket0 could be moved without complaints. The non-movable
ones, as I mentioned, were left as is.

> If you *really* want to have 100% CPU for your application on cores 0-7,
> be aware that other things such as interrupts will be stealing some CPU
> cycles anyway.

Noted. As mentioned, the tasks that really matter were safely moved to a
different socket.

> > I create a cpuset called "socket0" with cores 0-7 .
> > At the start of the application, (before MPI_Init()), I schedule my
> > MPI process on the cpuset as follows:
> > ------------------------------------------------------
> > sprintf(str,"/bin/echo %d >> /dev/cpuset/socket0/tasks ",mypid);
> > system(str);
> > ------------------------------------------------------
> > In order to ensure that my processes remain bound to the cores, I am
> > passing the --bind-to-core option to mpirun. I do this, instead of
> > using sched_setaffinity from within the application. Is there a chance
> > that mpirun's "binding-to-core" will clash with the above ?
> Make sure you also specified the NUMA node in your cpuset "mems" file
> too. That's required before the cpuset can be used (otherwise adding a
> task will fail). And make sure that the application can add itself to
> the cpuset, usually only root can add tasks to cpusets.
Yes, I have ensured all of these. The application has enough rights to add
itself to the cpuset.

> And you may want to open/write/close on /dev/cpuset/socket0/tasks and
> check the return values instead of this system() call.
Checked. Everything works as expected.

> If all the above works and does not return errors (you should check that
> your application's PID is in /dev/cpuset/socket0/tasks while running),
> bind-to-core won't clash with it, at least when using a OMPI that uses
> hwloc for binding (v1.5.2 or later if I remember correctly).
My concern is that hwloc is used before the application begins executing,
and so mpirun might use it to bind the application to different cores than
the ones I want it bound to. If there were a way to specify the cores
through the hostfile, this problem would be solved. Is it possible to
specify the cores in the hostfile?

> > While this solution seems to work temporarily, I am not sure whether
> > this is good solution.
> Usually the administrator or PBS/Torque/... creates the cpuset and
> places tasks in there for you.
Yes, this is what was done in my case for the kernel tasks.