Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] OpenMPI, PLPA and Linux cpuset/cgroup support
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-07-22 13:22:16

On Jul 22, 2009, at 11:17 AM, Sylvain Jeaugey wrote:

> I'm interested in joining the effort, since we will likely have the same
> problem with SLURM's cpuset support.


> > But as to why it's getting EINVAL, that could be wonky. We might want to
> > take this to the PLPA list and have you run some small, non-MPI examples to
> > ensure that PLPA is parsing your /sys tree properly, etc.
> I don't see the /sys implication here. Can you be more precise on which
> files are read to determine placement?

Check in opal/mca/paffinity/linux/plpa/src/libplpa/

> IIRC, when you are inside a cpuset, you can see all cpus (/sys should be
> unmodified) but calling sched_setaffinity with a mask containing a cpu
> outside the cpuset will return EINVAL.

Ah, that could be the issue.

> The only solution I see to solve
> this would be to get the "allowed" cpus with sched_getaffinity,
> which should be set according to the cpuset mask.

There are two issues here:

- what should OMPI do
- what should PLPA do

PLPA currently does two things:

1. provide a portable set/get affinity API (to isolate you from
whatever version you have in your linux install)
2. provide topology mapping information (sockets, cores)

PLPA does not currently deal with cpusets. If we want to expand PLPA
to somehow interact with cpusets, that should probably be brought up
on the PLPA mailing lists (someone made this suggestion to me about a
month or two ago and I haven't had a chance to follow up on it :-( ).

OMPI (as a whole -- meaning: including the ORTE layer) does the following:

1. decide whether to bind MPI processes or not
2. if we do bind, use the paffinity module to bind processes to
specific processors (the linux paffinity module uses PLPA to do the
actual binding -- PLPA is wholly embedded inside OMPI's linux
paffinity module)

And there are two layers involved here:

- the main ORTE logic saying both "yes, bind" and making the decision
as to which processors to bind to
- the linux paffinity component does a thin layer of translation
between ORTE's/OMPI's requests and calling the back-end PLPA library

As Ralph described, OMPI is currently fairly "dumb" about how it
chooses which processors it uses -- 0 to N-1. I think the issue here
is to make OMPI smarter about how it chooses which processors to use.
It could be in ORTE itself, or it could be in the linux paffinity
translation layer (e.g., linux paffinity component could report only
as many processors as are available in the cpuset...? And binding
could be relative to the cpuset...?).
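
To make the "relative to the cpuset" idea concrete, here is a minimal
sketch in plain C against the glibc affinity calls, not PLPA or the actual
paffinity component. It assumes, as suggested earlier in the thread, that
sched_getaffinity returns only the processors the cpuset allows. The
helper names (count_available, relative_to_physical, bind_relative) are
hypothetical and exist only for this illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Hypothetical helper: number of processors in the allowed mask.
   This is what a cpuset-aware layer could report instead of the
   total processor count of the machine. */
static int count_available(const cpu_set_t *allowed)
{
    assert(allowed != NULL);
    return CPU_COUNT(allowed);
}

/* Hypothetical helper: translate a cpuset-relative index (0 .. n-1)
   into the physical processor ID, or -1 if the index is out of range. */
static int relative_to_physical(const cpu_set_t *allowed, int rel)
{
    int seen = 0;

    assert(allowed != NULL);
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, allowed)) {
            if (seen == rel) {
                return cpu;
            }
            seen++;
        }
    }
    return -1;
}

/* Hypothetical helper: bind the calling process to the rel-th processor
   it is currently allowed to run on. Returns 0 on success, -1 on error. */
static int bind_relative(int rel)
{
    cpu_set_t allowed, target;
    int cpu;

    /* Assumption: inside a cpuset, this mask already reflects the cpuset. */
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) {
        return -1;
    }
    cpu = relative_to_physical(&allowed, rel);
    if (cpu < 0) {
        return -1;
    }
    CPU_ZERO(&target);
    CPU_SET(cpu, &target);
    return sched_setaffinity(0, sizeof(target), &target);
}
```

With something like this, a "0 to N-1" choice made by the upper layer
would index into the allowed processors rather than into physical
processor IDs, so a request for processor 0 inside a cpuset holding
processors 4-7 would bind to processor 4. Note that once a process is
bound, its own affinity mask shrinks, so the allowed set would need to be
captured before any binding happens.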

Jeff Squyres