
Hardware Locality Development Mailing List Archives


Subject: Re: [hwloc-devel] Cgroup resource limits
From: Christopher Samuel (samuel_at_[hidden])
Date: 2012-11-05 21:16:16


On 06/11/12 13:01, Ralph Castain wrote:

> Depends on the use-case. If you are going to direct-launch the
> processes (e.g., using srun), then you are correct.
> However, that isn't the case in other scenarios. For example, if
> you get an allocation and then use mpirun to launch your job, you
> definitely do *not* want the RM setting the cgroup constraints as
> the RM only launches the orteds - it never sees the MPI procs. The
> constraints are to apply to the individual procs as separate
> entities - if you apply them to the orteds, then all procs will be
> constrained to the same container. Ick.

That's not been my experience recently; for instance, Torque currently
creates a cpuset for your job containing all the cores you've been
given on each node, and you can then use mpirun/mpiexec to launch
orteds across all the nodes you've been given. Those processes are
then constrained to the allocation set up on each node, and are free
to bind themselves to the cores present within that cpuset should they
so wish.
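To illustrate the above, here's a minimal sketch (assuming a Linux node
with the cpuset cgroup mounted at the conventional cgroup-v1 path) of
how a launched process can see the restriction the resource manager
placed it under:

```shell
#!/bin/sh
# Show which cpuset this process was placed in. The path and the
# existence of /proc/self/cpuset are assumptions; they depend on the
# kernel config and on the RM having created a per-job cpuset.
cat /proc/self/cpuset 2>/dev/null

# Independently of the cgroup mount, the kernel reports the resulting
# CPU restriction here; tools like hwloc honour it when binding.
grep Cpus_allowed_list /proc/self/status
```

Anything the process then binds itself to (e.g. via hwloc) is a subset
of that `Cpus_allowed_list`, which is exactly the "bind within the
cpuset" behaviour described above.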

In the very beginning (when I was at VPAC and we were using MVAPICH2
rather than Open MPI) Torque would bind processes to a core within the
allocation, which worked fine for that, but of course broke in the way
you explain when we moved to Open MPI. I fixed that bug up very
quickly.. ;-)

We've only ever run Slurm on BlueGene where this isn't an issue, so I
don't know if that does things differently.

> Similarly, if you are running MapReduce, your application has to
> figure out what nodes to run on, how much memory will be required,
> etc. All that goes into the allocation request (made by the
> equivalent of mpirun in that scenario) sent to the RM. Again, the
> orteds need to set those constraints on a per-process basis.

But for the scheduler to be able to plan workload well, I believe that
once your job has started the best you can do is ask for less than you
have been given; otherwise you'd be free to game the system by queuing
a short, small job and, once it's started, asking for many more cores
or RAM.. :-)

> So we need the capability in ORTE to support the non-direct-launch
> cases.

I'm pretty sure we're agreeing here, just in different ways of
expressing ourselves.. :-)

--
 Christopher Samuel Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel_at_[hidden] Phone: +61 (0)3 903 55545
