Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] busy waiting and oversubscriptions
From: Gus Correa (gus_at_[hidden])
Date: 2014-03-26 19:21:03

On 03/26/2014 05:26 PM, Ross Boylan wrote:
> [Main part is at the bottom]
> On Wed, 2014-03-26 at 19:28 +0100, Andreas Schäfer wrote:
>> Ross-
>> On 09:08 Wed 26 Mar , Ross Boylan wrote:
>>> On Wed, 2014-03-26 at 10:27 +0000, Jeff Squyres (jsquyres) wrote:
>>>> On Mar 26, 2014, at 1:31 AM, Andreas Schäfer <gentryx_at_[hidden]> wrote:
> ...
>>> This seems to restate the premise of my question. Is it meant to lead
>>> to the answer "A process in busy wait blocks other users of the CPU to
>>> the same extent as any other process at 100%"?
>> Yes.
> Thanks for confirming.
>>>>>> At any rate, my question is whether, if I have processes that spend most
>>>>>> of their time waiting to receive a message, I can put more of them than
>>>>>> I have physical cores without much slowdown?
>>>>> AFAICS there will always be a certain slowdown. Is there a reason why
>>>>> you would want to oversubscribe your nodes?
>>>> Agreed -- this is not a good idea. It suggests that you should make your existing code more efficient -- perhaps by overlapping communication and computation.
>>> My motivation was to get more work done with a given number of CPUs, and
>>> also to find out how much of burden I was imposing on other users.
>>> My application consists of processes that have different roles. Some of
>>> the roles don't have much to do (they play important roles, but don't do
>>> much computation). My hope was that I could add them on without
>>> imposing much of a burden.
>> If you have a complex workflow with varying computational loads, then
>> you might want to take a look at runtime systems which allow you to
>> express this directly through their API, e.g. HPX[1]. HPX has proven to
>> run with high efficiency on a wide range of architectures, and with a
>> multitude of different workloads.
> Thanks for the pointer.
>>> Second, we do not operate in a batch queuing environment
>> Why not fix that?
> I'm not the sysadmin, though I'm involved in the group that sets policy.
> At one point we were using Sun's grid engine, but I don't think it's
> installed now. I'm not sure why.
> We have discussed putting in a batch queuing system and nobody was
> really pushing for it. My impression was (and probably still is) that
> it was more pain than gain. There is hassle not only for the sysadmin
> to set it up (and, I suppose, monitor it), but for users. Personally I
> run a lot of interactive parallel jobs (the interaction is on rank 0
> only). I have the impression that won't work under a batch system,
> though I could be wrong. I also had the impression we'd need to have an
> estimate of how long the job would run when we submit, and we don't
> always know.

> But I've never really used such a system, and may not appreciate what it
> would get us. The other reason we haven't bothered is that the load on
> the cluster was relatively light and contention was low. That is less
> and less true, which probably starts tipping the balance toward a
> queuing system.
>
> This is wandering off topic, but if you or anyone else could say more
> about why you regard the absence of a queuing system as a problem that
> should be fixed, I'd love to hear it.


Hi Ross

Some pros:
(I don't know of any cons.)

Torque+Maui, SGE/OGE, and Slurm are free.
There are commercial products as well.

Installation and initial configuration may take some effort,
but after that it is mostly peace of mind, and occasional tuning to the

You can build OpenMPI with support for them integrated (no need for a
hostfile to submit jobs; OpenMPI will use whatever nodes the queue
system gave you).
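
To make "whatever nodes the queue system gave you" a bit more concrete:
under Torque, a running job can see its allocation in the file named by
the PBS_NODEFILE environment variable, which lists one hostname per
granted slot. The Python sketch below only inspects that file. It is not
how OpenMPI itself obtains the node list, and the function name
slots_per_host is made up for illustration.

```python
import os
from collections import Counter

def slots_per_host(nodefile_path):
    """Count how many slots the queue system granted on each host.

    Assumes the Torque convention: one hostname per line, repeated
    once for every slot granted on that host.
    """
    with open(nodefile_path) as f:
        hosts = [line.strip() for line in f if line.strip()]
    return Counter(hosts)

if __name__ == "__main__":
    # PBS_NODEFILE is only set inside a Torque job.
    path = os.environ.get("PBS_NODEFILE")
    if path:
        for host, slots in slots_per_host(path).items():
            print(host, slots)
    else:
        print("Not running inside a Torque job (PBS_NODEFILE is unset).")
```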

If you build the queue system with cpuset control, a node can be shared
among several jobs, but the cpus/cores will be assigned specifically
to each job's processes, so that nobody steps on each other's toes.
(There is similar control over the memory used per job as well.)
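
The idea behind per-job core assignment can be sketched in a few lines
of Python. This is only a toy model of the bookkeeping, not the way any
particular queue system implements cpusets; assign_cores and its
arguments are hypothetical names.

```python
def assign_cores(total_cores, requests):
    """Give each job its own disjoint set of core ids.

    requests is a list of (job_name, n_cores) pairs, handled in order.
    A job that does not fit in the remaining free cores is left out
    of the result instead of sharing cores with another job.
    """
    free = list(range(total_cores))
    assignment = {}
    for job, n in requests:
        if n <= len(free):
            assignment[job] = free[:n]   # these cores now belong to this job only
            free = free[n:]
    return assignment

# An 8-core node shared by two jobs; the third request does not fit.
print(assign_cores(8, [("jobA", 4), ("jobB", 2), ("jobC", 4)]))
```

No core id shows up in two jobs, which is the whole point: sharing a
node does not mean sharing cores.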

Queue systems won't allow resources to be oversubscribed.
As it is now, what else but courtesy and a great deal of coordination
would guarantee that you and your colleagues won't launch
several computationally demanding jobs on the same node, using the same
cpus, perhaps using more memory than the available RAM,
maybe forcing the system to swap to disk, and ruining performance?
I was once at an organization that didn't want to use a queue system,
and where people would have to go knocking on doors
to ask things like: "Would you please release nodes 01 to 32?
You have processes leftover from a dead job running on them for a week,
taking 100% CPU, and there are no nodes available."
The queue system avoids that: it has courtesy and coordination built in,
so to speak.

You can configure the queue system with resource use policies ranging
from very simple to quite complex, with queues for specific types of
jobs, etc. You can start with a single queue and a first-in-first-out
job policy, then make it more complex as the workload increases.
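
For a feel of what the simplest policy does, here is a small Python
simulation of a single first-in-first-out queue that never
oversubscribes the cores. It is a sketch under simplifying assumptions
(one pool of identical cores, runtimes known in advance, no backfill),
and fifo_schedule is a made-up name, not part of any queue system.

```python
import heapq
from collections import deque

def fifo_schedule(total_cores, jobs):
    """Return {job_name: start_time} under a strict FIFO policy.

    jobs is a list of (name, cores, runtime) in submission order.
    The job at the head of the queue waits until enough cores are
    free, and jobs behind it wait too (no backfill).
    """
    queue = deque(jobs)
    running = []          # min-heap of (finish_time, cores_held)
    free = total_cores
    now = 0
    starts = {}
    while queue:
        name, cores, runtime = queue[0]
        if cores > total_cores:
            raise ValueError(f"{name} asks for more cores than exist")
        # Advance time until enough running jobs finish to free the cores.
        while free < cores:
            finish, released = heapq.heappop(running)
            now = max(now, finish)
            free += released
        queue.popleft()
        starts[name] = now
        free -= cores
        heapq.heappush(running, (now + runtime, cores))
    return starts

# Four cores: "a" takes them all, so "b" and "c" start once "a" is done.
print(fifo_schedule(4, [("a", 4, 10), ("b", 2, 5), ("c", 2, 5)]))
```

Priorities, separate queues and backfill are refinements on top of this
basic loop.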

Queue systems do support interactive jobs (even with X-windows GUIs, if
You submit the interactive job, the queue system puts you in
a free node, and you work normally there.

I hope this helps,
Gus Correa