Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] MPI and DRMAA
From: Reuti (reuti_at_[hidden])
Date: 2010-03-11 07:57:04


Am 11.03.2010 um 03:03 schrieb Brian Smith:

> This may seem like an odd query (or not; perhaps it has been brought up
> before). My work recently involves HPC usability, i.e. making things
> easier for new users by abstracting away the scheduler. I've been
> working with DRMAA for interfacing with DRMs and it occurred to me: what
> would be the advantage to letting the scheduler itself handle farming
> out MPI processes as individual tasks rather than having a wrapper like
> mpirun to handle this task via ssh/rsh/etc.?
>
> I thought about MPI2's ability to do dynamic process management and how
> scheduling environments tend to allocate static pools of resources for
> parallel tasks. A DRMAA-driven MPI would be able to request that the
> scheduler launch these tasks as resources become available, enabling
> scheduled MPI jobs to dynamically add and remove processors during
> execution. Several applications that I have worked with come to mind,
> where pre-processing and other tasks are non-parallel whereas the
> various solvers are. Being able to dynamically spawn processes based on
> where you are in this work-flow could be very useful here.

If I understand the direction of the calls correctly, the MPI library
should issue the startup of tasks to the DRM via DRMAA. So the
complete flow would be:

user => DRMAA.a for MPI application => scheduled MPI application =>
DRMAA.b for tasks => scheduled MPI tasks
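
Purely as a toy illustration of this two-stage flow (no real DRMAA
here; ToyDRM, submit and run_next are invented names), a few lines of
Python: the first submit is the DRMAA.a step, and the scheduled
application itself performs the DRMAA.b step by submitting its tasks
back to the DRM:

```python
class ToyDRM:
    """Minimal stand-in for a DRM that just queues and runs jobs."""
    def __init__(self):
        self.queue = []   # pending (job_id, name, payload) entries
        self.log = []     # (job_id, name) of jobs that have run

    def submit(self, name, payload):
        """What a DRMAA run-job call would do: enqueue a job."""
        job_id = len(self.log) + len(self.queue) + 1
        self.queue.append((job_id, name, payload))
        return job_id

    def run_next(self):
        """Scheduler dispatch: run the oldest queued job."""
        job_id, name, payload = self.queue.pop(0)
        self.log.append((job_id, name))
        payload(self)      # the running job may submit further jobs
        return job_id

def mpi_application(drm):
    # DRMAA.b step: the scheduled application asks the DRM to start
    # its parallel tasks instead of rsh/ssh-ing them out itself.
    for rank in range(2):
        drm.submit(f"mpi_task_{rank}", lambda d: None)

drm = ToyDRM()
drm.submit("mpi_application", mpi_application)   # DRMAA.a step
while drm.queue:
    drm.run_next()

print([name for _, name in drm.log])
# -> ['mpi_application', 'mpi_task_0', 'mpi_task_1']
```

In a real setup both stages would of course go through the DRMAA
bindings against the actual DRM, not an in-process queue.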

IMO the DRMAA.b step must become available at some point, as none of
the queuing systems I have access to can cope with the varying needs
of a job during its lifetime. Besides raising and lowering the number
of cores you need, the same applies to memory requests. It came up
quite often on the SGE mailing list that a job needs a certain amount
of memory for a certain time:

- 2 GB for 4 hrs
- 4 GB for 20 min
- 1 GB for 6 hrs

For now there is no interface to let a running application tell the
DRM its changed needs - you can only submit the job with the maximum
request. As you wouldn't like to have your job halted in the middle,
it would need a new syntax in DRMAA to let the DRM know both the
maximum and the current needs, so that the gaps could be filled with
other jobs. These other jobs would also need one extension: some kind
of "suspendable" flag. These nice jobs could then run in the leftover
resources, but could be halted at any point (or pushed out of the
system) for some time.
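
To put a number on how much a static maximum request wastes, a quick
back-of-the-envelope calculation in plain Python for the memory
profile listed above (the figures are only the example numbers from
that list):

```python
# Memory profile of the hypothetical job from the list above:
# (gigabytes, hours) per phase.
phases = [(2, 4.0), (4, 20 / 60), (1, 6.0)]

peak_gb = max(gb for gb, _ in phases)
total_hours = sum(hours for _, hours in phases)

# What the job actually uses, integrated over time (GB-hours) ...
actual = sum(gb * hours for gb, hours in phases)
# ... versus what a static peak-sized request reserves for the
# whole runtime.
reserved = peak_gb * total_hours

print(f"reserved {reserved:.1f} GB-h, used {actual:.1f} GB-h, "
      f"wasted {100 * (1 - actual / reserved):.0f}%")
# -> reserved 41.3 GB-h, used 15.3 GB-h, wasted 63%
```

Memory that other (suspendable) jobs could have used in those gaps.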

> It also occurred to me that commercial application vendors tend to
> roll-their-own when it comes to integrating their applications with an
> MPI library. I've seen applications use HP-MPI, MPICH, MPICH2,
> Intel-MPI, (and thankfully, recently) OpenMPI and then proceed to
> butcher the execution mechanisms to such an extent that it makes
> integration with common DRM systems quite a task. With the exception of
> OpenMPI, none of these libraries provides turn-key compatibility with
> most of the major DRMs and each requires some degree of manual
> integration and testing for use in a multi-user production environment.
>
> I would think that vendors would be falling over themselves to
> integrate OpenMPI with their applications for this very reason alone.
> Instead, some opt to develop their own scheduling environments! Don't
> they have bean counters that sit around and gripe about duplicated
> work?

I think there are some reasons: a) history - maybe their custom-built
scheduling was already available at a time when there was no
widespread use of DRMs; b) at that time it was one big machine with
many users, and not the nowadays common clusters of nodes; maybe also
a third point, c) due to limited resources they were of the opinion
that users would use only their application and be the only users of
the cluster; and d) they wanted to provide a workflow solution, even
for someone who doesn't like to install a queuing system just on a
local workstation (I install SGE even locally on each user's machine
for small things - same syntax as in the cluster, and their machines
won't get overloaded - but I'm sure that's not common practice).

But you are right, this leads to a situation where you have to
combine two queuing systems. Let's take one example: the applications
from Schrodinger. When you have machines with only their software,
then you can teach your users to use their commands. When you want to
use a DRM anyway, because you have other applications and groups of
users: there are hooks available to forward Schrodinger's "foobar
kill" to a "qdel" to integrate it with various queuing systems. But
this means the users have to think about it: I kill job types A and D
with "qdel", but for B and C I have to use "foobar kill" - not to
mention that you have two <jobid>s to handle. So far I have failed to
leave their queuing system out and start the jobs directly.
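
The kind of forwarding hook described above can be sketched as a tiny
translation layer. Everything here is hypothetical ("foobar" is the
placeholder used above, and the id mapping and qdel string are just
illustrations; a real hook would shell out to the DRM's actual qdel
and be filled at submission time):

```python
# Toy sketch of a kill-forwarding hook: remember which DRM <jobid>
# belongs to which vendor <jobid>, so the vendor's own kill command
# can be translated into a qdel for the DRM.
import shlex

# vendor <jobid> -> DRM <jobid>; a real hook would record this pair
# when the job is submitted through both systems.
id_map = {"foobar-17": "sge-4711"}

def forward_kill(vendor_cmd):
    """Translate e.g. 'foobar kill foobar-17' into a qdel call."""
    prog, action, vendor_id = shlex.split(vendor_cmd)
    if prog == "foobar" and action == "kill":
        return f"qdel {id_map[vendor_id]}"
    raise ValueError(f"unhandled command: {vendor_cmd}")

print(forward_kill("foobar kill foobar-17"))
# -> qdel sge-4711
```

With such a hook the users only ever type one kill command, and the
double bookkeeping of the two <jobid>s stays inside the wrapper.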

> Then it occurred to me: with the exception of being able to easily
> launch an MPI job with OpenMPI, the ability to monitor it from within
> the application is still dependent on the vendor integrating with
> various DRMs! This is another area where a DRMAA RAS can come in handy.
> There are nice bindings for monitoring tasks and getting an idea of
> where you are in execution without having to resort to kludgey
> shell-script wrappers tailing output files.
>
> Anyway, it's been a frustrating couple of weeks dealing with several
> commercial vendors and integrating their applications with

Yeah, I know...

-- Reuti

> our DRM and my mind has been trying to think of a solution that could
> save all of us a lot of work (though, at the same time, raise job
> security concerns in such turbulent times ;-/ ). What say you, MPI
> experts? Many thanks for your thoughts!
>
> -Brian
> _______________________________________________
> users mailing list
> users_at_[hidden]