
Open MPI User's Mailing List Archives


Subject: [OMPI users] FW: slurm and all-srun orterun
From: Sacerdoti, Federico (Federico.Sacerdoti_at_[hidden])
Date: 2008-03-06 14:39:14


Ralph, here is Moe's response. The srun options he mentions look
promising: they can signal an otherwise happy orted daemon (sitting on a
waitpid) that something is amiss elsewhere in the job. Do orteds change
their session ID?

Thanks Moe,
Federico

-----Original Message-----
From: jette1_at_[hidden] [mailto:jette1_at_[hidden]]
Sent: Wednesday, March 05, 2008 2:21 PM
To: Sacerdoti, Federico; Open MPI Users
Subject: RE: [OMPI users] slurm and all-srun orterun

Slurm and its APIs are available under the GPL license.
Since Open MPI is not available under the GPL license, it
cannot link with the Slurm APIs; however, virtually all
of that API functionality is available through existing
Slurm commands. The commands are clearly not as simple to
use as the APIs, but if you encounter any problems using
the commands we can certainly make changes to facilitate
their use. For example, Slurm communicates with the Maui
and Moab schedulers using an interface that loosely
resembles XML. We are also prepared to provide additional
functionality as needed by Open MPI.

Regarding premature termination of processes that Slurm
spawns, the srun command has a couple of options that may
prove useful:

-K, --kill-on-bad-exit
      Terminate a job if any task exits with a non-zero exit code.

-W, --wait=seconds
      Specify how long to wait after the first task terminates before
      terminating all remaining tasks. A value of 0 indicates an
      unlimited wait (a warning will be issued after 60 seconds). The
      default value is set by the WaitTime parameter in the slurm
      configuration file (see slurm.conf(5)). This option can be
      useful to ensure that a job is terminated in a timely fashion
      in the event that one or more tasks terminate prematurely.
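
For example, the two options can be combined so that one failed task
triggers prompt teardown of its peers (the application name here is
purely illustrative; this requires a running Slurm cluster):

```shell
# Kill the whole job if any task exits non-zero, and give the
# remaining tasks at most 60 seconds after the first one terminates.
srun --kill-on-bad-exit --wait=60 ./my_app
```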

Any tasks launched outside of Slurm's control (e.g. rsh) are not
purged on job termination. Slurm locates spawned tasks and any of
their children using the configured ProcTrack plugin, of which
several are available. If you use the SID (session ID) plugin
and spawned tasks change their SID, Slurm will no longer track
them. Several reliable process tracking mechanisms are available,
but some do require kernel changes. See "man slurm.conf" for more
information.
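
For reference, the tracking mechanism is selected in slurm.conf; a
minimal fragment (the value shown is just one of the available
plugins) might look like:

```
# slurm.conf fragment - selects how Slurm tracks spawned processes.
# proctrack/linuxproc and proctrack/pgid need no kernel support, but
# can lose processes that change their session/process group or are
# reparented to init; the kernel-assisted plugins are more reliable.
ProcTrackType=proctrack/linuxproc
```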

Moe

At 11:16 AM -0500 3/5/08, Sacerdoti, Federico wrote:
>Thanks Ralph,
>
>First, we would be happy to test the slurm direct launch capability.
>Regarding the failure case, I realize that the IB errors do not directly
>affect the orted daemons. This is what we observed:
>
>1. Parallel job started
>2. IB errors caused some processes to fail (but not all)
>3. slurm tears down entire job, attempting to kill all orted and their
>children
>
>We want this behavior: if any process of a parallel job dies, all
>processes should be stopped. The orted daemons in charge of processes
>that did not fail are the problem, as slurm was not able to kill them.
>Sounds like this is a known issue in openmpi 1.2.x.
>
>In any case, the new direct launching methods sound promising. I am
>surprised there are licensing issues with Slurm, is this a GPL-and-BSD
>issue? I am CC'ing slurm author Moe; he may be able to help.
>
>Thanks again and I look forward to testing the direct launch,
>Federico
>
>
>-----Original Message-----
>From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On
>Behalf Of Ralph Castain
>Sent: Monday, March 03, 2008 8:19 PM
>To: Open MPI Users <users_at_[hidden]>
>Cc: Ralph Castain
>Subject: Re: [OMPI users] slurm and all-srun orterun
>
>Hello
>
>I don't monitor the user list any more, but a friendly elf sent this
>along to me.
>
>I'm not entirely sure what problem might be causing the behavior you are
>seeing. Neither mpirun nor any orted should be impacted by IB problems
>as they aren't MPI processes and thus never interact with IB. Only
>application procs touch the IB subsystem - if an application proc fails,
>the orted should see that and correctly order the shutdown of the job.
>So if you are having IB problems, that wouldn't explain daemons failing.
>
>If a daemon is aborting, that will cause problems in 1.2.x. We have
>noted that SLURM (even though the daemons are launched via srun) doesn't
>always tell us when this happens, leaving Open MPI vulnerable to "hangs"
>as it attempts to clean up and finds it can't. I'm not sure why you
>would see a daemon die, though - the fact that an application process
>failed shouldn't cause that to happen. Likewise, it would seem strange
>that the application process would fail and the daemon not notice - this
>has nothing to do with slurm, but is just a standard Linux "waitpid"
>method.
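
[A minimal sketch of that waitpid-style mechanism, for illustration only
(this is not Open MPI's actual code): a parent launches a child, reaps
it, and inspects the exit status, here using the shell's wait builtin.]

```shell
# Start a child that fails, then reap it and check its exit status -
# analogous to the orted's waitpid() over an application process.
sh -c 'exit 3' &
child=$!
wait "$child"
status=$?
echo "child exited with status $status"
if [ "$status" -ne 0 ]; then
    echo "would now order shutdown of the job"
fi
```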
>
>The most likely reason for the behavior you describe is that an
>application process encounters an IB problem which blocks communication
>- but the process doesn't actually abort or terminate, it just hangs
>there. In this case, the orted doesn't see the process exit, so the
>system doesn't know it should take any action.
>
>That said, we know that 1.2.x has problems with clean shutdown in
>abnormal situations. Release 1.3 (when it comes out) addresses these
>issues and appears (from our testing, at least) to be much more reliable
>about cleanup. You should see a definite improvement in the detection of
>process failures and subsequent cleanup.
>
>As for your question, I am working as we speak on two new launch modes
>for Open MPI:
>
>1. "direct" - this uses mpirun to directly launch the application
>processes without use of the intermediate daemons.
>
>2. "standalone" - this uses the native launch command to simply launch
>the application processes, without use of mpirun or the intermediate
>daemons.
>
>The initial target environments for these capabilities are TM and SLURM.
>The latter poses a bit of a challenge as we cannot use their API due to
>licensing issues, so it will come a little later. We have a design for
>getting around the problem - the ordering is more driven by priorities
>than anything technical.
>
>The direct launch capability -may- be included in 1.3 assuming it can be
>completed in time for the release. If not, it will almost certainly be
>in 1.3.1. I'm expecting to complete the TM version in the next few days,
>and perhaps get the SLURM version working sometime this month - but they
>will need validation before being included in an official release.
>
>I can keep you posted if you like - once this gets into our repository,
>you are certainly welcome to try it out. I would welcome feedback on it.
>
>Hope that helps
>Ralph
>
>
>>> From: "Sacerdoti, Federico" <Federico.Sacerdoti_at_[hidden]>
>>> Date: March 3, 2008 12:44:39 PM EST
>>> To: "Open MPI Users" <users_at_[hidden]>
>>> Subject: [OMPI users] slurm and all-srun orterun
>>> Reply-To: Open MPI Users <users_at_[hidden]>
>>>
>>> Hi,
>>>
>>> We are migrating to openmpi on our large (~1000 node) cluster, and
>>> plan to use it exclusively on a multi-thousand core infiniband
>>> cluster in the near future. We had extensive problems with parallel
>>> processes not dying after a job crash, which was largely solved by
>>> switching to the slurm resource manager.
>>>
>>> While orterun supports slurm, it only uses the srun facility to
>>> launch the "orted" daemons, which then start the actual user
>>> processes themselves. In our recent migration to openmpi, we have
>>> noticed occasions where orted did not correctly clean up after a
>>> parallel job crash. In most cases the crash was due to an infiniband
>>> error. Most worryingly, slurm was not able to clean up the orted,
>>> and it along with user processes were left running.
>>>
>>> At SC07 I was told that there is some talk of using srun to launch
>>> both orted and user processes, or alternatively use srun only.
>>> Either would solve the cleanup problem, in our experience. Is Ralph
>>> Castain on this list?
>>>
>>> Thanks,
>>> Federico
>>>
>>> P.S.
>>> We use the proctrack/linuxproc slurm process tracking plugin. As
>>> noted in the config man page, this may fail to find certain
>>> processes and explain why slurm could not clean up orted
>>> effectively.
>>>
>>> man slurm.conf(5), version 1.2.22:
>>> NOTE: "proctrack/linuxproc" and "proctrack/pgid" can fail to
>>> identify all processes associated with a job since processes can
>>> become a child of the init process (when the parent process
>>> terminates) or change their process group. To reliably track all
>>> processes, one of the other mechanisms utilizing kernel
>>> modifications is preferable.
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
>_______________________________________________
>users mailing list
>users_at_[hidden]
>http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Morris "Moe" Jette       jette1_at_[hidden]                 925-423-4856
Integrated Computational Resource Management Group   fax 925-423-6961
Livermore Computing            Lawrence Livermore National Laboratory
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
"The problem with the world is that we draw the circle of our family
  too small."  - Mother Teresa
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++