Open MPI Development Mailing List Archives

From: Ralph Castain (rhc_at_[hidden])
Date: 2007-07-28 09:04:11

That's fine - please review the standard and let us know.

Meantime, let me explain how the persistent daemon operations currently work
and some of the problems that drove us there (and continue to plague that

First, it is critical to understand the following point: all allocation
operations are performed -solely- at the HNP. There was a time when we
performed them at mpirun and passed the results to the HNP. However, this
caused a lot of trouble in managed environments as users couldn't keep
straight what their jobs were doing across multiple login sessions. I'll
explain more about that later in this note. For now, just keep in your mind
that the reading of any allocation is done at the HNP.

It is easiest to explain how things currently operate by considering the
following use-case. A user logs into a managed environment, obtains an
allocation (we'll call it allocation A), and starts a persistent daemon:

orted --persistent --seed --scope public --universe foo

The orted starts up and daemonizes, but it does -not- read the local
allocation at this time because it hasn't been ordered to launch anything
yet. Accordingly, the node segment of the GPR is empty.

Next, the user executes (we'll call this mpirun-A):

mpirun --universe foo -np 10 ./my_app

Mpirun-A connects to the persistent orted, and then - and this is critical -
selects only the proxy components for the RDS, RAS, RMAPS, PLS, and RMGR
frameworks. It then begins to execute the launch sequence, which - since it
is using the proxy components - consists of nothing more than sending a
sequence of commands to the persistent daemon.

For the interests of this discussion, the command sequence begins with an
order to run the RDS framework. The persistent orted checks to see if there
are any hostfiles on -its- command line. Let me emphasize that point here -
it does -not- know if the user put a -hostfile option on the mpirun command
line! It only looks at its own MCA param, which means it will look in its
environment, command line, or a default hostfile location. Any hostfile
specified on the mpirun command line -is ignored-.

In this use case, the persistent orted does not see a hostfile, so the RDS
does nothing. Mpirun-A then orders the execution of the RAS framework. The
persistent orted checks to see if any nodes are on the node segment. If
there are, it would simply stop right there and do nothing. In this case,
the segment is empty, so it runs the local RM component and gets
allocation-A, which it then stores on the node segment.

Mpirun-A now proceeds to execute the remainder of the launch, and the my_app
procs start.

Okay, now the user opens a new window and logs back in to the same machine.
For fun, let us assume they get another allocation in this login session
from the RM (call it allocation B). They now execute:

mpirun --universe foo -np 50 -hostfile bar ./my_other_app

We'll call this mpirun-B. So what happens here? Well, mpirun-B will connect
to the persistent orted running on this system since we told it to do so. It
will therefore select only the proxy components, and then initiate the
launch sequence.

The first step will be to command the persistent orted to run the RDS framework.
The orted will look at its MCA param and not see a hostfile, so nothing is
done. Mpirun-B will then order the orted to run the RAS framework - and here
is where the fun begins. Because there already are nodes on the node
segment, the RAS -does not execute any components-. Thus, the allocation in
the node segment remains unchanged. This mpirun-B will therefore launch the
my_other_app procs on the same nodes being used by my_app!
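
To make the sequence above concrete, here is a toy Python model of the read-once behavior. It is only a sketch under assumptions: the class, method, and node names are hypothetical and are not ORTE code, and the RDS step is reduced to "did the orted itself see a hostfile". It shows the allocation being read a single time at the HNP and both jobs landing on the same nodes.

```python
# Toy model of the persistent-daemon behavior described above.
# Every name here (PersistentOrted, handle_launch, "a1", ...) is hypothetical.

class PersistentOrted:
    def __init__(self, local_allocation, own_hostfile=None):
        # Allocation visible in the session where the orted was started.
        self.local_allocation = list(local_allocation)
        # Hostfile known to the orted itself (its own params), if any.
        self.own_hostfile = own_hostfile
        # Node segment stays empty until the first launch is ordered.
        self.node_segment = []

    def run_rds(self):
        # Only the orted's own hostfile setting is consulted.
        return self.own_hostfile is not None

    def run_ras(self):
        # Read the local allocation only if the node segment is empty.
        if not self.node_segment:
            self.node_segment = list(self.local_allocation)

    def handle_launch(self, mpirun_hostfile=None):
        # mpirun_hostfile is accepted only to show it is never looked at.
        self.run_rds()
        self.run_ras()
        return list(self.node_segment)


# The orted was started inside allocation A.
orted = PersistentOrted(local_allocation=["a1", "a2"])

nodes_a = orted.handle_launch()                       # mpirun-A
nodes_b = orted.handle_launch(mpirun_hostfile="bar")  # mpirun-B, other session

print(nodes_a, nodes_b)
```

In this model neither allocation B nor the hostfile passed to mpirun-B ever reaches the decision, which is the behavior that surprises users.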

The problem has always been that there is no consistent treatment of this
use-case across the various MPIs out there. We have, therefore, been caught
in that constant struggle between "camps" that understandably want to
minimize any changes in behavior that their users have to absorb.

Our first discussion was about the use of a "lamboot-like" command to set up
a virtual machine, and then require that all mpiruns execute only
within that VM. This would solve the above confusion, but there was general
disapproval of the lamboot-first methodology. Accordingly, we rejected that
approach, and - after a lot of discussion - settled on the idea of creating
a "universe" to which mpiruns from any login session could connect.

Initially, we would read the allocation at the mpirun location, and then
simply communicate it to the HNP (in this case, that would be the persistent
orted). However, while this worked fine for hostfile-based systems, this
caused problems in managed environments as it led to confusion over which
allocation was being used in the above use-case. If instead of mpirun-A
executing first, the user in the prior example had executed mpirun-B first,
then we would have wound up with everything executing in allocation B.

We then fixed that problem by tracking allocations according to jobids.
Since each mpirun had a separate jobid, we could just say "you run your job
inside your own allocation". That solved the timing issue, but opened
another set of problems.

(a) the orted was only alive while the session within which it was created
existed. So, if mpirun-A finished first and the user logged out (or the
system logged them out due to inactivity), then mpirun-B lost its HNP! We
had - and still have - no reliable way of detecting and responding
to the loss of an HNP. Thus, more often than not, we wound up with stranded
processes in job-B and lots of complaints about cleanup.

(b) the bookkeeping became very complex when we started dealing with
dynamics. For example, are comm_spawn'd children required to execute solely
within the allocation of their parents? That would make sense, but now I
have to track job genealogy so I know who is a child of whom, so I can
assign them to the right allocation.

(c) what to do about hostfile and -host. That discussion is in another
thread - it gets uglier in this mode.

(d) how to detect and handle multiple reads of the same allocation. Suppose
that mpirun-B had been in the same login session as mpirun-A. If I read the
allocation within mpirun and pass it to the orted, now the orted will get
the same allocation sent to it that it got from mpirun-A - yet the number of
slots in that allocation didn't actually change! How do I know this is the
-same- allocation, and that I should just ignore the second set of
information? If I just assign the "new" allocation to job-B as if it were
totally independent, then I will map my_other_app's procs right on top of
my_app's procs since I will have no idea that those slots were already in
use.
So we fixed those problems by dictating that the allocation is only read
-once-, and only at the HNP itself. Note that this doesn't solve (a) - this
remains a consistent problem on managed environments which has only been
mitigated by the fact that few people (to date) have been using
connect/accept with Open MPI (those that do have been bitten and warned).

Is the system perfect and doing what it should? Obviously not. However, that
isn't because we were "stupid" or "not thinking" - it is because the desired
behavior has never been clearly defined, there are multiple camps that have
strong and contradictory opinions, and there are technical problems with
just about any choice you make.

What I was hoping to do with this RFC is kick off this discussion. How -do-
you want this to work? Do we forbid cross-connection of mpiruns in different
login sessions? Note that this was a "mandatory" requirement when we
started, but perhaps - in the light of experience - it should be
reconsidered. How do you want to handle hostfile and -host in the persistent
daemon scenario described above? Etc.

Please let me know - to be honest, I am rather tired of going around in
circles on this behavior.

Thanks - and I hope that helps explain why "fixing" persistent orted is
hardly straightforward.


PS. As to why mpirun can't be run in the background, we should start another
thread on that issue. In brief, it is a combination of problems with the
event library, progress engine, and IOF.

On 7/27/07 4:42 PM, "Aurelien Bouteiller" <bouteill_at_[hidden]> wrote:

> I basically agree with Terry, even if your proposal would solve all
> the issue I currently face. I think we need to read the MPI2 standard
> to make sure we are not on the brink of breaking the standard.
> Aurelien
> On Jul 27, 2007, at 10:13 , Ralph Castain wrote:
>> On 7/27/07 7:58 AM, "Terry D. Dontje" <Terry.Dontje_at_[hidden]> wrote:
>>> Ralph Castain wrote:
>>>> WHAT: Proposal to add two new command line options that will allow us to
>>>> replace the current need to separately launch a persistent daemon to
>>>> support connect/accept operations
>>>> WHY: Remove problems of confusing multiple allocations, provide a cleaner
>>>> method for connect/accept between jobs
>>>> WHERE: minor changes in orterun and orted, some code in rmgr and each pls
>>>> to ensure the proper jobid and connect info is passed to each
>>>> app_context as it is launched
>>> It is my opinion that we would be better off attacking the issues of
>>> the persistent daemons described below than creating a new set of
>>> options to mpirun for process placement. (more comments below on
>>> the actual proposal).
>> Non-trivial problems - we haven't figured them out in three years of
>> occasional effort. It isn't clear that they even -can- be solved when
>> considering the problem of running in multiple RM-based allocations.
>> I'll try to provide more detail on the problems when I return from
>> my quick trip...