Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] OMPI/ORTE and tools
From: Ralph H Castain (rhc_at_[hidden])
Date: 2008-01-22 10:19:01


On 1/19/08 6:31 AM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:

> Ralph --
>
> I'm a little confused as to what you're providing. In all 3 of the
> scenarios you describe, you're saying that the external tool can
> connect to the HNP for one or more jobs and execute a few discrete
> functions:
>
> - find procs and/or jobs running under that HNP
> - querying status of procs and/or jobs
> - querying status of nodes
> - spawning a new job
> - terminating a job
>

Actually, that isn't quite correct - sorry for the confusion. What I was trying
to say was that you could connect via a simple wire protocol (scenario #1)
to pass a few discrete commands and queries to an existing mpirun (and/or
persistent virtual machine). The HNP "listens" on the same daemon command
socket that it always opens, so there is no "new" socket for this
functionality.

The advantages of this approach are: (a) the tool only calls simple library
functions to pass commands/queries to the HNP and get answers back. Any
changes in APIs within ORTE are now totally hidden from the tool; (b) the
size of the required comm library is much smaller than all of ORTE, so the
tool gets a smaller memory footprint; (c) the tool "lives" totally
independently of the application, so you can quit (and later restart and
reconnect) the tool without disturbing the application.

Disadvantages are: (a) you only get access to a limited set of queries
and/or commands - what I was requesting was input on commands people would
like that I might have missed; and (b) the mpirun and/or virtual machine
must be started separately before the tool can connect to them (however, the
tool can be started first and simply be told to "look for an mpirun" after
the mpirun is issued).
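
To make the scenario #1 exchange concrete, here is a minimal sketch of a tool sending discrete commands over a socket and reading the answers back. Everything in it is an assumption made for illustration: the command codes, the 5-byte header, and the fake_hnp stand-in are hypothetical and are not the actual ORTE wire format or library calls, which are not spelled out in this thread.

```python
import socket
import struct
import threading

# Hypothetical command codes (assumption: the real protocol's codes differ).
CMD_LIST_JOBS = 1
CMD_TERMINATE_JOB = 2

# Hypothetical framing: 1-byte command code + 4-byte payload length.
HEADER = struct.Struct("!BI")


def send_msg(sock, code, payload=b""):
    """Send one framed message."""
    sock.sendall(HEADER.pack(code, len(payload)) + payload)


def recv_exact(sock, n):
    """Read exactly n bytes, or raise if the peer closed the connection."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return buf


def recv_msg(sock):
    """Receive one framed message as (code, payload)."""
    code, length = HEADER.unpack(recv_exact(sock, HEADER.size))
    return code, recv_exact(sock, length)


def fake_hnp(sock, jobs):
    """Stand-in for the HNP side: answer queries until the tool disconnects."""
    try:
        while True:
            code, payload = recv_msg(sock)
            if code == CMD_LIST_JOBS:
                send_msg(sock, code, ",".join(sorted(jobs)).encode())
            elif code == CMD_TERMINATE_JOB:
                jobs.discard(payload.decode())
                send_msg(sock, code, b"ok")
            else:
                send_msg(sock, code, b"unsupported")
    except ConnectionError:
        pass  # the tool went away; the stand-in "application" is unaffected
    finally:
        sock.close()


def query(sock, code, payload=b""):
    """Tool side: send one command and return the decoded reply."""
    send_msg(sock, code, payload)
    _, reply = recv_msg(sock)
    return reply.decode()


def demo():
    """Connect a tool to the stand-in HNP, issue three commands, disconnect."""
    tool_end, hnp_end = socket.socketpair()
    jobs = {"job-1", "job-2"}
    server = threading.Thread(target=fake_hnp, args=(hnp_end, jobs))
    server.start()
    before = query(tool_end, CMD_LIST_JOBS)
    status = query(tool_end, CMD_TERMINATE_JOB, b"job-1")
    after = query(tool_end, CMD_LIST_JOBS)
    tool_end.close()  # quitting the tool does not disturb the job list
    server.join()
    return before, status, after
```

The tool side only needs send_msg, recv_msg and query, which is the point of advantage (a) above: whatever the HNP does internally to answer a command stays hidden behind the message format.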

Scenario #2 is identical to what we have in the code releases today. In this
mode, the tool calls "orte_init" and sets itself up as an HNP. It then uses
the ORTE APIs to execute the commands - e.g., calling orte_plm.spawn to
launch the specified application. The tool can also launch any distributed
"probes" (i.e., processes needed by the tool but not part of the application
- e.g., to monitor an application's resource usage) on the backend nodes, if
desired, by simply calling orte_plm.spawn for a second "app" that consists
of the probe executable.

Advantages: full access to all ORTE functionality and internal data.

Disadvantages: (a) the tool's code may have to be updated to follow changes
in ORTE internal APIs; (b) the tool must stay alive throughout execution of
the application.

Scenario #3 is somewhat of a combination of the prior two. If you invoke
mpirun to launch an application into the background, you can subsequently
invoke mpirun again to launch a set of distributed "probes" (as described
above) to monitor that application. In this case, you could (if desired)
have one or more of the "probe" processes contact the HNP via the simple
wire protocol to issue commands. Or you could just have the processes report
(via stdout or files) whatever info they are monitoring.

The point in this scenario was mainly to show that you could launch a
distributed tool without dealing with the ORTE interfaces - the tool's procs
can either just do their own thing, or can use the wire protocol to
communicate with the application's HNP. In this case, the tool is again
independent of the application being monitored, so you could stop and
restart/reconnect it without affecting anything.
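
As an illustration of the "just do their own thing" case in scenario #3, here is a rough sketch of a probe that reports node load on stdout. It is only an assumed example: the file name probe.py, the JSON line format, and the choice of load average as the metric are hypothetical, and it uses no ORTE or MPI calls at all.

```python
import json
import os
import socket
import sys
import time


def sample():
    """Collect one snapshot of node-level load (Unix only: os.getloadavg)."""
    load1, load5, load15 = os.getloadavg()
    return {
        "host": socket.gethostname(),
        "pid": os.getpid(),
        "time": time.time(),
        "load1": load1,
        "load5": load5,
        "load15": load15,
    }


def run_probe(count, interval, out=sys.stdout):
    """Write `count` JSON lines to `out`, sleeping `interval` seconds between them."""
    for i in range(count):
        out.write(json.dumps(sample()) + "\n")
        out.flush()  # flush each line so whoever collects stdout sees it promptly
        if i + 1 < count:
            time.sleep(interval)


if __name__ == "__main__":
    run_probe(count=3, interval=0.5)
```

Assuming a Python interpreter is available on the backend nodes, something like "mpirun -pernode python3 probe.py" would start one such probe per node, and each output line carries the host name so the reports can be told apart.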

These were just a response to some concerns expressed about tools dealing
with changing APIs. The wire protocol removes that necessity/annoyance, with
some (hopefully minor) limits on capability. What people had wanted from a
tool was the ability to spawn jobs, spawn distributed "probes", and query
status of jobs/nodes/procs. I have provided that capability - just not sure
if there is more they would like to see.

Hope that helps
Ralph

> I can see how this maps into scenario #1, but I don't quite understand
> scenarios #2 and #3. Is there a new API for this functionality, or is
> there a simple wire protocol that is used to connect to the HNP and
> send these commands? Does the HNP listen on a new socket for these
> commands? I.e., how does it work?
>
>
> On Jan 16, 2008, at 8:47 AM, Ralph Castain wrote:
>
>> Hello all
>>
>> Summary: this note provides a brief overview of how various tools can
>> interface to OMPI applications once the next version of ORTE is
>> integrated into the trunk. It includes a request for input regarding
>> any needs (e.g., additional commands to be supported in the interface)
>> that have not been adequately addressed.
>>
>> As many of you know, I have been working on a tmp branch to complete
>> the revamp of ORTE that has been in progress for quite some time.
>> Among other things, this revamp is intended to simplify the system,
>> enhance scalability, and improve reliability.
>>
>> As part of that effort, I have extensively revised the support for
>> external tools. In the past, tools such as the Eclipse PTP could only
>> interact with Open MPI-based applications via ORTE APIs, thus exposing
>> the tool to any changes in those APIs. Most tools, however, do not
>> require the level of control provided by the APIs and can benefit from
>> a simplified interface.
>>
>> Accordingly, the revamped ORTE now offers alternative methods of
>> interaction. The primary change has been the creation of a
>> communications library with a simple serial protocol for interacting
>> with OMPI jobs. Thus, tools now have three choices for interacting
>> with OMPI jobs:
>>
>> 1. I have created a new communications library that tools can link
>> against. It does not include all of the ORTE or OMPI libraries, so it
>> has a very small memory footprint. Besides the usual calls to
>> initialize and finalize, the library contains utilities for finding
>> all of the OMPI jobs running on that HNP (i.e., all OMPI jobs whose
>> mpirun was executed from that host); querying the status of a job
>> (provides the job map plus all proc states); querying the status of
>> nodes (provides node names, status, and list of procs on each node
>> including their state); querying the status of a specific process;
>> spawning a new job; and terminating a job. In addition, you can attach
>> to output streams of any process, specifying stdout, stderr, or both -
>> this "tees" the specified streams, so it won't interfere with the
>> job's normal output flow.
>>
>> I could also create a utility to allow attachment to the input stream
>> of a process. However, I'm a little concerned about possible conflicts
>> with whatever is already flowing across that stream. I would
>> appreciate any suggestions as to whether or not to provide that
>> capability.
>>
>> Note: we removed the concept of the ORTE "universe", so a tool can now
>> talk to any mpirun without complications. Thus, tools can
>> simultaneously "connect" to and monitor multiple mpiruns, if desired.
>>
>>
>> 2. link against all of OMPI or ORTE, and execute a standalone program.
>> In this mode, your tool would act as a surrogate for mpirun by
>> directly spawning the user's application. This provides some
>> flexibility, but it does mean that both the tool and the job -must-
>> end together, and that the tool may need to be revised whenever
>> OMPI/ORTE APIs are updated.
>>
>>
>> 3. link against all of OMPI or ORTE, executing as a distributed set of
>> processes. In this mode, you would execute your tool via
>> "mpirun -pernode ./my_tool" (or whatever command is appropriate - this
>> example would launch one tool process on every node in the
>> allocation). If the tool processes need to communicate with each
>> other, they can call MPI_Init or orte_init, depending upon the level
>> of desired communication. Note that the tool job will be completely
>> standalone from the application job and must be terminated separately.
>>
>>
>> In all of these cases, it is possible for tool processes to connect
>> (via MPI and/or ORTE-RML) to a job's processes provided that the
>> application supports it.
>>
>> I can provide more details, of course, to anyone wishing them. What I
>> would appreciate, though, is any feedback about desired commands, mode
>> of operation, etc. that I might have missed or people would prefer be
>> changed. This code is all in a private repository for my tmp branch,
>> but I expect that to merge with the trunk fairly soon. I have provided
>> a couple of example tools to illustrate the above modes of operation
>> in that code.
>>
>> Thanks
>> Ralph
>>