Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] OMPI/ORTE and tools
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-01-23 14:28:58

Gotcha; thanks for the explanation.

The capabilities you added sound good for the moment; I'm sure we'll
think of more over time...

On Jan 22, 2008, at 10:19 AM, Ralph H Castain wrote:

> On 1/19/08 6:31 AM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:
>> Ralph --
>> I'm a little confused as to what you're providing. In all 3 of the
>> scenarios you describe, you're saying that the external tool can
>> connect to the HNP for one or more jobs and execute a few discrete
>> functions:
>> - find procs and/or jobs running under that HNP
>> - querying status of procs and/or jobs
>> - querying status of nodes
>> - spawning a new job
>> - terminating a job
> Actually, that isn't quite correct - sorry for the confusion. What I
> was trying to say was that you could connect via a simple wire protocol
> (scenario #1) to pass a few discrete commands and queries to an existing
> mpirun (and/or persistent virtual machine). The HNP "listens" on the
> same daemon command socket that it always opens, so there is no "new"
> socket for this functionality.
>
> The advantages of this approach are: (a) the tool only calls simple
> library functions to pass commands/queries to the HNP and get answers
> back. Any changes in APIs within ORTE are now totally hidden from the
> tool; (b) the size of the required comm library is much smaller than
> all of ORTE, so the tool gets a smaller memory footprint; (c) the tool
> "lives" totally independently of the application, so you can quit (and
> later restart and reconnect) the tool without disturbing the
> application.
>
> Disadvantages are: (a) you only get access to a limited set of queries
> and/or commands - what I was requesting was input on commands people
> would like that I might have missed; and (b) the mpirun and/or virtual
> machine must be started separately before the tool can connect to them
> (however, the tool can be started first and simply be told to "look for
> an mpirun" after the mpirun is issued).
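
For illustration, the request/reply pattern described for scenario #1 might look roughly like the sketch below. This is not ORTE code and not the actual wire protocol: the command names (LIST_JOBS, JOB_STATUS, TERMINATE), the line-based text format, and the in-memory job table are all assumptions invented for the example. A listener thread stands in for the HNP, and the tool opens a fresh connection for each query, which mirrors the point that the tool can quit and reconnect without disturbing the job.

```python
import socket
import threading

# Hypothetical stand-in for the job state held by an HNP.
JOBS = {"1": "RUNNING", "2": "RUNNING"}


def handle(conn):
    """Answer one-line text commands until the tool disconnects."""
    with conn, conn.makefile("r") as reader, conn.makefile("w") as writer:
        for line in reader:
            parts = line.split()
            if not parts:
                continue
            if parts[0] == "LIST_JOBS":
                reply = " ".join(sorted(JOBS))
            elif parts[0] == "JOB_STATUS" and len(parts) == 2:
                reply = JOBS.get(parts[1], "UNKNOWN")
            elif parts[0] == "TERMINATE" and len(parts) == 2 and parts[1] in JOBS:
                JOBS[parts[1]] = "TERMINATED"
                reply = "OK"
            else:
                reply = "ERROR"
            writer.write(reply + "\n")
            writer.flush()


def serve(listener):
    """Accept tool connections; each one gets its own handler thread."""
    while True:
        try:
            conn, _ = listener.accept()
        except OSError:  # listener was closed
            return
        threading.Thread(target=handle, args=(conn,), daemon=True).start()


def start_listener():
    """Open a listening socket on an ephemeral localhost port."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen()
    threading.Thread(target=serve, args=(listener,), daemon=True).start()
    return listener


def query(port, command):
    """Tool side: connect, send one command, read one reply, disconnect."""
    with socket.create_connection(("127.0.0.1", port)) as sock:
        with sock.makefile("r") as reader, sock.makefile("w") as writer:
            writer.write(command + "\n")
            writer.flush()
            return reader.readline().strip()
```

A tool would call start_listener() only in a test setup like this one; in the scenario above the listening side already exists, and the tool's whole footprint is the small query() helper. Because every query() call is a separate connection, job state persists on the listener side between calls.
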
>
> Scenario #2 is identical to what we have in the code releases today. In
> this mode, the tool calls "orte_init" and sets itself up as an HNP. It
> then uses the ORTE APIs to execute the commands - e.g., calling
> orte_plm.spawn to launch the specified application. The tool can also
> launch any distributed "probes" (i.e., processes needed by the tool but
> not part of the application - e.g., to monitor an application's
> resource usage) on the backend nodes, if desired, by simply calling
> orte_plm.spawn for a second "app" that consists of the probe
> executable.
> Advantages: full access to all ORTE functionality and internal data.
> Disadvantages: (a) the tool's code may have to be updated to follow
> changes in ORTE internal APIs; (b) the tool must stay alive throughout
> execution of the application.
>
> Scenario #3 is somewhat of a combination of the prior two. If you
> invoke mpirun to launch an application into the background, you can
> subsequently invoke mpirun again to launch a set of distributed
> "probes" (as described above) to monitor that application. In this
> case, you could (if desired) have one or more of the "probe" processes
> contact the HNP via the simple wire protocol to issue commands. Or you
> could just have the processes report (via stdout or files) whatever
> info they are monitoring.
>
> The point in this scenario was mainly to show that you could launch a
> distributed tool without dealing with the ORTE interfaces - the tool's
> procs can either just do their own thing, or can use the wire protocol
> to communicate with the application's HNP. In this case, the tool is
> again independent of the application being monitored, so you could stop
> and restart/reconnect it without affecting anything.
>
> These were just a response to some concerns expressed about tools
> dealing with changing APIs. The wire protocol removes that
> necessity/annoyance, with some (hopefully minor) limits on capability.
> What people had wanted from a tool was the ability to spawn jobs, spawn
> distributed "probes", and query status of jobs/nodes/procs. I have
> provided that capability - just not sure if there is more they would
> like to see.
> Hope that helps
> Ralph
>> I can see how this maps into scenario #1, but I don't quite understand
>> scenarios #2 and #3. Is there a new API for this functionality, or is
>> there a simple wire protocol that is used to connect to the HNP and
>> send these commands? Does the HNP listen on a new socket for these
>> commands? I.e., how does it work?
>> On Jan 16, 2008, at 8:47 AM, Ralph Castain wrote:
>>> Hello all
>>>
>>> Summary: this note provides a brief overview of how various tools can
>>> interface to OMPI applications once the next version of ORTE is
>>> integrated into the trunk. It includes a request for input regarding
>>> any needs (e.g., additional commands to be supported in the interface)
>>> that have not been adequately addressed.
>>>
>>> As many of you know, I have been working on a tmp branch to complete
>>> the revamp of ORTE that has been in progress for quite some time.
>>> Among other things, this revamp is intended to simplify the system,
>>> enhance scalability, and improve reliability.
>>>
>>> As part of that effort, I have extensively revised the support for
>>> external tools. In the past, tools such as the Eclipse PTP could only
>>> interact with Open MPI-based applications via ORTE APIs, thus exposing
>>> the tool to any changes in those APIs. Most tools, however, do not
>>> require the level of control provided by the APIs and can benefit from
>>> a simplified interface.
>>>
>>> Accordingly, the revamped ORTE now offers alternative methods of
>>> interaction. The primary change has been the creation of a
>>> communications library with a simple serial protocol for interacting
>>> with OMPI jobs. Thus, tools now have three choices for interacting
>>> with OMPI jobs:
>>>
>>> 1. I have created a new communications library that tools can link
>>> against. It does not include all of the ORTE or OMPI libraries, so it
>>> has a very small memory footprint. Besides the usual calls to
>>> initialize and finalize, the library contains utilities for finding
>>> all of the OMPI jobs running on that HNP (i.e., all OMPI jobs whose
>>> mpirun was executed from that host); querying the status of a job
>>> (provides the job map plus all proc states); querying the status of
>>> nodes (provides node names, status, and list of procs on each node
>>> including their state); querying the status of a specific process;
>>> spawning a new job; and terminating a job. In addition, you can attach
>>> to output streams of any process, specifying stdout, stderr, or both -
>>> this "tees" the specified streams, so it won't interfere with the
>>> job's normal output flow.
>>>
>>> I could also create a utility to allow attachment to the input stream
>>> of a process. However, I'm a little concerned about possible conflicts
>>> with whatever is already flowing across that stream. I would
>>> appreciate any suggestions as to whether or not to provide that
>>> capability.
>>>
>>> Note: we removed the concept of the ORTE "universe", so a tool can now
>>> talk to any mpirun without complications. Thus, tools can
>>> simultaneously "connect" to and monitor multiple mpiruns, if desired.
>>>
>>> 2. link against all of OMPI or ORTE, and execute a standalone program.
>>> In this mode, your tool would act as a surrogate for mpirun by
>>> directly spawning the user's application. This provides some
>>> flexibility, but it does mean that both the tool and the job -must-
>>> end together, and that the tool may need to be revised whenever
>>> OMPI/ORTE APIs are updated.
>>>
>>> 3. link against all of OMPI or ORTE, executing as a distributed set of
>>> processes. In this mode, you would execute your tool via
>>> "mpirun -pernode ./my_tool" (or whatever command is appropriate - this
>>> example would launch one tool process on every node in the
>>> allocation). If the tool processes need to communicate with each
>>> other, they can call MPI_Init or orte_init, depending upon the level
>>> of desired communication. Note that the tool job will be completely
>>> standalone from the application job and must be terminated separately.
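
As a rough illustration of the "tee" behavior mentioned under option 1, the sketch below shows one way a stream can be copied to an attached observer without changing what reaches the normal destination. It is a generic Python example, not how ORTE implements output forwarding; the class name TeedStream and its attach/detach methods are hypothetical.

```python
import io


class TeedStream:
    """Forward every write to the primary sink and to any attached taps."""

    def __init__(self, primary):
        self.primary = primary  # the job's normal output destination
        self.taps = []          # extra sinks added by attached tools

    def attach(self, tap):
        """Start copying subsequent output to `tap`."""
        self.taps.append(tap)

    def detach(self, tap):
        """Stop copying output to `tap`; the primary flow is unaffected."""
        self.taps.remove(tap)

    def write(self, data):
        written = self.primary.write(data)  # normal flow, unchanged
        for tap in self.taps:
            tap.write(data)                 # copy for each attached tool
        return written


def demo():
    """Write three lines, with a tool attached only for the second one."""
    primary, tap = io.StringIO(), io.StringIO()
    stream = TeedStream(primary)
    stream.write("before attach\n")
    stream.attach(tap)
    stream.write("while attached\n")
    stream.detach(tap)
    stream.write("after detach\n")
    return primary.getvalue(), tap.getvalue()
```

In demo(), the primary sink receives all three lines whether or not a tool is attached, while the tap only sees the line written during the attachment window.
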
>>>
>>> In all of these cases, it is possible for tool processes to connect
>>> (via MPI and/or ORTE-RML) to a job's processes provided that the
>>> application supports it.
>>>
>>> I can provide more details, of course, to anyone wishing them. What I
>>> would appreciate, though, is any feedback about desired commands, mode
>>> of operation, etc. that I might have missed or people would prefer be
>>> changed.
>>>
>>> This code is all in a private repository for my tmp branch, but I
>>> expect that to merge with the trunk fairly soon. I have provided a
>>> couple of example tools to illustrate the above modes of operation in
>>> that code.
>>> Thanks
>>> Ralph

Jeff Squyres
Cisco Systems