Open MPI User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2005-10-21 09:08:14

On Fri, 2005-10-21 at 12:41 +0400, Konstantin Karganov wrote:

> > You and Chris G. raise a good point -- another parallel debugger vendor
> > has contacted me about the same issue (their debugger does not have an
> > executable named "totalview").
> > <...>
> > Comments?
> Actually, the point is deeper than just a debugger naming question.

Understood. My point here was that for at least some debuggers, a
naming scheme is all they need, and we should be able to accommodate
that.

> High-quality MPI implementation should provide more flexibility to the
> user.

Heh. Are you calling us a low-quality implementation? ;-)

> The debuggers may differ by startup algorithms: MPICH (the same
> example) allows to write arbitrary script for starting the custom
> debugger. And it works fine launching this script in run-time (there is a
> naming convention), w/o need to rebuild and reinstall the library.

Understood. However, our startup philosophy is quite different than
MPICH's; having a compiled executable as the starter has many more
benefits than problems (IMHO). You have concretely identified a problem
-- that there is no flexibility in different debugger bootstrap
mechanisms -- and a) I agree, b) I think we can fix it easily, and c) I
would prefer not to revert to scripts as the only solution.

This was the intent of my comments about making a component framework
for debugger bootstrapping -- the TotalView (and TotalView-like)
debugger support can go in one component and "something else" can go in
other components.
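
To make the component idea concrete, here is a minimal sketch in C. It is
not Open MPI's framework code: the type debugger_component_t, the function
debugger_select(), and both bootstrap functions are hypothetical names
invented for this illustration. It only shows the shape of "one interface,
several interchangeable implementations chosen by name at run time".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical interface: one component per debugger bootstrap style.
 * None of these names come from Open MPI itself. */
typedef struct {
    const char *name;              /* key used to select the component */
    int (*bootstrap)(int nprocs);  /* returns 0 on success, -1 on error */
} debugger_component_t;

/* Placeholder bootstraps: a real one would start or attach a debugger. */
static int totalview_like_bootstrap(int nprocs) { return nprocs > 0 ? 0 : -1; }
static int script_bootstrap(int nprocs)         { return nprocs > 0 ? 0 : -1; }

static const debugger_component_t components[] = {
    { "totalview", totalview_like_bootstrap },
    { "script",    script_bootstrap },
};

/* Look a component up by name; returns NULL if none matches. */
const debugger_component_t *debugger_select(const char *name)
{
    assert(name != NULL);
    size_t n = sizeof(components) / sizeof(components[0]);
    for (size_t i = 0; i < n; ++i) {
        if (strcmp(components[i].name, name) == 0) {
            return &components[i];
        }
    }
    return NULL;
}
```

With this layout, supporting a new debugger means adding one entry to the
table rather than changing the starter itself.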

> Actually all I need is the same, that orte already does:
> 1. Launch the processes on all nodes
> 2. Make sure they are successfully launched.
> 3. Get the array of handles to read/write to each process
> 4. Be able to stop the processes
> 5. Probably send signals to processes (gdb uses SIGINT to interrupt
> execution)
> 6. Probably have the info about node names and PIDs to display it and to
> implement pp.4-5
> Looks just the same as for usual run, but the devil is surely in the
> details.

Yes, it is quite similar -- there are a few minor differences, though.
We don't currently support sending arbitrary signals (it certainly can
be done -- at least in some environments -- we just haven't had a need
for that yet), and IMHO, it would be nice if ORTE could handle "launch
this job alongside that other job" bookkeeping for you, so that you
don't need to specify all the location/process placement stuff.
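
For reference, items 1, 2 and 5 of the list above look roughly like this on
a single node with plain POSIX calls. This is only a local sketch and says
nothing about how ORTE would do it across nodes; launch_idle_child() and
interrupt_and_reap() are hypothetical helper names.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Start a child that just waits for a signal.
 * Returns the child's PID in the parent, or -1 if fork() failed. */
pid_t launch_idle_child(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        /* Child: make sure SIGINT has its default (terminating) action,
         * even if the parent had it ignored, then block forever. */
        signal(SIGINT, SIG_DFL);
        for (;;) {
            pause();
        }
    }
    return pid;
}

/* Send SIGINT to the child and wait for it to exit.
 * Returns the number of the signal that terminated it, or -1 on error. */
int interrupt_and_reap(pid_t pid)
{
    int status = 0;
    if (kill(pid, SIGINT) != 0) {
        return -1;
    }
    if (waitpid(pid, &status, 0) != pid) {
        return -1;
    }
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

A process running under gdb would not die here; gdb intercepts the SIGINT
and stops the inferior instead, which is the behaviour item 5 relies on.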

> > I think the two main things you want are:
> >
> > 1. the information about the MPI processes in the ORTE job of interest
> > (are you interested in handling MPI-2 dynamic situations?).
> Not yet. It is planned to support only MPI 1.2 for the first release.

Gotcha. Let us know when you're interested; there are a lot of unanswered
questions in that arena.

> > 2. <..>
> I also might want 3. Get the knowledge "how it works" to be able to play
> with the code myself :)

Excellent. :-)

> > TV's view of the world is to have "one" master debugger that controls all
> > the processes, so having a separate "starter" process in addition to the
> > MPI processes was no big deal.
> I'm trying to do the same way - attach gdb to each process as a node
> debugger and connect all this to the main debugger process, that has GUI
> and implements all "parallel" logic.
> The question was merely how to do it: call "gdb orterun" and catch it
> somewhere on breakpoint or attach to orterun later or smth else.

Yes -- per your other mail:

> Reply to myself:
> # gdb orterun
> (gdb) br MPIR_Breakpoint
> (gdb) run
> (gdb) <get the table>
> (gdb) detach
> (gdb) exit

> Am I right?

Essentially, yes. We also need to set an MCA param that gets propagated
out to all the MPI processes telling them to block in MPI_INIT until the
debugger attaches and changes a value (see
ompi/debuggers/ompi_totalview.c -- it's called at the very end of
MPI_INIT, in ompi/runtime/ompi_mpi_init.c).
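
The two halves of that handshake can be sketched as follows. The starter
side uses the symbol names of the MPIR debugger interface as I understand
it (MPIR_PROCDESC, MPIR_proctable, MPIR_proctable_size, MPIR_Breakpoint);
treat the exact layout as an assumption. The MPI-process side is purely
illustrative: demo_debug_gate, publish_proctable() and wait_for_debugger()
are hypothetical names, and the actual code in
ompi/debuggers/ompi_totalview.c may look quite different.

```c
#include <assert.h>
#include <stddef.h>

/* Starter side: one entry per launched MPI process. */
typedef struct {
    char *host_name;        /* node the process runs on */
    char *executable_name;  /* program that was launched */
    int   pid;              /* process ID on that node */
} MPIR_PROCDESC;

MPIR_PROCDESC *MPIR_proctable = NULL;
int MPIR_proctable_size = 0;

/* Intentionally empty: the debugger sets a breakpoint here and reads
 * the table when the starter calls it. */
void MPIR_Breakpoint(void) { }

/* Hypothetical helper: fill in the table, then notify the debugger. */
void publish_proctable(MPIR_PROCDESC *table, int count)
{
    assert(table != NULL && count > 0);
    MPIR_proctable = table;
    MPIR_proctable_size = count;
    MPIR_Breakpoint();
}

/* MPI-process side: a flag the attached debugger overwrites.
 * volatile so each poll really re-reads memory. */
volatile int demo_debug_gate = 0;

/* Poll the gate up to max_polls times; returns 1 once it is released,
 * 0 if it is still closed. A real implementation would sleep between
 * polls and typically wait without a limit. */
int wait_for_debugger(long max_polls)
{
    for (long i = 0; i < max_polls; ++i) {
        if (demo_debug_gate != 0) {
            return 1;
        }
    }
    return demo_debug_gate != 0;
}
```

The poll limit exists only so the sketch can be exercised without a
debugger attached.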

> > if you're actually integrated in as a component, then you could get the
> > information directly (i.e., via API)...? The possibilities here are
> > open.
> This also sounds interesting.

This is the primary route that I'd like to follow. We had always
envisioned doing this; you're giving us a concrete reason to do so. :-)
Our general rule of implementation in Open MPI is "if we ever want to
implement something multiple different ways, make it a framework and
write components."

{+} Jeff Squyres
{+} The Open MPI Project