Open MPI User's Mailing List Archives

From: Oleg Morajko (olegmorajko_at_[hidden])
Date: 2007-10-16 03:14:29


Thank you for your thoughts. Some more comments inline.


On 10/9/07, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> Interesting idea.
> One obvious solution would be to mpirun your controller tasks and, as
> you mentioned, use MPI to communicate between them. Then you can use
> MPI_COMM_SPAWN to launch the actual MPI job that you want to monitor.

Well, yes, it certainly could be done, but it would not work in my scenario.
As I said before, I use a dynamic instrumentation API (DynInst API) to
control and instrument MPI tasks. DynInst is a sort of debugger; it uses
ptrace() on Linux to control processes. So I need to use the DynInst API to
create a controlled process (rather than fork() it or MPI_Comm_spawn() it).
Alternatively, I could fork it and later attach (with DynInst) to the running
process in order to control it. In the latter case, however, I would lose
control over the first several seconds of execution.
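
The difference between creating a process under control and attaching later can be sketched with raw ptrace() calls on Linux. This is only an illustration of the mechanism, not DynInst's actual implementation; the function names spawn_stopped and release_and_wait are hypothetical. A child that requests tracing before exec is stopped by the kernel as soon as the new image is loaded, so none of its code has run yet, whereas an attach can only happen at whatever point the target has already reached.

```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: start argv[0] under tracer control.
 * The child asks to be traced and then calls execvp(); the kernel stops it
 * with SIGTRAP once the new image is loaded, before any of its code runs.
 * Returns the child's pid (child stopped), or -1 on failure. */
static pid_t spawn_stopped(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(126);              /* tracing not permitted */
        execvp(argv[0], argv);
        _exit(127);                  /* exec failed */
    }

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (!WIFSTOPPED(status))
        return -1;                   /* child exited instead of stopping */
    if (WSTOPSIG(status) != SIGTRAP) {
        kill(pid, SIGKILL);          /* unexpected stop: clean up */
        waitpid(pid, &status, 0);
        return -1;
    }
    return pid;                      /* a tool would instrument here */
}

/* Hypothetical helper: let the stopped child run freely and return its
 * exit code, or -1 on error. */
static int release_and_wait(pid_t pid)
{
    int status;
    if (ptrace(PTRACE_DETACH, pid, NULL, NULL) == -1)
        return -1;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Between spawn_stopped() and release_and_wait() the tracer has the whole process image available but no instruction of the target has executed, which is the window that is lost in the attach case.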

> However, this will only more-or-less work. OMPI currently polls
> aggressively to make message passing progress, so if you end up over-
> subscribing nodes (because you filled up the cores on one node with
> all the target MPI processes but also have 1 or more controller
> processes running on the same node), they'll thrash each other and
> you'll get -- at best -- unreliable/unrepeatable performance fraught
> with lots of race conditions.

This is actually a less serious issue than it seems. The daemon itself is a
very lightweight process. After executing the startup code (binary parsing,
process creation and instrumentation), it lets the MPI process run without any
additional overhead and then sits waiting for certain events, so normally
the intrusion is less than 2%. The overhead of the instrumentation inserted
into the MPI task is controlled with a threshold and, if placed reasonably,
stays low (e.g. not in a tight loop that executes many times, but on entry/exit
of, let's say, MPI_xxx communication calls).
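
Why probe placement matters can be shown with a back-of-the-envelope estimate. The sketch below is not the tool's actual threshold mechanism, which is not described here; the function names, the per-call cost and the call rates are all assumed values chosen for illustration. It simply multiplies how often a probe fires by what one firing costs and compares the result against a budget such as the 2% mentioned above.

```c
/* Hypothetical estimate: fraction of wall-clock time spent inside probes,
 * given how often they fire and the (assumed) cost of one firing. */
static double probe_overhead(double calls_per_sec, double cost_per_call_sec)
{
    return calls_per_sec * cost_per_call_sec;
}

/* Returns 1 if the estimated overhead stays within the given budget
 * (e.g. 0.02 for 2%), 0 otherwise. */
static int within_budget(double calls_per_sec, double cost_per_call_sec,
                         double threshold)
{
    return probe_overhead(calls_per_sec, cost_per_call_sec) <= threshold;
}
```

With an assumed cost of one microsecond per probe, a probe on MPI call entry/exit firing a thousand times per second costs about 0.1% of the run time, while the same probe inside a tight loop firing ten million times per second would cost far more than the program's own work.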

> Another issue is that OMPI's MPI_COMM_SPAWN does not give good
> options to allow specific process placement, so it might be a little
> dicey to get processes to land exactly where you want them.

Not an option, as the daemon and the task must sit on the same host. The best
scenario is a dual-core host, with one CPU for the task and the other for the
daemon.

> Alternatively, you could simply locally fork()/exec() your target
> process from the controller. But the MPI spec does state that the
> use of fork() is undefined within an MPI process. Indeed, if you
> are using a high-speed network such as InfiniBand or Myrinet, calling
> fork() after you call MPI_INIT, Bad Things(tm) will happen (we can
> explain more if you care). But if you're only using TCP, you should
> be fine.

More or less, this is what I was doing. The daemon is launched by mpirun, but
it does not call MPI_Init itself; instead it uses DynInst to fork the MPI
task, which then calls MPI_Init. I tested this on Open MPI (using TCP/IP and
InfiniBand), MPICH and LAM/MPI (on TCP), and it worked.

> Another option might be to mpirun your target MPI app, have it wait
> in some kind of local barrier, and then mpirun your controllers on
> the same machines. The controllers find/attach to your target
> processes, release them from the local barrier, and then you're good
> to go -- both your controllers and your target app are fully up and
> running under MPI. You'll still have the spinning/performance issue,
> though -- so you won't want to oversubscribe nodes.

Absolutely, this would be an attach scenario for the daemons, and they could
use MPI. Nice idea.
Unfortunately, it would make the tool more complicated to use, and there would
be no control over what happens during the first several seconds.
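
One simple way to build such a local barrier is to have the target stop itself and let the controller resume it with a signal. The sketch below is an assumption about how this could be done, not something taken from either tool; wait_for_release, release_target and demo_local_barrier are hypothetical names, and the demo uses a forked child as a stand-in for the target so that it can run on its own. A real controller started by a separate mpirun would not be the parent, so it would have to find the target's pid by other means and could not use waitpid() to observe the stop.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Target side: park the calling process until someone sends SIGCONT. */
static void wait_for_release(void)
{
    raise(SIGSTOP);
}

/* Controller side: resume a parked target. Returns 0 on success. */
static int release_target(pid_t pid)
{
    return kill(pid, SIGCONT);
}

/* Demo: fork a stand-in target that parks itself and then exits with
 * exit_code; the parent plays the controller. Returns the target's exit
 * code, or -1 on error. */
static int demo_local_barrier(int exit_code)
{
    int status;
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        wait_for_release();          /* code before this line is unobserved */
        _exit(exit_code);
    }

    /* WUNTRACED makes waitpid() report that the child has stopped. */
    if (waitpid(pid, &status, WUNTRACED) != pid || !WIFSTOPPED(status))
        return -1;

    /* A real controller would attach and instrument the target here. */

    if (release_target(pid) != 0)
        return -1;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Everything the target executes before wait_for_release() runs without the controller, which is the loss of early control described above.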

> Does this help?

Open-thinking always helps. Thank you.

In the end, I decided not to use MPI for inter-daemon communication and opted
for the MRNet infrastructure (multicast/reduction network,

On Oct 1, 2007, at 10:49 PM, Oleg Morajko wrote:
> > Hello,
> >
> > In the context of my PhD research, I have been developing a run-
> > time performance analyzer for MPI-based applications.
> > My tool provides a controller process for each MPI task. In
> > particular, when an MPI job is started, a special wrapper script is
> > generated that first starts my controller processes and next each
> > controller spawns an actual MPI task (that performs MPI_Init etc.).
> > I use dynamic instrumentation API (DynInst API) to control and
> > instrument MPI tasks.
> >
> > The point is I need to intercommunicate my controller processes, in
> > particular I need a point-to-point communication between arbitrary
> > pair of controllers. So it seems reasonable to take advantage of
> > MPI itself and use it for communication. However I am not sure what
> > would be the impact of calling MPI_Init and communicating from
> > controller processes taking into account both controllers and
> > actual MPI processes were started with the same mpirun
> > invocation. Actually I would need to assure that controllers have a
> > separate MPI execution environment while the application has another
> > one.
> >
> > Any suggestions on how to achieve that? Obviously another option is to
> > use sockets to intercommunicate controllers, but having MPI this
> > seems to be overkill.
> >
> > Thank you in advance for your help.
> >
> > Regards,
> > --Oleg
> >
> > PhD student, Universitat Autonoma de Barcelona, Spain
> >
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> >
> --
> Jeff Squyres
> Cisco Systems