Open MPI Development Mailing List Archives


From: Terry D. Dontje (Terry.Dontje_at_[hidden])
Date: 2007-07-27 09:58:55

Ralph Castain wrote:

>WHAT: Proposal to add two new command line options that will allow us to
> replace the current need to separately launch a persistent daemon to
> support connect/accept operations
>WHY: Remove problems of confusing multiple allocations, provide a cleaner
> method for connect/accept between jobs
>WHERE: minor changes in orterun and orted, some code in rmgr and each pls
> to ensure the proper jobid and connect info is passed to each
> app_context as it is launched
It is my opinion that we would be better off attacking the issues of
the persistent daemons described below than creating a new set of
options to mpirun for process placement. (More comments on the actual
proposal below.)

>TIMEOUT: 8/10/07
>We currently do not support connect/accept operations in a clean way. Users
>are required to first start a persistent daemon that operates in a
>user-named universe. They then must enter the mpirun command for each
>application in a separate window, providing the universe name on each
>command line. This is required because (a) mpirun will not run in the
>background (in fact, at one point in time it would segfault, though I
>believe it now just hangs), and (b) we require that all applications using
>connect/accept operate under the same HNP.
>This is burdensome and appears to be causing problems for users as it
>requires them to remember to launch that persistent daemon first -
>otherwise, the applications execute, but never connect. Additionally, we
>have the problem of confused allocations from the different login sessions.
>This has caused numerous problems of processes going to incorrect locations,
>allocations timing out at different times and causing jobs to abort, etc.
>What I propose here is to eliminate the confusion in a manner that minimizes
>code complexity. The idea is to utilize our so-painfully-developed multiple
>app_context capability to have the user launch all the interacting
>applications with the same mpirun command. This not only eliminates the
>annoyance factor for users by eliminating the need for multiple steps and
>login sessions, but also solves the problem of ensuring that all
>applications are running in the same allocation (so we don't have to worry
>any more about timeouts in one allocation aborting another job).
>The proposal is to add two command line options that are associated with a
>specific app_context (feel free to redefine the name of the option - I don't
>personally care):
>1. --independent-job - indicates that this app_context is to be launched as
>an independent job. We will assign it a separate jobid, though we will map
>it as part of the overall command (e.g., if by slot and no other directives
>provided, it will start mapping where the prior app_context left off)
I am unclear on what the --connect option really does. The MPI codes
still have to call MPI_Comm_connect to actually connect to another
process. Can we get away with just the above option?

>2. --connect x,y,z - only valid when combined with the above option,
>indicates that this independent job is to be MPI-connected to app_contexts
>x,y,z (where x,y,z are the number of the app_context, counting from the
>beginning of the command - you choose if we start from 0 or 1).
>Alternatively, we can default to connecting to everyone, and then use
>--disconnect to indicate we -don't- want to be connected.
>Note that this means the entire allocation for the combined app_contexts
>must be provided. This helps the RTE tremendously to keep things straight,
>and ensures that all the app_contexts will be able to complete (or not) in a
>synchronized fashion.
>It also allows us to eliminate the persistent daemon and multiple login
>session requirements for connect/accept. That does not mean we cannot have a
>persistent daemon to create a virtual machine, assuming we someday want to
>support that mode of operation. This simply removes the requirement that the
>user start one just so they can use connect/accept.
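
To make the semantics of the two proposed options concrete, here is a rough
Python sketch of how a launcher might interpret them. It is not orterun code,
and nothing in it exists in Open MPI. The names AppContext, validate,
assign_jobids and map_by_slot are hypothetical, as are the assumptions that
app_contexts are counted from 0, that app_contexts without --independent-job
share one jobid, and that mapping is by slot with no other directives.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AppContext:
    """One app_context from a single (hypothetical) mpirun command line."""
    index: int                 # position on the command line, counted from 0 here
    nprocs: int                # number of processes requested
    independent: bool = False  # True if --independent-job was given
    connect_to: List[int] = field(default_factory=list)  # from --connect x,y,z


def validate(contexts: List[AppContext]) -> None:
    """Reject option combinations the proposal rules out."""
    for ctx in contexts:
        # The proposal says --connect is only valid with --independent-job.
        if ctx.connect_to and not ctx.independent:
            raise ValueError(
                f"app_context {ctx.index}: --connect requires --independent-job")
        for target in ctx.connect_to:
            if not 0 <= target < len(contexts):
                raise ValueError(
                    f"app_context {ctx.index}: no app_context {target}")


def assign_jobids(contexts: List[AppContext], base_jobid: int = 1) -> Dict[int, int]:
    """Give each independent app_context its own jobid (assumed numbering)."""
    jobids: Dict[int, int] = {}
    next_jobid = base_jobid + 1
    for ctx in contexts:
        if ctx.independent:
            jobids[ctx.index] = next_jobid
            next_jobid += 1
        else:
            # Assumption: ordinary app_contexts stay in the shared job.
            jobids[ctx.index] = base_jobid
    return jobids


def map_by_slot(contexts: List[AppContext], total_slots: int) -> Dict[int, List[int]]:
    """Map by slot; each app_context starts where the prior one left off."""
    placement: Dict[int, List[int]] = {}
    next_slot = 0
    for ctx in contexts:
        if next_slot + ctx.nprocs > total_slots:
            # The whole allocation for the combined app_contexts must be provided.
            raise ValueError("allocation too small for the combined app_contexts")
        placement[ctx.index] = list(range(next_slot, next_slot + ctx.nprocs))
        next_slot += ctx.nprocs
    return placement


if __name__ == "__main__":
    # Roughly: mpirun -np 2 server : -np 2 --independent-job --connect 0 client
    contexts = [
        AppContext(index=0, nprocs=2),
        AppContext(index=1, nprocs=2, independent=True, connect_to=[0]),
    ]
    validate(contexts)
    print(assign_jobids(contexts))
    print(map_by_slot(contexts, total_slots=4))
```

Note that in this sketch --connect only records which app_contexts are
supposed to be wired together. It does not replace the MPI_Comm_connect and
MPI_Comm_accept calls in the applications themselves, which is the question
raised above.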