Open MPI Development Mailing List Archives

From: Ralph H. Castain (rhc_at_[hidden])
Date: 2006-02-14 10:08:29


Hi Greg

I believe you may have been wiring it up originally because we didn't
have that service implemented at that time. We do have it all wired
up now - in fact, Brian has done some fairly important cleanup to the
system recently.

Since we complete the wiring upon notification of the INIT trigger, I
would not advise attaching yourself to that trigger - it could create
a race condition as to which callback (yours or ours) gets called
first. Instead, I would suggest attaching to the LAUNCHED trigger,
which occurs next in the sequence. This fires when the procs are
actually all launched, but before they initialize themselves through
MPI_Init (assuming they do so).
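
A minimal sketch of the ordering argument above, in Python with made-up
names (Job, subscribe, system_wire_io and tool_callback are illustrative
only, none of this is the ORTE API). It assumes triggers fire strictly in
sequence, so a callback attached to LAUNCHED cannot run before the wiring
that completes on INIT:

```python
# Hypothetical sketch: none of these names are ORTE APIs.
from collections import defaultdict

TRIGGER_ORDER = ["INIT", "LAUNCHED"]  # sequence described in the message


class Job:
    def __init__(self):
        self.subscribers = defaultdict(list)
        self.io_wired = False
        self.log = []

    def subscribe(self, trigger, callback):
        self.subscribers[trigger].append(callback)

    def run(self):
        # Fire each trigger in order; within one trigger, callbacks run
        # in registration order, which is what makes sharing INIT risky.
        for trigger in TRIGGER_ORDER:
            for callback in list(self.subscribers[trigger]):
                callback(self, trigger)


def system_wire_io(job, trigger):
    # Stands in for the runtime completing I/O wiring on INIT.
    job.io_wired = True
    job.log.append("system wired io on " + trigger)


def tool_callback(job, trigger):
    # The tool records whether I/O was already wired when it ran.
    job.log.append("tool saw io_wired=%s on %s" % (job.io_wired, trigger))


job = Job()
job.subscribe("LAUNCHED", tool_callback)  # tool registers first
job.subscribe("INIT", system_wire_io)     # system registers later
job.run()
print("\n".join(job.log))
```

Even though the tool registers before the system here, it still sees the
wiring in place, because its trigger comes later in the sequence. Had both
been attached to INIT, the result would depend on registration order.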

If that doesn't work for you, I could create a NOTIFY_ME_LAST
subscription flag that would ensure your callback is invoked after any
others. This would resolve the race condition and allow you to use
the INIT trigger, but would take a little work on my part to
implement before you could use it.
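
To illustrate what such a flag might do (again a rough Python sketch under
my own assumptions: the Trigger class and the notify_me_last parameter are
hypothetical and do not exist in the code base), the trigger could keep
"last" subscribers in a separate list and invoke them only after everyone
else:

```python
# Hypothetical sketch: Trigger and notify_me_last are illustrative names,
# not part of any existing ORTE interface.
class Trigger:
    def __init__(self, name):
        self.name = name
        self._normal = []  # callbacks invoked in registration order
        self._last = []    # callbacks that asked to run after all others

    def subscribe(self, callback, notify_me_last=False):
        target = self._last if notify_me_last else self._normal
        target.append(callback)

    def fire(self):
        # Normal subscribers always run before "last" subscribers,
        # regardless of who registered first.
        return [cb(self.name) for cb in self._normal + self._last]


init = Trigger("INIT")
init.subscribe(lambda name: "tool callback on " + name, notify_me_last=True)
init.subscribe(lambda name: "system wiring on " + name)
order = init.fire()
print(order)
```

The tool subscribes first but is still called second, which is the
ordering guarantee the flag would be meant to provide.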

Ralph

At 03:21 PM 2/13/2006, you wrote:
>I thought we were wiring up stdio ourselves because it wasn't being
>done in the spawn? If it's now being done by spawn then that's fine,
>but we need to be able to get called back when the I/O becomes
>available. How does this work?
>
>Greg
>
>On Feb 13, 2006, at 2:16 PM, Ralph H. Castain wrote:
>
> > Hmmmm....I wonder if this is going to create a problem?
> >
> > Tim/Brian/you io forwarding folks: This poses an interesting
> > question. We automatically wire up i/o forwarding in our spawn
> > routine. What happens when someone sets up their own i/o forwarding
> > callback and subsequently wires up stdio themselves? Does this
> > overwrite what we did, do processes receive duplicate copies, does it
> > generate an error, ...?
> >
> > I gather this is working for Nathan, and I don't claim to fully
> > understand what he is doing, but I'm curious as to what might happen
> > since I don't see anything in the system to prevent someone doing
> > this (not sure we could anyway).
> >
> > Ralph
> >
> >
> > At 02:32 PM 2/9/2006, you wrote:
> >> I've coded a hacky workaround in our code to get past this.
> >> Basically, I capture all of the state transitions, and on the first
> >> one fired for a job I fire the 'init' state internally in our tool.
> >> Generally this occurs for one of the gate transitions, G1 or
> >> something. It'll work this way.
> >>
> >> Furthermore, we're telling our users to get your 1.0.2a4 (or whatever
> >> 1.0.2 is available at the time).
> >>
> >> The way I coded it, when you guys put this into the main branch and
> >> the INIT state resumes firing, my code will start working that much
> >> better. I really only brought it up because I felt it was a bug you
> >> might not have been aware of.
> >>
> >> Thanks all.
> >>
> >> -- Nathan
> >> Correspondence
> >> ---------------------------------------------------------------------
> >> Nathan DeBardeleben, Ph.D.
> >> Los Alamos National Laboratory
> >> Parallel Tools Team
> >> High Performance Computing Environments
> >> phone: 505-667-3428
> >> email: ndebard_at_[hidden]
> >> ---------------------------------------------------------------------
> >>
> >>
> >>
> >> Jeff Squyres wrote:
> >>> Nathan --
> >>>
> >>> Ralph and I talked about this and decided not to bring it over to the
> >>> 1.0 branch -- the fix uses new functionality that exists on the trunk
> >>> and not in the 1.0 branch. The fix could be re-crafted to use
> >>> existing functionality on the 1.0 branch (we're really trying to only
> >>> put bug fixes on the 1.0 branch -- not any new functionality) -- but
> >>> we didn't know if you cared. :-)
> >>>
> >>> Do you mind if this fix stays on the trunk, or do you need it in the
> >>> v1.0 branch?
> >>>
> >>>
> >>>
> >>> On Feb 8, 2006, at 4:36 PM, Nathan DeBardeleben wrote:
> >>>
> >>>
> >>>> Thanks Ralph.
> >>>>
> >>>> -- Nathan
> >>>>
> >>>>
> >>>>
> >>>> Ralph H. Castain wrote:
> >>>>
> >>>>> Nathan
> >>>>>
> >>>>> This should now be fixed on the trunk. Once it is checked out more
> >>>>> thoroughly, I'll ask that it be moved to the 1.0 branch. For now, you
> >>>>> might want to check out the trunk and verify it meets your needs.
> >>>>>
> >>>>> Ralph
> >>>>>
> >>>>> At 03:05 PM 2/1/2006, you wrote:
> >>>>>
> >>>>>
> >>>>>> This was happening on Alpha 1 as well but I upgraded today to Alpha 4
> >>>>>> to see if it's gone away - it has not.
> >>>>>>
> >>>>>> I register a callback on a spawn() inside ORTE. That callback includes
> >>>>>> the current state and should be called as the job goes through those
> >>>>>> states.
> >>>>>>
> >>>>>> I am now noticing that jobs never go through the INIT state. They may
> >>>>>> also not go through others but definitely not ORTE_PROC_STATE_INIT.
> >>>>>>
> >>>>>> I was registering the IOForwarding callback during the INIT phase so,
> >>>>>> consequently, I now do not have IOF. There are other side effects,
> >>>>>> such as jobs I start that, I think, are perpetually in the
> >>>>>> 'starting' state and then, suddenly, they're done.
> >>>>>>
> >>>>>> Can someone look into / comment on this please?
> >>>>>>
> >>>>>> Thanks.
> >>>>>>
> >>>>>> --
> >>>>>> -- Nathan
> >>>>>>
> >>>>>> _______________________________________________
> >>>>>> devel mailing list
> >>>>>> devel_at_[hidden]
> >>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel