The ft_event() function that you mentioned is part of the larger fault
tolerance (FT) infrastructure in Open MPI. Make sure to enable it before
using it: if it is not enabled, many of the ft_event functions default
to NULL. Adding '--with-ft=cr' to your ./configure line enables the FT
infrastructure.

As Jeff mentioned, you might be able to use the Checkpoint/Restart
Coordination Protocol (CRCP) framework [located in ompi/mca/crcp] to
halt messaging. It works as a wrapper around the PML, so you operate on
whole MPI messages rather than on fragments, as you would in the BTLs
below it. It might be another option to consider.

On Jan 11, 2010, at 5:08 PM, Jeff Squyres wrote:
> Additionally, I believe that the FT system already does something
> like what you describe (although perhaps not exactly the same thing)
> -- there is a phase where the FT system pauses and quiesces all BTLs.
> Did you look at that part of the code, perchance, and see if it
> meets your needs?
> On Jan 11, 2010, at 3:53 PM, Christoph Konersmann wrote:
>> Thanks a lot for your help! I will give it a try.
>> Ralph Castain wrote:
>>> You've got this a tad wrong, but that's okay - let me try to
>>> clarify a couple of things that may help.
>>> First, you don't want to add this as a separate orted command. As
>>> you noted, orte has no direct way to tell the OMPI layer to do
>>> anything. Instead, you want to pass a message to the process that
>>> is received in the OMPI layer. That is easy to do.
>>> 1. add a message tag in ompi/mca/dpm/dpm.h - perhaps something
>>> like OMPI_RML_TAG_BTL_CTL
>>> 2. in the btl, add a call to orte_rml.recv_nb() that identifies
>>> the above tag and specifies a callback function to use when such a
>>> message arrives
>>> 3. in that callback function, toggle your "paused" flag - or you
>>> can unpack the buffer to get a flag telling you what value to set.
>>> Your choice.
>>> Now, when you want to pause the BTL, you do an
>>> orte_grpcomm.xcast() to the above message tag. ORTE will deliver
>>> that message to every process, which will then have its callback
>>> function called.
>> Paderborn Center for Parallel Computing - PC2
>> University of Paderborn - Germany
>> Christoph Konersmann <c_k_at_[hidden]>
>> devel mailing list
> Jeff Squyres