Open MPI Development Mailing List Archives

From: Greg Watson (gwatson_at_[hidden])
Date: 2007-01-29 12:20:27


On Jan 29, 2007, at 6:47 AM, Ralph H Castain wrote:

>
>
>
> On 1/27/07 9:37 AM, "Greg Watson" <gwatson_at_[hidden]> wrote:
>
>> There are two more interfaces that have changed:
>>
>> 1. orte_rds.query() now takes a job id, whereas in 1.2b1 it didn't
>> take any arguments. I seem to remember that I call this to kick orted
>> into action, but I'm not sure of the implications of not calling it.
>> In any case, I don't have a job id when I call it, so what do I pass
>> to get the old behavior?
>
> For now, you can just use ORTE_JOBID_INVALID (defined in
> orte/mca/ns/ns_types.h).
>
> However, your question raises a flag. You should be calling
> orte_rmgr.setup_job before you call the RDS, and that function returns
> the jobid for your job. Failing to call setup_job first may cause other
> parts of the code base to fail, as they expect certain data to be set
> up in the registry by setup_job.
>
> If you do call setup_job first, then just pass the returned jobid along
> to rds.query.

No, we have always called query() first, just after orte_init().
Since query() has never required a job id before, this used to work.
I think the call was required to kick the SOH into action, but I'm
not sure if it was needed for any other purpose.

>
>>
>> 2. orte_pls.terminate_job() now takes a list of attributes in
>> addition to a job id. What are the attributes for, and what happens
>> if I pass a NULL here? Do I need to create an empty attribute list?
>>
>
> You can always pass a NULL to any function looking for attributes; the
> system knows how to handle that situation.
>
> What you should pass here depends upon what you are trying to do. If you
> just want to terminate a specific job, then you can just pass a NULL.
> However, if you want to terminate the specified job AND any "children"
> that were dynamically spawned by that job, then you need to pass the
> ORTE_NS_INCLUDE_DESCENDANTS attribute. Something like the following code
> snippet (pulled from orterun) would work:
>
> #include "opal/class/opal_list.h"
>
> #include "orte/mca/pls/pls.h"
> #include "orte/mca/rmgr/rmgr.h"
> #include "orte/mca/ns/ns_types.h"
> #include "orte/runtime/params.h"
>
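> opal_list_t attrs;
> opal_list_item_t *item;
> int ret;  /* jobid is assumed to be set by the caller */
>
> /* Build an attribute list asking for descendants to be included. */
> OBJ_CONSTRUCT(&attrs, opal_list_t);
> orte_rmgr.add_attribute(&attrs, ORTE_NS_INCLUDE_DESCENDANTS,
>                         ORTE_UNDEF, NULL, ORTE_RMGR_ATTR_OVERRIDE);
> ret = orte_pls.terminate_job(jobid, &orte_abort_timeout, &attrs);
>
> /* Release every list item, then destruct the list itself. */
> while (NULL != (item = opal_list_remove_first(&attrs))) {
>     OBJ_RELEASE(item);
> }
> OBJ_DESTRUCT(&attrs);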
>
>
> Please note that the orte_pls.terminate_job API in 1.2 will undergo a
> change in the next few days (it is already changed in the trunk). The
> change, included in the code snippet above, adds a timeout capability to
> have the function "give up" if the job doesn't terminate within the
> specified time. The parameter given above references the orte-wide
> default value (adjustable via MCA param), but you can give it anything
> you like. A NULL for the timeout param means don't time out, so we'll
> try until you order us to quit.
>
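
To check my reading of the two NULL conventions described above, here is a minimal stand-alone sketch. It is not ORTE code: attr_node_t, terminate_plan_t, plan_terminate() and the "include-descendants" key are all hypothetical names invented for illustration, and the timeout is assumed to be a plain count of seconds. It only models the rules as stated: a NULL attribute list behaves like an empty one, and a NULL timeout means "never give up".

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for illustration only; these are NOT ORTE types. */
#define DEMO_INCLUDE_DESCENDANTS "include-descendants"

typedef struct attr_node {
    const char *key;
    struct attr_node *next;
} attr_node_t;

typedef struct {
    bool include_descendants; /* true when the descendants key was passed */
    bool wait_forever;        /* true when the caller passed a NULL timeout */
    long timeout_sec;         /* meaningful only when wait_forever is false */
} terminate_plan_t;

/* Walk the list looking for key. A NULL list is treated as empty,
 * so the loop body simply never runs. */
static bool attr_present(const attr_node_t *attrs, const char *key)
{
    for (const attr_node_t *n = attrs; n != NULL; n = n->next) {
        if (strcmp(n->key, key) == 0) {
            return true;
        }
    }
    return false;
}

/* Decide how a terminate request would be handled under the rules
 * quoted above (assumed semantics, not the real implementation). */
terminate_plan_t plan_terminate(const long *timeout_sec,
                                const attr_node_t *attrs)
{
    terminate_plan_t plan;

    plan.include_descendants = attr_present(attrs, DEMO_INCLUDE_DESCENDANTS);
    plan.wait_forever = (timeout_sec == NULL);
    plan.timeout_sec = plan.wait_forever ? 0 : *timeout_sec;
    return plan;
}
```

Under these assumptions, plan_terminate(NULL, NULL) describes "terminate just this job and keep trying until told to quit", while passing a timeout and a list containing the descendants key describes the orterun-style call in the quoted snippet.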

Is this going to be in "1.2b4", or some other version? The previous
API changes mean that PTP will no longer work with pre-1.2b3
versions. It sounds like this is going to cause a similar issue.

Are there likely to be further API changes before the release
version? We are trying to release PTP, but I think this is impossible
until your APIs stabilize.

What about orte_ns.free_name()?

Thanks,

Greg